
  • - Hello?

  • Okay, it's after 12, so I want to get started.

  • So today, lecture eight, we're going to talk about

  • deep learning software.

  • This is a super exciting topic because it changes

  • a lot every year.

  • But also means it's a lot of work to give this lecture

  • 'cause it changes a lot every year.

  • But as usual, a couple administrative notes

  • before we dive into the material.

  • So as a reminder the project proposals for your

  • course projects were due on Tuesday.

  • So hopefully you all turned that in,

  • and hopefully you all have a somewhat good idea

  • of what kind of projects you want to work on

  • for the class.

  • So we're in the process of assigning TA's to projects

  • based on what the project area is

  • and the expertise of the TA's.

  • So we'll have some more information about that

  • in the next couple days I think.

  • We're also in the process of grading assignment one,

  • so stay tuned and we'll get those grades back to you

  • as soon as we can.

  • Another reminder is that assignment two has been out

  • for a while.

  • That's going to be due next week, a week from today, Thursday.

  • And again, when working on assignment two,

  • remember to stop your Google Cloud instances

  • when you're not working to try to preserve your credits.

  • And another bit of confusion, I just wanted to

  • re-emphasize is that for assignment two you really

  • only need to use GPU instances for the last notebook.

  • For all of the other notebooks it's just Python

  • and Numpy so you don't need any GPUs for those questions.

  • So again, conserve your credits,

  • only use GPUs when you need them.

  • And the final reminder is that the midterm is coming up.

  • It's kind of hard to believe we're there already,

  • but the midterm will be in class on Tuesday, May 9th.

  • So the midterm will be more theoretical.

  • It'll be sort of pen and paper working through different

  • kinds of, slightly more theoretical questions

  • to check your understanding of the material that we've

  • covered so far.

  • And I think we'll probably post at least a short sort of

  • sample of the types of questions to expect.

  • Question?

  • [student's words obscured due to lack of microphone]

  • Oh yeah, question is whether it's open-book,

  • so we're going to say closed note, closed book.

  • Yeah, so that's what we've done in the past,

  • just closed note, closed book.

  • We just want to check that you understand

  • the intuition behind most of the stuff we've presented.

  • So, a quick recap as a reminder of what we were talking

  • about last time.

  • Last time we talked about fancier optimization algorithms

  • for deep learning models including SGD Momentum,

  • Nesterov, RMSProp and Adam.

  • And we saw that these relatively small tweaks

  • on top of vanilla SGD, are relatively easy to implement

  • but can make your networks converge a bit faster.

  • We also talked about regularization,

  • especially dropout.

  • So remember dropout, you're kind of randomly setting

  • parts of the network to zero during the forward pass,

  • and then you kind of marginalize out over that noise

  • at test time.

  • And we saw that this was kind of a general pattern

  • across many different types of regularization

  • in deep learning, where you might add some kind

  • of noise during training, but then marginalize out

  • that noise at test time so it's not stochastic

  • at test time.

  • We also talked about transfer learning where you

  • can maybe download big networks that were pre-trained

  • on some dataset and then fine tune them for your

  • own problem.

  • And this is one way that you can attack a lot of problems

  • in deep learning, even if you don't have a huge

  • dataset of your own.

  • So today we're going to shift gears a little bit

  • and talk about some of the nuts and bolts

  • about writing software and how the hardware works.

  • And a little bit, diving into a lot of details

  • about what the software looks like that you actually

  • use to train these things in practice.

  • So we'll talk a little bit about CPUs and GPUs

  • and then we'll talk about several of the major

  • deep learning frameworks that are out there in use

  • these days.

  • So first, we've sort of mentioned this off hand

  • a bunch of different times,

  • that computers have CPUs, computers have GPUs.

  • Deep learning uses GPUs, but we weren't really

  • too explicit up to this point about what exactly

  • these things are and why one might be better

  • than another for different tasks.

  • So, who's built a computer before?

  • Just kind of show of hands.

  • So, maybe about a third of you, half of you,

  • somewhere around that ballpark.

  • So this is a shot of my computer at home

  • that I built.

  • And you can see that there's a lot of stuff going on

  • inside the computer, maybe, hopefully you know

  • what most of these parts are.

  • And the CPU is the Central Processing Unit.

  • That's this little chip hidden under this cooling fan

  • right here near the top of the case.

  • And the CPU is actually a relatively small piece.

  • It's a relatively small thing inside the case.

  • It's not taking up a lot of space.

  • And the GPUs are these two big monster things

  • that are taking up a gigantic amount of space

  • in the case.

  • They have their own cooling,

  • they're taking a lot of power.

  • They're quite large.

  • So, just in terms of how much power they're using,

  • in terms of how big they are, the GPUs are kind of

  • physically imposing and taking up a lot of space

  • in the case.

  • So the question is what are these things

  • and why are they so important for deep learning?

  • Well, the GPU is called a graphics card,

  • or Graphics Processing Unit.

  • And these were really developed, originally for rendering

  • computer graphics, and especially around games

  • and that sort of thing.

  • So another show of hands, who plays video games at home

  • sometimes, from time to time on their computer?

  • Yeah, so again, maybe about half, good fraction.

  • So for those of you who've played video games before

  • and who've built your own computers,

  • you probably have your own opinions on this debate.

  • [laughs]

  • So this is one of those big debates in computer science.

  • You know, there's like Intel versus AMD,

  • NVIDIA versus AMD for graphics cards.

  • It's up there with Vim versus Emacs for text editor.

  • And pretty much any gamer has their own opinions

  • on which of these two sides they prefer

  • for their own cards.

  • And in deep learning we kind of have mostly picked

  • one side of this fight, and that's NVIDIA.

  • So if you guys have AMD cards,

  • you might be in a little bit more trouble if you want

  • to use those for deep learning.

  • And really, NVIDIA's been pushing a lot for deep learning

  • in the last several years.

  • It's been kind of a large focus of some of their strategy.

  • And they've put a lot of effort into engineering

  • sort of good solutions to make their hardware

  • better suited for deep learning.

  • So most people in deep learning when we talk about GPUs,

  • we're pretty much exclusively talking about NVIDIA GPUs.

  • Maybe in the future this'll change a little bit,

  • and there might be new players coming up,

  • but at least for now NVIDIA is pretty dominant.

  • So to give you an idea of like what is the difference

  • between a CPU and a GPU, I've kind of made a little

  • spread sheet here.

  • On the top we have two of the kind of top end Intel

  • consumer CPUs, and on the bottom we have two of

  • NVIDIA's sort of current top end consumer GPUs.

  • And there's a couple general trends to notice here.

  • Both GPUs and CPUs are kind of a general purpose

  • computing machine where they can execute programs

  • and do sort of arbitrary instructions,

  • but they're qualitatively pretty different.

  • So CPUs tend to have just a few cores,

  • for consumer desktop CPUs these days,

  • they might have something like four or six

  • or maybe up to 10 cores.

  • With hyperthreading technology that means they can run,

  • the hardware can physically run, like maybe eight

  • or up to 20 threads concurrently.

  • So the CPU can maybe do 20 things in parallel at once.

  • So that's just not a gigantic number,

  • but those threads for a CPU are pretty powerful.

  • They can actually do a lot of things,

  • they're very fast.

  • Every CPU instruction can actually do quite a lot

  • of stuff.

  • And they can all work pretty independently.

  • For GPUs it's a little bit different.

  • So for GPUs we see that these sort of common top end

  • consumer GPUs have thousands of cores.

  • So the NVIDIA Titan XP which is the current

  • top of the line consumer GPU has 3840 cores.

  • So that's a crazy number.

  • That's like way more than the 10 cores that you'll get

  • for a similarly priced CPU.

  • The downside of a GPU is that each of those cores,

  • one, it runs at a much slower clock speed.

  • And two they really can't do quite as much.

  • You can't really compare CPU cores and GPU cores

  • apples to apples.

  • The GPU cores can't really operate very independently.

  • They all kind of need to work together

  • and sort of parallelize one task across many cores

  • rather than each core totally doing its own thing.

  • So you can't really compare these numbers directly.

  • But it should give you the sense that due

  • to the large number of cores GPUs can sort of,

  • are really good for parallel things where you

  • need to do a lot of things all at the same time,

  • but those things are all pretty much the same flavor.

  • Another thing to point out between CPUs and GPUs

  • is this idea of memory.

  • Right, so CPUs have some cache on the CPU,

  • but that's relatively small and the majority

  • of the memory for your CPU is pulling from your

  • system memory, the RAM, which will maybe be like

  • eight, 12, 16, 32 gigabytes of RAM on a typical

  • consumer desktop these days.

  • Whereas GPUs actually have their own RAM built

  • into the chip.

  • There's a pretty large bottleneck communicating

  • between the RAM in your system and the GPU,

  • so the GPUs typically have their own

  • relatively large block of memory within the card itself.

  • And for the Titan XP, which again is maybe the current

  • top of the line consumer card,

  • this thing has 12 gigabytes of memory local to the GPU.

  • GPUs also have their own caching system

  • where there are sort of multiple hierarchies of caching

  • between the 12 gigabytes of GPU memory

  • and the actual GPU cores.

  • And that's somewhat similar to the caching hierarchy

  • that you might see in a CPU.

  • So, CPUs are kind of good for general purpose processing.

  • They can do a lot of different things.

  • And GPUs are maybe more specialized for these highly

  • parallelizable algorithms.

  • So the prototypical algorithm of something that works

  • really really well and is like perfectly suited

  • to a GPU is matrix multiplication.

  • So remember in matrix multiplication on the left

  • we've got like a matrix composed of a bunch of rows.

  • We multiply that on the right by another matrix composed

  • of a bunch of columns and then this produces

  • another, a final matrix where each element in the

  • output matrix is a dot product between one of the rows

  • and one of the columns of the two input matrices.

  • And these dot products are all independent.

  • Like you could imagine, for this output matrix

  • you could split it up completely

  • and have each of those different elements

  • of the output matrix all being computed in parallel

  • and they all sort of are running the same computation

  • which is taking a dot product of these two vectors.

  • But exactly where they're reading that data from

  • is from different places in the two input matrices.

  • So you could imagine that for a GPU you can just

  • like blast this out and have all of these elements

  • of the output matrix all computed in parallel,

  • and that could make this thing compute super fast

  • on GPU.

  • So that's kind of the prototypical type of problem

  • that like where a GPU is really well suited,

  • where a CPU might have to go in and step through

  • sequentially and compute each of these elements

  • one by one.

  • That picture is a little bit of a caricature because

  • CPUs these days have multiple cores,

  • they can do vectorized instructions as well,

  • but still, for these like massively parallel problems

  • GPUs tend to have much better throughput.

  • Especially when these matrices get really really big.
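
To make that structure concrete, here is a minimal NumPy sketch (not from the lecture) showing that every element of a matrix product is an independent dot product of one row with one column; that independence is exactly what a GPU exploits by handing each output element to its own thread.

    import numpy as np

    A = np.random.randn(4, 3)   # matrix of rows
    B = np.random.randn(3, 5)   # matrix of columns

    # Each output element is an independent dot product; a GPU can assign
    # every (i, j) pair to its own thread and compute them all in parallel.
    C = np.empty((4, 5))
    for i in range(4):
        for j in range(5):
            C[i, j] = np.dot(A[i, :], B[:, j])

    assert np.allclose(C, A @ B)   # same result as the library matmul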

  • And by the way, convolution is kind of the same

  • kind of story.

  • Where you know in convolution we have this input tensor,

  • we have this weight tensor and then every point in the

  • output tensor after a convolution is again some inner

  • product between some part of the weights

  • and some part of the input.

  • And you can imagine that a GPU could really parallelize

  • this computation, split it all up across the many cores

  • and compute it very quickly.

  • So that's kind of the general flavor of the types

  • of problems where GPUs give you a huge speed advantage

  • over CPUs.

  • So you can actually write programs that run directly

  • on GPUs.

  • So NVIDIA has this CUDA abstraction that lets you write

  • code that kind of looks like C,

  • but executes directly on the GPUs.

  • But CUDA code is really really tricky.

  • It's actually really tough to write CUDA code that's

  • performant and actually squeezes all the juice out

  • of these GPUs.

  • You have to be very careful managing the memory hierarchy

  • and making sure you don't have cache misses

  • and branch mispredictions and all that sort of stuff.

  • So it's actually really really hard to write performant

  • CUDA code on your own.

  • So as a result NVIDIA has released a lot of libraries

  • that implement common computational primitives

  • that are very very highly optimized for GPUs.

  • So for example NVIDIA has a cuBLAS library that implements

  • different kinds of matrix multiplications

  • and different matrix operations that are super optimized,

  • run really well on GPU, get very close to sort of

  • theoretical peak hardware utilization.

  • Similarly they have a cuDNN library which implements

  • things like convolution, forward and backward passes,

  • batch normalization, recurrent networks,

  • all these kinds of computational primitives

  • that we need in deep learning.

  • NVIDIA has gone in there and released their own binaries

  • that compute these primitives very efficiently

  • on NVIDIA hardware.

  • So in practice, you tend not to end up writing your own

  • CUDA code for deep learning.

  • You typically are just mostly calling into existing

  • code that other people have written.

  • Much of which is the stuff which has been heavily

  • optimized by NVIDIA already.

  • There's another sort of language called OpenCL

  • which is a bit more general.

  • Runs on more than just NVIDIA GPUs,

  • can run on AMD hardware, can run on CPUs,

  • but OpenCL, nobody's really spent a really large amount

  • of effort and energy trying to get optimized deep learning

  • primitives for OpenCL, so it tends to be a lot less

  • performant than the super optimized versions in CUDA.

  • So maybe in the future we might see a bit of a more open

  • standard and we might see this across many different

  • more types of platforms, but at least for now,

  • NVIDIA's kind of the main game in town for deep learning.

  • So you can check, there's a lot of different resources

  • for learning about how you can do GPU programming yourself.

  • It's kind of fun.

  • It's sort of a different paradigm of writing code

  • because it's this massively parallel architecture,

  • but that's a bit beyond the scope of this course.

  • And again, you don't really need to write your own

  • CUDA code much in practice for deep learning.

  • And in fact, I've never written my own CUDA code

  • for any research project, so,

  • but it is kind of useful to know like how it works

  • and what are the basic ideas even if you're not

  • writing it yourself.

  • So if you want to look at kind of CPU GPU performance

  • in practice, I did some benchmarks last summer

  • comparing a decent Intel CPU

  • against a bunch of different GPUs that were sort

  • of near top of the line at that time.

  • And these were my own benchmarks that you can find

  • more details on GitHub, but my findings were that

  • for things like VGG 16 and 19, ResNets, various ResNets,

  • then you typically see something like a 65 to 75 times

  • speed up when running the exact same computation

  • on a top of the line GPU, in this case a Pascal Titan X,

  • versus a top of the line, well, not quite top of the line

  • CPU, which in this case was an Intel E5 processor.

  • Although, I'd like to make one sort of caveat here

  • is that you always need to be super careful

  • whenever you're reading any kind of benchmarks

  • about deep learning, because it's super easy to be

  • unfair between different things.

  • And you kind of need to know a lot of the details about

  • what exactly is being benchmarked in order to know

  • whether or not the comparison is fair.

  • So in this case I'll come right out and tell you

  • that probably this comparison is a little bit unfair

  • to CPU because I didn't spend a lot of effort

  • trying to squeeze the maximal performance

  • out of CPUs.

  • I probably could have tuned the BLAS libraries better

  • for the CPU performance.

  • And I probably could have gotten these numbers

  • a bit better.

  • This was sort of out of the box performance

  • between just installing Torch, running it on a CPU,

  • just installing Torch running it on a GPU.

  • So this is kind of out of the box performance,

  • but it's not really like peak, possible, theoretical

  • throughput on the CPU.

  • But that being said, I think there are still pretty

  • substantial speed ups to be had here.

  • Another kind of interesting outcome from this benchmarking

  • was comparing these optimized cuDNN libraries

  • from NVIDIA for convolution and whatnot versus

  • sort of more naive CUDA that had been hand written

  • out in the open source community.

  • And you can see that if you compare the same networks

  • on the same hardware with the same deep learning

  • framework and the only difference is swapping out

  • these cuDNN versus sort of hand written, less optimized

  • CUDA you can see something like nearly a three X speed up

  • across the board when you switch from the relatively

  • simple CUDA to these like super optimized cuDNN

  • implementations.

  • So in general, whenever you're writing code on GPU,

  • you should probably almost always like just make sure

  • you're using cuDNN because you're leaving probably

  • a three X performance boost on the table if you're

  • not calling into cuDNN for your stuff.

  • So another problem that comes up in practice,

  • when you're training these things is that

  • you know, your model is maybe sitting on the GPU,

  • the weights of the model are in that 12 gigabytes

  • of local storage on the GPU, but your big dataset

  • is sitting over on the right on a hard drive

  • or an SSD or something like that.

  • So if you're not careful you can actually bottleneck

  • your training by just trying to read the data

  • off the disk.

  • 'Cause the GPU is super fast, it can compute

  • forward and backward quite fast, but if you're reading

  • sequentially off a spinning disk, you can actually

  • bottleneck your training quite a bit,

  • and that can be really bad and slow you down.

  • So some solutions here are that like you know

  • if your dataset's really small, sometimes you might just

  • read the whole dataset into RAM.

  • Or even if your dataset isn't so small,

  • but you have a giant server with a ton of RAM,

  • you might do that anyway.

  • You can also make sure you're using an SSD instead

  • of a hard drive, that can help a lot with read throughput.

  • Another common strategy is to use multiple threads

  • on the CPU that are pre-fetching data

  • off disk, buffering it in RAM, so that

  • you can continue feeding that buffered data down

  • to the GPU with good performance.

  • This is a little bit painful to set up,

  • but again like, these GPU's are so fast that

  • if you're not really careful with trying to feed

  • them data as quickly as possible,

  • just reading the data can sometimes bottleneck

  • the whole training process.

  • So that's something to be aware of.

  • So that's kind of the brief introduction to like

  • sort of GPU CPU hardware in practice when it comes

  • to deep learning.

  • And then I wanted to switch gears a little bit

  • and talk about the software side of things.

  • The various deep learning frameworks that people are using

  • in practice.

  • But I guess before I move on,

  • is there any sort of questions about CPU GPU?

  • Yeah, question?

  • [student's words obscured due to lack of microphone]

  • Yeah, so the question is what can you sort of,

  • what can you do mechanically when you're coding

  • to avoid these problems?

  • Probably the biggest thing you can do in software

  • is set up sort of pre-fetching on the CPU.

  • Like, sort of a naive thing

  • would be you have this sequential process where you

  • first read data off disk, wait for the data,

  • wait for the minibatch to be read,

  • then feed the minibatch to the GPU,

  • then go forward and backward on the GPU,

  • then read another minibatch and sort of do this all

  • in sequence.

  • Instead you might have CPU threads running in the

  • background that are fetching data off the disk,

  • so that you can sort of interleave all of these things.

  • Like the GPU is computing,

  • the CPU background threads are feeding data off disk

  • and your main thread is just

  • doing a bit of synchronization between these things

  • so they're all happening in parallel.

  • And thankfully if you're using some of these deep learning

  • frameworks that we're about to talk about,

  • then some of this work has already been done for you

  • 'cause it's a little bit painful.
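
As a rough illustration of that prefetching pattern, here is a minimal sketch; the loader function, buffer size, and data shapes are made up for illustration. A background CPU thread fills a bounded queue with minibatches while the main thread consumes them, so disk reads overlap with GPU compute. Frameworks do a more robust version of this for you (for example PyTorch's torch.utils.data.DataLoader with num_workers > 0).

    import queue
    import threading

    import numpy as np

    def loader(buf, num_batches, batch_size=64, dim=3072):
        # Stand-in for reading and decoding minibatches off disk.
        for _ in range(num_batches):
            x = np.random.randn(batch_size, dim).astype(np.float32)
            y = np.random.randint(10, size=batch_size)
            buf.put((x, y))      # blocks if the buffer is already full
        buf.put(None)            # signal that the data is exhausted

    buf = queue.Queue(maxsize=8)   # bounded buffer of prefetched minibatches
    threading.Thread(target=loader, args=(buf, 100), daemon=True).start()

    while True:
        item = buf.get()         # usually ready immediately thanks to prefetching
        if item is None:
            break
        x, y = item
        # ... copy (x, y) to the GPU and run the forward/backward pass here ...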

  • So the landscape of deep learning frameworks

  • is super fast moving.

  • So last year when I gave this lecture I talked mostly

  • about Caffe, Torch, Theano and TensorFlow.

  • And when I last gave this talk, again more than a year ago,

  • TensorFlow was relatively new.

  • It had not seen super widespread adoption yet at that time.

  • But now I think in the last year TensorFlow

  • has gotten much more popular.

  • It's probably the main framework of choice for many people.

  • So that's a big change.

  • We've also seen a ton of new frameworks

  • sort of popping up like mushrooms in the last year.

  • So in particular Caffe2 and PyTorch are new frameworks

  • from Facebook that I think are pretty interesting.

  • There's also a ton of other frameworks.

  • Baidu has Paddle, Microsoft has CNTK,

  • Amazon is mostly using MXNet and there's a ton

  • of other frameworks as well that I'm less familiar with

  • and really don't have time to get into.

  • But one interesting thing to point out from this picture

  • is that kind of the first generation of deep learning

  • frameworks that really saw wide adoption

  • were built in academia.

  • So Caffe was from Berkeley, Torch was developed

  • originally at NYU and also in collaboration with Facebook.

  • And Theano was mostly built at the University of Montreal.

  • But these kind of next generation deep learning

  • frameworks all originated in industry.

  • So Caffe2 is from Facebook, PyTorch is from Facebook.

  • TensorFlow is from Google.

  • So it's kind of an interesting shift that we've seen

  • in the landscape over the last couple of years

  • is that these ideas have really moved a lot

  • from academia into industry.

  • And now industry is kind of giving us these big powerful

  • nice frameworks to work with.

  • So today I wanted to mostly talk about PyTorch

  • and TensorFlow 'cause I personally think that those

  • are probably the ones you should be focusing on for

  • a lot of research type problems these days.

  • I'll also talk a bit about Caffe and Caffe2.

  • But probably a little bit less emphasis on those.

  • And before we move any farther, I thought I should make

  • my own biases a little bit more explicit.

  • So I have mostly, I've worked with Torch mostly

  • for the last several years.

  • And I've used it quite a lot, I like it a lot.

  • And then in the last year I've mostly switched to PyTorch

  • as my main research framework.

  • So I have a little bit less experience with some

  • of these others, especially TensorFlow,

  • but I'll still try to do my best to give you a fair

  • picture and a decent overview of these things.

  • So, remember that in the last several lectures

  • we've hammered this idea of computational graphs

  • sort of over and over.

  • That whenever you're doing deep learning,

  • you want to think about building some computational graph

  • that computes whatever function that you want to compute.

  • So in the case of a linear classifier you'll combine

  • your data X and your weights W with a matrix multiply.

  • You'll use some kind of hinge loss to

  • compute your loss.

  • You'll have some regularization term

  • and you imagine stitching together all these different

  • operations into some graph structure.

  • Remember that these graph structures can get pretty

  • complex in the case of a big neural net,

  • now there's many different layers,

  • many different activations.

  • Many different weights spread all around in a pretty

  • complex graph.

  • And as you move to things like neural turing machines

  • then you can get these really crazy computational graphs

  • that you can't even really draw because they're

  • so big and messy.

  • So the point of deep learning frameworks is really,

  • there's really kind of three main reasons why you might

  • want to use one of these deep learning frameworks

  • rather than just writing your own code.

  • So the first would be that these frameworks enable

  • you to easily build and work with these big hairy

  • computational graphs without kind of worrying

  • about a lot of those bookkeeping details yourself.

  • Another major idea is that,

  • whenever we're working in deep learning

  • we always need to compute gradients.

  • We're always computing some loss,

  • we're always computing the gradient of the loss

  • with respect to our weights.

  • And we'd like the gradients to be computed automatically;

  • you don't want to have to write that code yourself.

  • You want that framework to handle all these back propagation

  • details for you so you can just think about

  • writing down the forward pass of your network

  • and have the backward pass sort of come out for free

  • without any additional work.

  • And finally you want all this stuff to run efficiently

  • on GPUs so you don't have to worry too much about these

  • low level hardware details about cuBLAS and cuDNN

  • and CUDA and moving data between the CPU and GPU memory.

  • You kind of want all those messy details to be taken care of

  • for you.

  • So those are kind of some of the major reasons

  • why you might choose to use frameworks rather than

  • writing your own stuff from scratch.

  • So as kind of a concrete example of a computational graph

  • we can maybe write down this super simple thing.

  • Where we have three inputs, X, Y, and Z.

  • We're going to combine X and Y to produce A.

  • Then we're going to combine A and Z to produce B

  • and then finally we're going to do some maybe summing out

  • operation on B to give some scalar final result C.

  • So you've probably written enough Numpy code at this point

  • to realize that it's super easy to write down,

  • to implement this computational graph,

  • or rather to implement this bit of computation in Numpy,

  • right?

  • You can just kind of write down in Numpy that you want to

  • generate some random data, you want to multiply two things,

  • you want to add two things, you want to sum out a couple things.

  • And it's really easy to do this in Numpy.

  • But then the question is like suppose that we want

  • to compute the gradient of C with respect to X, Y, and Z.

  • So, if you're working in Numpy, you kind of need to write out

  • this backward pass yourself.

  • And you've gotten a lot of practice with this on the

  • homeworks, but it can be kind of a pain

  • and a little bit annoying and messy once you get to

  • really big complicated things.

  • The other problem with Numpy is that it doesn't run

  • on the GPU.

  • So Numpy is definitely CPU only.

  • And you're never going to be able to experience

  • or take advantage of these GPU accelerated speedups

  • if you're stuck working in Numpy.

  • And it's, again, it's a pain to have to compute

  • your own gradients in all these situations.
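
For reference, a minimal NumPy version of that graph might look like the sketch below: the forward pass is a few lines, but the backward pass has to be written out by hand, and none of it runs on the GPU.

    import numpy as np

    np.random.seed(0)
    N, D = 3, 4
    x, y, z = np.random.randn(N, D), np.random.randn(N, D), np.random.randn(N, D)

    # Forward pass: a = x * y, b = a + z, c = sum(b).
    a = x * y
    b = a + z
    c = np.sum(b)

    # Backward pass, written out by hand.
    grad_c = 1.0
    grad_b = grad_c * np.ones((N, D))
    grad_a = grad_b.copy()
    grad_z = grad_b.copy()
    grad_x = grad_a * y
    grad_y = grad_a * x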

  • So, kind of the goal of most deep learning frameworks

  • these days is to let you write code in the forward pass

  • that looks very similar to Numpy,

  • but lets you run it on the GPU

  • and lets you automatically compute gradients.

  • And that's kind of the big picture goal of most of these

  • frameworks.

  • So if you imagine looking at, if we look at an example

  • in TensorFlow of the exact same computational graph,

  • we now see that in this forward pass,

  • you write this code that ends up looking very very similar

  • to the Numpy forward pass where you're kind of doing

  • these multiplication and these addition operations.

  • But now TensorFlow has this magic line that just

  • computes all the gradients for you.

  • So now you don't have go in and write your own backward pass

  • and that's much more convenient.

  • The other nice thing about TensorFlow is you can really

  • just, like with one line you can switch all this computation

  • between CPU and GPU.

  • So here, if you just add this with statement

  • before you're doing this forward pass,

  • you just can explicitly tell the framework,

  • hey I want to run this code on the CPU.

  • But now if we just change that with statement a little bit

  • with just a one character change in this case,

  • changing that C to a G, now the code runs on GPU.

  • And now in this little code snippet,

  • we've solved these two problems.

  • We're running our code on the GPU

  • and we're having the framework compute all the gradients

  • for us, so that's really nice.
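
A minimal sketch of that TensorFlow version, using the TF 1.x-style API the lecture describes, might look like this: the forward pass mirrors the NumPy code, tf.gradients adds the backward pass, and the with tf.device line chooses CPU or GPU.

    import numpy as np
    import tensorflow as tf   # TF 1.x-style API, as in the lecture

    N, D = 3, 4
    with tf.device('/cpu:0'):          # change to '/gpu:0' to run on the GPU
        x = tf.placeholder(tf.float32)
        y = tf.placeholder(tf.float32)
        z = tf.placeholder(tf.float32)

        a = x * y
        b = a + z
        c = tf.reduce_sum(b)

    # The "magic line" that adds the backward pass to the graph.
    grad_x, grad_y, grad_z = tf.gradients(c, [x, y, z])

    with tf.Session() as sess:
        values = {x: np.random.randn(N, D),
                  y: np.random.randn(N, D),
                  z: np.random.randn(N, D)}
        c_val, gx, gy, gz = sess.run([c, grad_x, grad_y, grad_z],
                                     feed_dict=values)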

  • And PyTorch kind of looks almost exactly the same.

  • So again, in PyTorch you kind of write down,

  • you define some variables,

  • you have some forward pass and the forward pass again

  • looks very similar to like, in this case identical

  • to the Numpy code.

  • And then again, you can just use PyTorch to compute

  • gradients, all your gradients with just one line.

  • And now in PyTorch again, it's really easy to switch

  • to GPU, you just need to cast all your stuff to the

  • CUDA data type before you run your computation

  • and now everything runs transparently on the GPU for you.
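
And a minimal PyTorch sketch of the same graph, in the 2017-era Variable API the lecture is describing (current PyTorch lets you set requires_grad=True on plain tensors instead), could look like:

    import torch
    from torch.autograd import Variable   # older API described in the lecture

    N, D = 3, 4
    x = Variable(torch.randn(N, D), requires_grad=True)
    y = Variable(torch.randn(N, D), requires_grad=True)
    z = Variable(torch.randn(N, D), requires_grad=True)
    # For the GPU, build the tensors with torch.randn(N, D).cuda() instead.

    a = x * y
    b = a + z
    c = torch.sum(b)

    c.backward()            # autograd fills in x.grad, y.grad, z.grad
    print(x.grad.data)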

  • So if you kind of just look at these three examples,

  • these three snippets of code side by side,

  • the Numpy, the TensorFlow and the PyTorch

  • you see that the TensorFlow and the PyTorch code

  • in the forward pass looks almost exactly like Numpy

  • which is great 'cause Numpy has a beautiful API,

  • it's really easy to work with.

  • But we can compute gradients automatically

  • and we can run the GPU automatically.

  • So after that kind of introduction,

  • I wanted to dive in and talk in a little bit more

  • detail about kind of what's going on inside this

  • TensorFlow example.

  • So as a running example throughout the rest of the lecture,

  • I'm going to use training a two-layer fully connected

  • ReLU network on random data.

  • And we're going to train this thing with an L2 Euclidean

  • loss on random data.

  • So this is kind of a silly network, it's not really doing

  • anything useful, but it's relatively small, self contained,

  • the code fits on the slide without being too small,

  • and it lets you demonstrate kind of a lot of the useful

  • ideas inside these frameworks.

  • So here on the right, oh, and then another note,

  • I'm kind of assuming that Numpy and TensorFlow

  • have already been imported in all these code snippets.

  • So in TensorFlow you would typically divide your computation

  • into two major stages.

  • First, we're going to write some code that defines

  • our computational graph, and that's this red code

  • up in the top half.

  • And then after you define your graph,

  • you're going to run the graph over and over again

  • and actually feed data into the graph

  • to perform whatever computation you want it to perform.

  • So this is the really, this is kind of the big

  • common pattern in TensorFlow.

  • You'll first have a bunch of code that builds the graph

  • and then you'll go and run the graph and reuse it

  • many many times.

  • So if you kind of dive into the code of building

  • the graph in this case.

  • Up at the top you see that we're defining this X, Y,

  • w1 and w2, and we're creating these tf.placeholder objects.

  • So these are going to be input nodes to the graph.

  • These are going to be sort of entry points to the graph

  • where when we run the graph, we're going to feed in data

  • and put them in through these input slots in our

  • computational graph.

  • So this is not actually like allocating any memory

  • right now.

  • We're just sort of setting up these input slots

  • to the graph.

  • Then we're going to use those input slots which are now

  • kind of like these symbolic variables

  • and we're going to perform different TensorFlow operations

  • on these symbolic variables in order to set up

  • what computation we want to run on those variables.

  • So in this case we're doing a matrix multiplication

  • between X and w1, we're doing some tf.maximum to do a

  • ReLU nonlinearity and then we're doing another

  • matrix multiplication to compute our output predictions.

  • And then we're again using sort of basic tensor

  • operations to compute our Euclidean distance,

  • our L2 loss between our prediction and the target Y.

  • Another thing to point out here is that

  • these lines of code are not actually computing anything.

  • There's no data in the system right now.

  • We're just building up this computational graph data

  • structure telling TensorFlow which operations

  • we want to eventually run once we put in real data.

  • So this is just building the graph,

  • this is not actually doing anything.

  • Then we have this magical line where after we've computed

  • our loss with these symbolic operations,

  • then we can just ask TensorFlow to compute

  • the gradient of the loss with respect to w1 and w2

  • in this one magical, beautiful line.

  • And this avoids you writing all your own backprop code

  • that you had to do in the assignments.

  • But again there's no actual computation happening here.

  • This is just sort of adding extra operations

  • to the computational graph where now the computational

  • graph has these additional operations which will end up

  • computing these gradients for you.

  • So now at this point we've built our computational

  • graph, we have this big graph data structure

  • in memory that knows what operations we want to perform

  • to compute the loss and gradients.

  • And now we enter a TensorFlow session to actually run

  • this graph and feed it with data.

  • So then, once we've entered the session,

  • then we actually need to construct some concrete values

  • that will be fed to the graph.

  • So TensorFlow just expects to receive data from

  • Numpy arrays in most cases.

  • So here we're just creating concrete actual values

  • for X, Y, w1 and w2 using Numpy and then storing these

  • in some dictionary.

  • And now here is where we're actually running the graph.

  • So you can see that we're calling a session.run

  • to actually execute some part of the graph.

  • The first argument tells TensorFlow which parts of the graph

  • we actually want as output.

  • So in this case we need to tell it that we actually

  • want to compute loss, grad w1 and grad w2,

  • and we need to pass in, with this feed_dict parameter,

  • the actual concrete values that will be fed to the graph.

  • And then after, in this one line,

  • it's going and running the graph and then computing

  • those values for loss, grad w1 and grad w2

  • and then returning the actual concrete values

  • for those in Numpy arrays again.

  • So now after you unpack this output in the second line,

  • you get Numpy arrays with the loss

  • and the gradients.

  • So then you can go and do whatever you want

  • with these values.

  • So then, this has only run sort of one forward and backward

  • pass through our graph,

  • and it only takes a couple extra lines if we actually

  • want to train the network.

  • So here, now we're running the graph many times

  • in a loop, so we're doing a for loop

  • and in each iteration of the loop,

  • we're calling session.run asking it to compute

  • the loss and the gradients.

  • And now we're doing a manual gradient descent step

  • using those computed gradients to now update our current

  • values of the weights.

  • So if you actually run this code and plot the losses,

  • then you'll see that the loss goes down

  • and the network is training and this is working pretty well.

  • So this is kind of like a super bare bones example

  • of training a fully connected network in TensorFlow.
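
Putting those pieces together, a bare-bones sketch of that setup (TF 1.x-style API, with the weights fed in as placeholders and updated outside the graph) might look like:

    import numpy as np
    import tensorflow as tf   # TF 1.x-style API

    N, D, H = 64, 1000, 100

    # Build the graph: inputs, weights, forward pass, loss, gradients.
    x  = tf.placeholder(tf.float32, shape=(N, D))
    y  = tf.placeholder(tf.float32, shape=(N, D))
    w1 = tf.placeholder(tf.float32, shape=(D, H))
    w2 = tf.placeholder(tf.float32, shape=(H, D))

    h      = tf.maximum(tf.matmul(x, w1), 0)        # ReLU
    y_pred = tf.matmul(h, w2)
    loss   = tf.reduce_mean(tf.reduce_sum(tf.square(y_pred - y), axis=1))
    grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

    # Run the graph in a loop, feeding data and weights as NumPy arrays.
    with tf.Session() as sess:
        values = {x:  np.random.randn(N, D), y:  np.random.randn(N, D),
                  w1: np.random.randn(D, H), w2: np.random.randn(H, D)}
        learning_rate = 1e-5
        for t in range(50):
            loss_val, g1, g2 = sess.run([loss, grad_w1, grad_w2],
                                        feed_dict=values)
            values[w1] -= learning_rate * g1        # manual gradient descent step
            values[w2] -= learning_rate * g2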

  • But there's a problem here.

  • So here, remember that on the forward pass,

  • every time we execute this graph,

  • we're actually feeding in the weights.

  • We have the weights as Numpy arrays

  • and we're explicitly feeding them into the graph.

  • And now when the graph finishes executing

  • it's going to give us these gradients.

  • And remember the gradients are the same size

  • as the weights.

  • So this means that every time we're running the graph here,

  • we're copying the weights from Numpy arrays into

  • TensorFlow then getting the gradients

  • and then copying the gradients from TensorFlow

  • back out to Numpy arrays.

  • So if you're just running on CPU,

  • this is maybe not a huge deal,

  • but remember we talked about CPU GPU bottleneck

  • and how it's very expensive actually to copy data

  • between CPU memory and GPU memory.

  • So if your network is very large and your weights

  • and gradients were very big,

  • then doing something like this would be super expensive

  • and super slow because we'd be copying all kinds of data

  • back and forth between the CPU and the GPU at every

  • time step.

  • So that's bad, we don't want to do that.

  • We need to fix that.

  • So, obviously TensorFlow has some solution to this.

  • And the idea is that now we want our weights,

  • w1 and w2, rather than being placeholders

  • that we expect to feed into the network

  • on every forward pass, instead we define them as variables.

  • So a variable is a value that lives inside

  • the computational graph and it's going to persist

  • inside the computational graph across different times

  • when you run the same graph.

  • So now instead of declaring these w1 and w2 as placeholders,

  • instead we just construct them as variables.

  • But now since they live inside the graph,

  • we also need to tell TensorFlow how they should be

  • initialized, right?

  • Because in the previous case we were feeding in

  • their values from outside the graph,

  • so we initialized them in Numpy,

  • but now because these things live inside the graph,

  • TensorFlow is responsible for initializing them.

  • So we need to pass in a tf.random_normal operation,

  • which again is not actually initializing them

  • when we run this line, this is just telling TensorFlow

  • how we want them to be initialized.

  • So it's a little bit of confusing misdirection

  • going on here.

  • And now, remember in the previous example

  • we were actually updating the weights outside

  • of the computational graph.

  • We, in the previous example, we were computing the gradients

  • and then using them to update the weights as Numpy arrays

  • and then feeding in the updated weights at the next

  • time step.

  • But now because we want these weights to live inside

  • the graph, this operation of updating the weights

  • needs to also be an operation inside

  • the computational graph.

  • So now we used this assign function which mutates

  • these variables inside the computational graph

  • and now the mutated value will persist across multiple runs

  • of the same graph.

  • So now when we run this graph

  • and when we train the network,

  • now we need to run the graph once with a little bit of

  • special incantation to tell TensorFlow to set up these

  • variables that are going to live inside the graph.

  • And then once we've done that initialization,

  • now we can run the graph over and over again.

  • And here, we're now only feeding in the data and labels

  • X and Y and the weights are living inside the graph.

  • And here we've asked the network to,

  • we've asked TensorFlow to compute the loss for us.

  • And then you might think that this would train the network,

  • but there's actually a bug here.

  • So, if you actually run this code,

  • and you plot the loss, it doesn't train.

  • So that's bad, it's confusing, like what's going on?

  • We wrote this assign code, we ran the thing,

  • like we computed the loss and the gradients

  • and our loss is flat, what's going on?

  • Any ideas?

  • [student's words obscured due to lack of microphone]

  • Yeah so one hypothesis is that maybe we're accidentally

  • re-initializing the w's every time we call the graph.

  • That's a good hypothesis, that's actually not the problem

  • in this case.

  • [student's words obscured due to lack of microphone]

  • Yeah, so the answer is that we actually need to explicitly

  • tell TensorFlow that we want to run these new w1

  • and new w2 operations.

  • So we've built up this big computational graph data

  • structure in memory and now when we call run,

  • we only told TensorFlow that we wanted to compute loss.

  • And if you look at the dependencies among these different

  • operations inside the graph,

  • you see that in order to compute loss

  • we don't actually need to perform this update operation.

  • So TensorFlow is smart and it only computes the parts

  • of the graph that are necessary for computing the output

  • that you asked it to compute.

  • So that's kind of a nice thing because it means it's only

  • doing as much work as it needs to,

  • but in situations like this it can be a little bit confusing

  • and lead to behavior that you didn't expect.

  • So the solution in this case is that we actually need to

  • explicitly tell TensorFlow to perform those

  • update operations.

  • So one thing we could do, which is what was suggested

  • is we could add new w1 and new w2 as outputs

  • and just tell TensorFlow that we want to produce

  • these values as outputs.

  • But that's a problem too because the values,

  • those new w1, new w2 values are again these big tensors.

  • So now if we tell TensorFlow we want those as output,

  • we're going to again get this copying behavior

  • between CPU and GPU at every iteration.

  • So that's bad, we don't want that.

  • So there's a little trick you can do instead.

  • Which is that we add kind of a dummy node to the graph.

  • With these fake data dependencies

  • and we just say that this dummy node, called updates,

  • has data dependencies on new w1 and new w2.

  • And now when we actually run the graph,

  • we tell it to compute both the loss and this dummy node.

  • And this dummy node doesn't actually return

  • any value it just returns none, but because of this

  • dependency that we've put into the node it ensures

  • that when we run the updates value,

  • we actually also run these update operations.
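
A minimal sketch of that variable-based version (again TF 1.x style), with the assign ops and the grouped dummy update node, might look like:

    import numpy as np
    import tensorflow as tf   # TF 1.x-style API

    N, D, H = 64, 1000, 100
    x = tf.placeholder(tf.float32, shape=(N, D))
    y = tf.placeholder(tf.float32, shape=(N, D))

    # Weights are variables, so they live (and persist) inside the graph.
    w1 = tf.Variable(tf.random_normal((D, H)))
    w2 = tf.Variable(tf.random_normal((H, D)))

    h      = tf.maximum(tf.matmul(x, w1), 0)
    y_pred = tf.matmul(h, w2)
    loss   = tf.reduce_mean(tf.reduce_sum(tf.square(y_pred - y), axis=1))
    grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

    learning_rate = 1e-5
    new_w1 = tf.assign(w1, w1 - learning_rate * grad_w1)   # update ops in the graph
    new_w2 = tf.assign(w2, w2 - learning_rate * grad_w2)
    updates = tf.group(new_w1, new_w2)   # dummy node depending on both assigns

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())   # run the init ops once
        values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
        for t in range(50):
            # Fetching `updates` forces the assigns to run; it evaluates to None.
            loss_val, _ = sess.run([loss, updates], feed_dict=values)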

  • So, question?

  • [student's words obscured due to lack of microphone]

  • Is there a reason why we didn't put X and Y into the graph?

  • And have them stay as Numpy arrays?

  • So in this example we're reusing X and Y on every,

  • we're reusing the same X and Y on every iteration.

  • So you're right, we could have just also stuck those

  • in the graph, but in a more realistic scenario,

  • X and Y will be minibatches of data so those will actually

  • change at every iteration and we will want to feed

  • different values for those at every iteration.

  • So in this case, they could have stayed in the graph,

  • but in most cases they will change,

  • so we don't want them to live in the graph.

  • Oh, another question?

  • [student's words obscured due to lack of microphone]

  • Yeah, so we've told TensorFlow

  • that the outputs we want are loss and updates.

  • Updates is not actually a real value.

  • So when updates evaluates it just returns none.

  • But because of this dependency we've told it that updates

  • depends on these assign operations.

  • But these assign operations live inside

  • the computational graph and all live inside GPU memory.

  • So then we're doing these update operations

  • entirely on the GPU and we're no longer copying the

  • updated values back out of the graph.

  • [student's words obscured due to lack of microphone]

  • So the question is does tf.group return none?

  • So this gets into the trickiness of TensorFlow.

  • So tf.group returns some crazy TensorFlow value.

  • It sort of returns some like internal TensorFlow node

  • operation that we use to continue building the graph.

  • But when you execute the graph,

  • and when you tell, inside the session.run,

  • when we told it we want it to compute the concrete value

  • from updates, then that returns none.

  • So whenever you're working with TensorFlow

  • you have this funny indirection between building the graph

  • and running it: while building the graph the output value

  • is some funny weird object, and then you actually get

  • a concrete value when you run the graph.

  • So here after you run updates, then the output is none.

  • Does that clear it up a little bit?

  • [student's words obscured due to lack of microphone]

  • So the question is why is loss a value

  • and why is updates none?

  • That's just the way that updates works.

  • So loss is a value when we compute,

  • when we tell TensorFlow we want to run a tensor,

  • then we get the concrete value.

  • Updates is this kind of special other data type

  • that does not return a value, it instead returns none.

  • So it's kind of some TensorFlow magic that's going on there.

  • Maybe we can talk offline if you're still confused.

  • [student's words obscured due to lack of microphone]

  • Yeah, yeah, that behavior is coming from the group method.

  • So now, we kind of have this weird pattern where we

  • wanted to do these different assign operations,

  • we have to use this funny tf.group thing.

  • That's kind of a pain, so thankfully TensorFlow gives

  • you some convenience operations that kind of do that

  • kind of stuff for you.

  • And that's called an optimizer.

  • So here we're using a tf.train.GradientDescentOptimizer

  • and we're telling it what learning rate we want to use.

  • And you can imagine that there's RMSProp,

  • there's all kinds of different optimization algorithms here.

  • And now we call optimizer.minimize of loss

  • and now this is a pretty magical thing,

  • because now this call is aware that these variables

  • w1 and w2 are marked as trainable by default,

  • so then internally, inside this optimizer.minimize

  • it's going in and adding nodes to the graph

  • which will compute gradient of loss with respect

  • to w1 and w2 and then it's also performing that update

  • operation for you and it's doing the grouping operation

  • for you and it's doing the assigns.

  • It's like doing a lot of magical stuff inside there.

  • But then it ends up giving you this magical updates value

  • which, if you dig through the code they're actually using

  • tf.group so it looks very similar internally to what

  • we saw before.

  • And now when we run the graph inside our loop

  • we do the same pattern of telling it to compute loss

  • and updates.

  • And every time we tell the graph to compute updates,

  • then it'll actually go and update the graph.
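
With an optimizer, the same training setup shrinks to something like the sketch below (TF 1.x style); optimizer.minimize adds the gradient, assign, and grouping ops for all trainable variables for you.

    import numpy as np
    import tensorflow as tf   # TF 1.x-style API

    N, D, H = 64, 1000, 100
    x = tf.placeholder(tf.float32, shape=(N, D))
    y = tf.placeholder(tf.float32, shape=(N, D))
    w1 = tf.Variable(tf.random_normal((D, H)))
    w2 = tf.Variable(tf.random_normal((H, D)))

    h      = tf.maximum(tf.matmul(x, w1), 0)
    y_pred = tf.matmul(h, w2)
    loss   = tf.reduce_mean(tf.reduce_sum(tf.square(y_pred - y), axis=1))

    optimizer = tf.train.GradientDescentOptimizer(1e-3)
    updates = optimizer.minimize(loss)   # gradients + updates for trainable variables

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
        for t in range(50):
            loss_val, _ = sess.run([loss, updates], feed_dict=values)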

  • Question?

  • [student's words obscured due to lack of microphone]

  • Yeah, so what is the tf.global_variables_initializer?

  • So that's initializing w1 and w2 because these are

  • variables which live inside the graph.

  • So when we create

  • the tf.Variable we have this tf.random_normal

  • which is this initialization, so the

  • tf.global_variables_initializer is causing the

  • tf.random_normal to actually run and generate concrete values

  • to initialize those variables.

  • [student's words obscured due to lack of microphone]

  • Sorry, what was the question?

  • [student's words obscured due to lack of microphone]

  • So it knows that a placeholder is going to be fed

  • outside of the graph and a variable is something that

  • lives inside the graph.

  • So I don't know all the details about how it decides,

  • what exactly it decides to run with that call.

  • I think you'd need to dig through the code to figure

  • that out, or maybe it's documented somewhere.

  • So but now we've kind of got this,

  • again we've got this full example of training a

  • network in TensorFlow and we're kind of adding

  • bells and whistles to make it a little bit more convenient.

  • So we can also here, in the previous example

  • we were computing the loss explicitly using our own

  • tensor operations, TensorFlow you can always do that,

  • you can use basic tensor operations to compute

  • just about anything you want.

  • But TensorFlow also gives you a bunch of convenience

  • functions that compute these common neural network things

  • for you.

  • So in this case we can use tf.losses.mean_squared_error

  • and it just does the L2 loss for us so we don't have

  • to compute it ourself in terms of basic tensor operations.

  • So another kind of weirdness here is that it was kind of

  • annoying that we had to explicitly define our inputs

  • and define our weights and then like chain them together

  • in the forward pass using a matrix multiply.

  • And in this example we've actually not put biases

  • in the layer because that would be kind of an extra,

  • then we'd have to initialize biases,

  • we'd have to get them in the right shape,

  • we'd have to broadcast the biases against the output

  • of the matrix multiply and you can see that that

  • would kind of be a lot of code.

  • It would be kind of annoying to write.

  • And once you get to like convolutions

  • and batch normalizations and other types of layers

  • this kind of basic way of working,

  • of having these variables, having these inputs and outputs

  • and combining them all together with basic

  • computational graph operations could be a little bit

  • unwieldy and it could be really annoying to

  • make sure you initialize the weights with the right

  • shapes and all that sort of stuff.

  • So as a result, there's a bunch of sort of higher level

  • libraries that wrap around TensorFlow

  • and handle some of these details for you.

  • So one example that ships with TensorFlow,

  • is this tf.layers package.

  • So now in this code example you can see that our code

  • is only explicitly declaring the X and the Y

  • which are the placeholders for the data and the labels.

  • And now we say that H=tf.layers.dense,

  • we give it the input X and we tell it units=H.

  • This is again kind of a magical line

  • because inside this line, it's kind of setting up

  • w1 and b1, the bias, it's setting up variables for those

  • with the right shapes that are kind of inside the graph

  • but a little bit hidden from us.

  • And it's using this xavier initializer object

  • to set up an initialization strategy for those.

  • So before we were doing that explicitly ourselves

  • with the tf.random_normal business,

  • but now here it's kind of handling some of those details

  • for us and it's just spitting out an H,

  • which is again the same sort of H that we saw

  • in the previous example, it's just doing some of those

  • details for us.

  • And you can see here, we're also passing an

  • activation=tf.nn.relu so it's even doing the activation,

  • the relu activation function inside this layer for us.

  • So it's taking care of a lot of these architectural

  • details for us.

  • Question?

  • [student's words obscured due to lack of microphone]

  • Question is does the xavier initializer default

  • to particular distribution?

  • I'm sure it has some default, I'm not sure what it is.

  • I think you'll have to look at the documentation.

  • But it seems to be a reasonable strategy, I guess.

  • And in fact if you run this code,

  • it converges much faster than the previous one

  • because the initialization is better.

  • And you can see that we're using two calls to

  • tf.layers and this lets us build our model

  • without doing all these explicit bookkeeping details

  • ourself.

  • So this is maybe a little bit more convenient.
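
A minimal sketch of that tf.layers version might look like the following (TF 1.x style; the xavier_initializer here comes from tf.contrib.layers, matching the lecture's description):

    import numpy as np
    import tensorflow as tf   # TF 1.x-style API

    N, D, H = 64, 1000, 100
    x = tf.placeholder(tf.float32, shape=(N, D))
    y = tf.placeholder(tf.float32, shape=(N, D))

    init = tf.contrib.layers.xavier_initializer()
    # Each dense layer sets up its own weight and bias variables internally.
    h      = tf.layers.dense(inputs=x, units=H,
                             activation=tf.nn.relu, kernel_initializer=init)
    y_pred = tf.layers.dense(inputs=h, units=D, kernel_initializer=init)
    loss   = tf.losses.mean_squared_error(labels=y, predictions=y_pred)

    updates = tf.train.GradientDescentOptimizer(1e-1).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
        for t in range(50):
            loss_val, _ = sess.run([loss, updates], feed_dict=values)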

  • But tf.contrib.layers is really not the only game in town.

  • There's like a lot of different higher level libraries

  • that people build on top of TensorFlow.

  • And it's kind of due to this basic impedance mismatch

  • where the computational graph is a relatively low level thing,

  • but when we're working with neural networks

  • we have this concept of layers and weights

  • and some layers have weights associated with them,

  • and we typically think at a slightly higher level

  • of abstraction than this raw computational graph.

  • So that's what these various packages are trying to

  • help you with, and let you work at this higher level

  • of abstraction.

  • So another very popular package that you may have

  • seen before is Keras.

  • Keras is a very beautiful, nice API that sits on top of

  • TensorFlow and handles sort of building up the

  • computational graph for you in the back end.

  • By the way, Keras also supports Theano as a back end,

  • so that's also kind of nice.

  • And in this example you can see we build the model

  • as a sequence of layers.

  • We build some optimizer object

  • and we call model.compile and this does a lot of magic

  • in the back end to build the graph.

  • And now we can call model.fit and that does the whole

  • training procedure for us magically.
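
A minimal Keras sketch along the lines the lecture describes (argument names vary a bit between Keras versions, so treat this as an approximation):

    import numpy as np
    import keras   # standalone Keras on a TensorFlow (or Theano) backend

    N, D, H = 64, 1000, 100

    # Build the model as a sequence of layers.
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(units=H, input_dim=D, activation='relu'))
    model.add(keras.layers.Dense(units=D))

    # compile() builds the underlying graph; fit() runs the whole training loop.
    model.compile(loss='mean_squared_error',
                  optimizer=keras.optimizers.SGD(lr=1e-1))

    x = np.random.randn(N, D)
    y = np.random.randn(N, D)
    history = model.fit(x, y, epochs=50, batch_size=N, verbose=0)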

  • So I don't know all the details of how this works,

  • but I know Keras is very popular,

  • so you might consider using it if you're working with

  • TensorFlow.
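
A rough sketch of the Keras flow being described, using the Sequential API (argument names shift a little between Keras versions, so treat this as a flavor rather than exact code):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

N, D, H = 64, 1000, 100

# Build the model as a simple stack of layers.
model = Sequential()
model.add(Dense(units=H, input_dim=D))
model.add(Activation('relu'))
model.add(Dense(units=D))

# compile() does the work of building the underlying graph.
model.compile(loss='mean_squared_error', optimizer=SGD(lr=1e-2))

x, y = np.random.randn(N, D), np.random.randn(N, D)
# fit() runs the whole training loop for us.
history = model.fit(x, y, epochs=50, batch_size=N, verbose=0)
```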

  • Question?

  • [student's words obscured due to lack of microphone]

  • Yeah, so the question is like why there's no explicit

  • CPU, GPU going on here.

  • So I've kind of left that out to keep the code clean.

  • But you saw at the beginning examples

  • it was pretty easy to flip all these things

  • between CPU and GPU and there was either some global flag

  • or some different data type

  • or some with statement and it's usually relatively simple

  • and just about one line to swap in each case.

  • But exactly what that line looks like

  • differs a bit depending on the situation.
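
As a hedged illustration of those one-line switches, assuming TF 1.x device scopes and PyTorch's CUDA tensor types:

```python
import tensorflow as tf  # TF 1.x-era API assumed
import torch

# TensorFlow: pin part of the graph to a device with a `with` block.
with tf.device('/gpu:0'):   # or '/cpu:0'
    a = tf.random_normal((64, 100))
    b = tf.random_normal((100, 10))
    c = tf.matmul(a, b)

# PyTorch: swap the tensor's data type (or equivalently call .cuda()).
dtype = torch.cuda.FloatTensor   # torch.FloatTensor for CPU
x = torch.randn(64, 100).type(dtype)
```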

  • So there's actually like this whole large set

  • of higher level TensorFlow wrappers that you might see

  • out there in the wild.

  • And it seems that like even people within Google

  • can't really agree on which one is the right one to use.

  • So Keras and TFLearn are third party libraries

  • that are out there on the internet by other people.

  • But there's these three different ones,

  • tf.layers, TF-Slim and tf.contrib.learn

  • that all ship with TensorFlow, that are all kind of

  • doing a slightly different version of this

  • higher level wrapper thing.

  • There's another framework also from Google,

  • but not shipping with TensorFlow called Pretty Tensor

  • that does the same sort of thing.

  • And I guess none of these were good enough for DeepMind,

  • because they went ahead a couple weeks ago

  • and wrote and released their very own high level

  • TensorFlow wrapper called Sonnet.

  • So I wouldn't begrudge you if you were kind of confused

  • by all these things.

  • There's a lot of different choices.

  • They don't always play nicely with each other.

  • But you have a lot of options, so that's good.

  • TensorFlow has pretrained models.

  • There's some examples in TF-Slim, and in Keras.

  • 'Cause remember pretrained models are super important

  • when you're training your own things.

  • There's also this idea of Tensorboard

  • where you can load up your,

  • I don't want to get into details,

  • but Tensorboard you can add sort of instrumentation

  • to your code and then plot losses and things

  • as you go through the training process.
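
A small sketch of the kind of TensorBoard instrumentation being referred to, assuming the TF 1.x summary API (the log directory and the logged value are just placeholders):

```python
import tensorflow as tf  # TF 1.x-era API assumed

# Suppose `loss` is some scalar node already defined in the graph.
loss = tf.reduce_mean(tf.random_normal((64,)))

tf.summary.scalar('loss', loss)   # instrument the graph with a summary op
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('/tmp/logs', sess.graph)
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        summary, _ = sess.run([merged, loss])
        writer.add_summary(summary, step)   # log the value at this step
# Then run `tensorboard --logdir /tmp/logs` to see the plots.
```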

  • TensorFlow also lets you run distributed,

  • where you can break up a computational graph

  • and run it on different machines.

  • That's super cool but I think probably not anyone

  • outside of Google is really using that to great success

  • these days, but if you do want to run distributed stuff

  • probably TensorFlow is the main game in town for that.

  • A side note is that a lot of the design of TensorFlow

  • is kind of spiritually inspired by this earlier framework

  • called Theano from Montreal.

  • I don't want to go through the details here,

  • just if you go through these slides on your own,

  • you can see that the code for Theano ends up looking

  • very similar to TensorFlow.

  • Where we define some variables,

  • we do some forward pass, we compute some gradients,

  • and we compile some function, then we run the function

  • over and over to train the network.

  • So it kind of looks a lot like TensorFlow.

  • So we still have a lot to get through,

  • so I'm going to move on to PyTorch

  • and maybe take questions at the end.

  • So, PyTorch from Facebook is kind of different from

  • TensorFlow in that we have sort of three explicit

  • different layers of abstraction inside PyTorch.

  • So PyTorch has this tensor object which is just like a

  • Numpy array.

  • It's just an imperative array, it doesn't know anything

  • about deep learning, but it can run on GPU.

  • We have this variable object which is a node in a

  • computational graph which builds up computational graphs,

  • lets you compute gradients, that sort of thing.

  • And we have a module object which is a neural network

  • layer, and you can compose these modules together

  • to build big networks.

  • So if you kind of want to think about rough equivalents

  • between PyTorch and TensorFlow you can think of the

  • PyTorch tensor as fulfilling the same role

  • as the Numpy array in TensorFlow.

  • The PyTorch variable is similar to the TensorFlow tensor

  • or variable or placeholder, which are all sort of nodes

  • in a computational graph.

  • And now the PyTorch module is kind of equivalent

  • to these higher level things from tf.slim or tf.layers

  • or sonnet or these other higher level frameworks.

  • So right away one thing to notice about PyTorch

  • is that because it ships with this high level abstraction

  • and like one really nice higher level abstraction

  • called modules on its own, there's sort of less choice

  • involved.

  • Just stick with nn.Module and you'll be good to go.

  • You don't need to worry about which higher level wrapper

  • to use.

  • So PyTorch tensors, as I said, are just like Numpy arrays

  • so here on the right we've done an entire two layer network

  • using entirely PyTorch tensors.

  • One thing to note is that we're not importing Numpy here

  • at all anymore.

  • We're just doing all these operations using PyTorch tensors.

  • And this code looks exactly like the two layer net code

  • that you wrote in Numpy on the first homework.

  • So you set up some random data, you use some operations

  • to compute the forward pass.

  • And then we're explicitly computing the backward pass

  • ourselves.

  • Just sort of backpropping through the network,

  • through the operations, just as you did on homework one.

  • And now we're doing a manual update of the weights

  • using a learning rate and using our computed gradients.

  • But the major difference between the PyTorch tensor

  • and Numpy arrays is that they run on GPU

  • so all you have to do to make this code run on

  • GPU is use a different data type.

  • Rather than using torch.FloatTensor,

  • you do torch.cuda.FloatTensor, cast all of your tensors

  • to this new datatype and everything runs magically

  • on the GPU.

  • You should think of PyTorch tensors as just Numpy plus GPU.

  • That's exactly what it is, nothing specific

  • to deep learning.
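
A minimal sketch of that tensors-only two layer net, with the forward and backward passes written out by hand (sizes and learning rate are illustrative):

```python
import torch

dtype = torch.FloatTensor  # swap to torch.cuda.FloatTensor to run on GPU
N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in).type(dtype)
y = torch.randn(N, D_out).type(dtype)
w1 = torch.randn(D_in, H).type(dtype)
w2 = torch.randn(H, D_out).type(dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: imperative tensor ops, just like Numpy.
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass: hand-written backprop, as on the first homework.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Manual gradient descent step on the weights.
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
```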

  • So the next layer of abstraction in PyTorch is the variable.

  • So this is, once we moved from tensors to variables

  • now we're building computational graphs

  • and we're able to take gradients automatically

  • and everything like that.

  • So here, if X is a variable, then x.data is a tensor

  • and x.grad is another variable containing the gradients

  • of the loss with respect to that tensor.

  • So x.grad.data is an actual tensor containing

  • those gradients.

  • And PyTorch tensors and variables have the exact same API.

  • So any code that worked on PyTorch tensors you can just

  • make them variables instead and run the same code,

  • except now you're building up a computational graph

  • rather than just doing these imperative operations.

  • So here when we create these variables

  • each call to the variable constructor wraps a PyTorch

  • tensor and then also gives a flag whether or not

  • we want to compute gradients with respect to this variable.

  • And now in the forward pass it looks exactly like

  • it did before in the case with tensors

  • because they have the same API.

  • So now we're computing our predictions,

  • we're computing our loss in kind of this imperative

  • kind of way.

  • And then we call loss.backward and now all these gradients

  • come out for us.

  • And then we can make a gradient update step

  • on our weights using the gradients that are now present

  • in the w1.grad.data.

  • So this ends up looking quite like the Numpy case,

  • except all the gradients come for free.
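
And a sketch of the same network once autograd is in the picture; this is written with the lecture-era Variable wrapper, which newer PyTorch has since folded into Tensor (you would just set requires_grad=True on plain tensors instead):

```python
import torch
from torch.autograd import Variable

N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in), requires_grad=False)
y = Variable(torch.randn(N, D_out), requires_grad=False)
w1 = Variable(torch.randn(D_in, H), requires_grad=True)
w2 = Variable(torch.randn(H, D_out), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass looks exactly like the tensor version.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Zero old gradients, then let autograd fill in w1.grad and w2.grad.
    if w1.grad is not None:
        w1.grad.data.zero_()
        w2.grad.data.zero_()
    loss.backward()

    # Gradient step on .data so the update itself isn't tracked.
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data
```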

  • One thing to note that's kind of different between

  • PyTorch and TensorFlow is that in a TensorFlow case

  • we were building up this explicit graph,

  • then running the graph many times.

  • Here in PyTorch, instead we're building up a new graph

  • every time we do a forward pass.

  • And this makes the code look a bit cleaner.

  • And it has some other implications that we'll

  • get to in a bit.

  • So in PyTorch you can define your own new autograd functions

  • by defining the forward and backward in terms of tensors.

  • This ends up looking kind of like the module layers

  • code that you write for homework two.

  • Where you can implement forward and backward using

  • tensor operations and then stick these things inside

  • a computational graph.

  • So here we're defining our own relu

  • and then we can actually go in and use our own relu

  • operation and now stick it inside our computational graph

  • and define our own operations this way.

  • But most of the time you will probably not need

  • to define your own autograd operations.

  • Most of the times the operations you need will

  • mostly be already implemented for you.
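
A sketch of defining a ReLU as a custom autograd function; note this uses the current static-method Function API rather than the exact lecture-era signature:

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Save the input so the backward pass can use it.
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Pass gradients through only where the input was positive.
        x, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x < 0] = 0
        return grad_input

# Use it inside a computational graph like any other op:
x = torch.randn(64, 100, requires_grad=True)
y = MyReLU.apply(x)
y.sum().backward()   # x.grad is now populated
```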

  • So in TensorFlow we saw,

  • if we can move to something like Keras or TF.Learn

  • and this gives us a higher level API to work with,

  • rather than this raw computational graphs.

  • The equivalent in PyTorch is the nn package.

  • Where it provides these high level wrappers for working

  • with these things.

  • But unlike TensorFlow there's only one of them.

  • And it works pretty well, so just use that if you're

  • using PyTorch.

  • So here, this ends up kind of looking like Keras

  • where we define our model as some sequence of layers.

  • Our linear and relu operations.

  • And we use some loss function defined in the nn package

  • that's our mean squared error loss.

  • And now inside each iteration of our loop

  • we can run data forward through the model to get

  • our predictions.

  • We can run the predictions forward through the loss function

  • to get our scalar loss,

  • then we can call loss.backward, get all our gradients

  • for free and then loop over the parameters of the models

  • and do our explicit gradient descent step to update

  • the models.

  • And again we see that we're sort of building up this

  • new computational graph every time we do a forward pass.

  • And just like we saw in TensorFlow,

  • PyTorch provides these optimizer operations

  • that kind of abstract away this updating logic

  • and implement fancier update rules like Adam

  • and whatnot.

  • So here we're constructing an optimizer object

  • telling it that we want it to optimize over the

  • parameters of the model.

  • Giving it some learning rate and other hyperparameters.

  • And now after we compute our gradients

  • we can just call optimizer.step and it updates

  • all the parameters of the model for us right here.
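
Putting the nn package and the optimizer together, roughly what that training loop looks like (the layer sizes and the choice of Adam are just for illustration):

```python
import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Define the model as a sequence of layers, and take a loss from nn.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# The optimizer object abstracts away the update rule.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for t in range(500):
    y_pred = model(x)             # forward pass through the model
    loss = loss_fn(y_pred, y)     # scalar loss

    optimizer.zero_grad()         # clear old gradients
    loss.backward()               # backprop through this iteration's graph
    optimizer.step()              # update all model parameters
```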

  • So another common thing you'll do in PyTorch

  • a lot is define your own nn modules.

  • So typically you'll write your own class

  • which defines your entire model as a single

  • new nn module class.

  • And a module is just kind of a neural network layer

  • that can contain other modules,

  • trainable weights, or other kinds of state.

  • So in this case we can redo the two layer net example

  • by defining our own nn module class.

  • So now here in the initializer of the class

  • we're assigning this linear1 and linear2.

  • We're constructing these new module objects

  • and then store them inside of our own class.

  • And now in the forward pass we can use both our own

  • internal modules as well as arbitrary autograd operations

  • on variables to compute the output of our network.

  • So here we receive the, inside this forward method here,

  • the input x as a variable,

  • then we pass the variable to our self.linear1

  • for the first layer.

  • We use an autograd op, clamp, to compute the relu,

  • we pass the output of that to the second linear

  • and then that gives us our output.

  • And now the rest of this code for training this thing

  • looks pretty much the same.

  • Where we build an optimizer and loop over

  • and on every iteration feed data to the model,

  • compute the gradients with loss.backward,

  • call optimizer.step.

  • So this is like relatively characteristic

  • of what you might see in a lot of PyTorch type

  • training scenarios.

  • Where you define your own class,

  • defining your own model that contains other modules

  • and whatnot and then you have some explicit training

  • loop like this that runs it and updates it.
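
A sketch of that pattern: a module subclass holding two Linear submodules, trained with the usual explicit loop:

```python
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        # Child modules are registered automatically as attributes.
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # Mix submodules with plain autograd ops (clamp as the ReLU).
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)

N, D_in, H, D_out = 64, 1000, 100, 10
x, y = torch.randn(N, D_in), torch.randn(N, D_out)

model = TwoLayerNet(D_in, H, D_out)
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for t in range(500):
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```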

  • One kind of nice quality of life thing that you have

  • in PyTorch is a dataloader.

  • So a dataloader can handle building minibatches for you.

  • It can handle some of the multi-threading that we talked

  • about for you, where it can actually use multiple threads

  • in the background to build many batches for you

  • and stream off disk.

  • So here a dataloader wraps a dataset and provides

  • some of these abstractions for you.

  • And in practice when you want to run your own data,

  • you typically will write your own dataset class

  • which knows how to read your particular type of data

  • off whatever source you want and then wrap it in

  • a data loader and train with that.

  • So, here we can see that now we're iterating over

  • the dataloader object and at every iteration

  • this is yielding minibatches of data.

  • And it's internally handling the shuffling of the data

  • and multithreaded dataloading and all this sort of stuff

  • for you.

  • So this is kind of a completely PyTorch example

  • and a lot of PyTorch training code ends up looking

  • something like this.
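
A minimal sketch of wrapping data in a DataLoader and iterating over minibatches; TensorDataset stands in here for whatever custom Dataset class you would write for your own data:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

N, D_in, D_out = 640, 1000, 10
x, y = torch.randn(N, D_in), torch.randn(N, D_out)

dataset = TensorDataset(x, y)
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=2)   # load batches in background workers

model = torch.nn.Linear(D_in, D_out)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for epoch in range(10):
    for x_batch, y_batch in loader:   # yields shuffled minibatches
        loss = loss_fn(model(x_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```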

  • PyTorch provides pretrained models.

  • And this is probably the slickest pretrained model

  • experience I've ever seen.

  • You just say torchvision.models.alexnet(pretrained=True).

  • That'll go off in the background, download the pretrained

  • weights for you if you don't already have them,

  • and then it's right there, you're good to go.

  • So this is super easy to use.
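
Concretely, that pretrained-model call is about this short (newer torchvision versions prefer a weights= argument, but this is the idea):

```python
import torchvision

# Downloads the weights the first time, then caches them locally.
alexnet = torchvision.models.alexnet(pretrained=True)
resnet101 = torchvision.models.resnet101(pretrained=True)
```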

  • PyTorch also has, there's also a package called Visdom

  • that lets you visualize some of these loss statistics

  • somewhat similar to Tensorboard.

  • So that's kind of nice, I haven't actually gotten

  • a chance to play around with this myself so I can't really

  • speak to how useful it is,

  • but one of the major differences between Tensorboard

  • and Visdom is that Tensorboard actually lets you visualize

  • the structure of the computational graph.

  • Which is really cool, a really useful debugging strategy.

  • And Visdom does not have that functionality yet.

  • But I've never really used this myself so I can't really

  • speak to its utility.

  • As a bit of an aside, PyTorch is kind of an evolution of,

  • kind of a newer updated version of an older framework

  • called Torch which I worked with a lot in the last

  • couple of years.

  • And I don't want to go through the details here,

  • but PyTorch is pretty much better in a lot of ways

  • than the old Lua Torch, but they actually share a lot

  • of the same back end C code for computing with tensors

  • and GPU operations on tensors and whatnot.

  • So if you look through this Torch example,

  • some of it ends up looking kind of similar to PyTorch,

  • some of it's a bit different.

  • Maybe you can step through this offline.

  • But kind of the high level differences between

  • Torch and PyTorch are that Torch is actually in Lua,

  • not Python, unlike these other things.

  • So learning Lua is a bit of a turn off for some people.

  • Torch doesn't have autograd.

  • Torch is also older, so it's more stable,

  • less susceptible to bugs, there's maybe more example code

  • for Torch.

  • They're about the same speeds, that's not really a concern.

  • But in PyTorch it's in Python which is great,

  • you've got autograd which makes it a lot simpler

  • to write complex models.

  • In Lua Torch you end up writing a lot of your own

  • back prop code sometimes, so that's a little bit annoying.

  • But PyTorch is newer, there's less existing code,

  • it's still subject to change.

  • So it's a little bit more of an adventure.

  • But at least for me, I kind of prefer,

  • I don't really see much reason for myself

  • to use Torch over PyTorch anymore at this time.

  • So I'm pretty much using PyTorch exclusively for

  • all my work these days.

  • We talked a little bit about this idea

  • of static versus dynamic graphs.

  • And this is one of the main distinguishing features

  • between PyTorch and TensorFlow.

  • So we saw in TensorFlow you have these two stages

  • of operation where first you build up this

  • computational graph, then you run the computational graph

  • over and over again many many times reusing that same

  • graph.

  • That's called a static computational graph 'cause there's

  • only one of them.

  • And we saw PyTorch is quite different where we're actually

  • building up this new computational graph,

  • this new fresh thing on every forward pass.

  • That's called a dynamic computational graph.

  • For kind of simple cases, with kind of feed forward

  • neural networks, it doesn't really make a huge difference,

  • the code ends up kind of similarly

  • and they work kind of similarly,

  • but I do want to talk a bit about some of the implications

  • of static versus dynamic.

  • And what are the tradeoffs of those two.

  • So one kind of nice idea with static graphs

  • is that because we're kind of building up one

  • computational graph once, and then reusing it many times,

  • the framework might have the opportunity to go in

  • and do optimizations on that graph.

  • And kind of fuse some operations, reorder some operations,

  • figure out the most efficient way to execute

  • that graph so it can be really efficient.

  • And because we're going to reuse that graph

  • many times, maybe that optimization process

  • is expensive up front,

  • but we can amortize that cost with the speedups

  • that we've gotten when we run the graph many many times.

  • So as kind of a concrete example,

  • maybe if you write some graph which has convolution

  • and relu operations kind of one after another,

  • you might imagine that some fancy graph optimizer

  • could go in and actually output, like emit custom code

  • which has fused operations, fusing the convolution

  • and the relu so now it's computing the same thing

  • as the code you wrote, but now might be able to be

  • executed more efficiently.

  • So I'm not too sure on exactly what the state in practice

  • of TensorFlow graph optimization is right now,

  • but at least in principle, this is one place where

  • static graph really, you can have the potential for

  • doing this optimization in static graphs

  • where maybe it would be not so tractable for dynamic graphs.

  • Another kind of subtle point about static versus dynamic

  • is this idea of serialization.

  • So with a static graph you can imagine that you write

  • this code that builds up the graph

  • and then once you've built the graph,

  • you have this data structure in memory that represents

  • the entire structure of your network.

  • And now you could take that data structure

  • and just serialize it to disk.

  • And now you've got the whole structure of your network

  • saved in some file.

  • And then you could later reload that thing

  • and then run that computational graph without access

  • to the original code that built it.

  • So this would be kind of nice in a deployment scenario.

  • You might imagine that you might want to train your

  • network in Python because it's maybe easier to work with,

  • but then after you serialize that network

  • and then you could deploy it now in maybe a C++

  • environment where you don't need to use the original

  • code that built the graph.

  • So that's kind of a nice advantage of static graphs.

  • Whereas with a dynamic graph, because we're interleaving

  • these processes of graph building and graph execution,

  • you kind of need the original code at all times

  • if you want to reuse that model in the future.

  • On the other hand, some advantages for dynamic graphs

  • are that it kind of makes, it just makes your code

  • a lot cleaner and a lot easier in a lot of scenarios.

  • So for example, suppose that we want to do some

  • conditional operation where depending on the value

  • of some variable Z, we want to do different operations

  • to compute Y.

  • Where if Z is positive, we want to use one weight matrix,

  • if Z is negative we want to use a different weight matrix.

  • And we just want to switch off between these two alternatives.

  • In PyTorch because we're using dynamic graphs,

  • it's super simple.

  • Your code kind of looks exactly like you would expect,

  • exactly what you would do in Numpy.

  • You can just use normal Python control flow

  • to handle this thing.

  • And now because we're building up the graph each time,

  • each time we perform this operation will take one

  • of the two paths and build up maybe a different graph

  • on each forward pass, but for any graph that we do

  • end up building up, we can back propagate through it

  • just fine.

  • And the code is very clean, easy to work with.

  • Now in TensorFlow the situation is a little bit more

  • complicated because we build the graph once,

  • this control flow operator kind of needs to be

  • an explicit operator in the TensorFlow graph.

  • And now, so then you can see that we have this

  • tf.cond call which is kind of like a TensorFlow version

  • of an if statement, but now it's baked into

  • the computational graph rather than using sort of

  • Python control flow.

  • And the problem is that because we only build the graph

  • once, all the potential paths of control flow that

  • our program might flow through need to be baked

  • into the graph at the time we construct it before we ever

  • run it.

  • So that means that any kind of control flow operators

  • that you want to have need to be not Python control flow

  • operators, you need to use some kind of magic,

  • special TensorFlow operations to do control flow.

  • In this case this tf.cond.
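
To make the contrast concrete, a hedged sketch of both versions of that conditional (all the shapes and names are just illustrative):

```python
import torch
import tensorflow as tf  # TF 1.x-era API assumed

N, D, H = 64, 100, 50

# PyTorch: ordinary Python control flow, a new graph on each forward pass.
x = torch.randn(N, D, requires_grad=True)
w1, w2 = torch.randn(D, H), torch.randn(D, H)
z = torch.randn(())
y = x.mm(w1) if z.item() > 0 else x.mm(w2)

# TensorFlow: the branch has to be an explicit op baked into the graph.
x_tf = tf.placeholder(tf.float32, shape=(N, D))
z_tf = tf.placeholder(tf.float32, shape=())
w1_tf = tf.get_variable('w1', shape=(D, H))
w2_tf = tf.get_variable('w2', shape=(D, H))

y_tf = tf.cond(z_tf > 0,
               lambda: tf.matmul(x_tf, w1_tf),
               lambda: tf.matmul(x_tf, w2_tf))
```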

  • Another kind of similar situation happens if you want to

  • have loops.

  • So suppose that we want to compute some kind of recurrence

  • relation, where maybe y_t = y_{t-1} + x_t * W,

  • and each time we compute this,

  • we might have a different sized sequence of data.

  • And no matter the length of our sequence of data,

  • we just want to compute this same recurrence relation

  • no matter the size of the input sequence.

  • So in PyTorch this is super easy.

  • We can just kind of use a normal for loop in Python

  • to just loop over the number of times that we want to

  • unroll and now depending on the size of the input data,

  • our computational graph will end up as different sizes,

  • but that's fine, we can just back propagate through

  • each one, one at a time.
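
As a sketch, the PyTorch side of that recurrence is just an ordinary Python for loop over however many steps the input happens to have; the TensorFlow counterpart would be expressed with something like tf.foldl instead:

```python
import torch

T, D = 20, 100                       # the sequence length can vary per example
x = torch.randn(T, D)
W = torch.randn(D, D, requires_grad=True)

# y_t = y_{t-1} + x_t * W, unrolled with plain Python control flow.
y = torch.zeros(D)
for t in range(x.size(0)):
    y = y + x[t].matmul(W)

# The graph built above has one node per step; backprop just works.
y.sum().backward()
```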

  • Now in TensorFlow this becomes a little bit uglier.

  • And again, because we need to construct the graph

  • all at once up front, this control flow looping construct

  • again needs to be an explicit node in the TensorFlow graph.

  • So I hope you remember your functional programming

  • because you'll have to use those kinds of operators

  • to implement looping constructs in TensorFlow.

  • So in this case, for this particular recurrence relationship

  • you can use a foldl operation and pass in,

  • sort of implement this particular loop in terms of a foldl.

  • But what this basically means is that you have this sense

  • that TensorFlow is almost building its own entire

  • programming language, using the language of

  • computational graphs.

  • And any kind of control flow operator,

  • or any kind of data structure needs to be rolled

  • into the computational graph so you can't really utilize

  • all your favorite paradigms for working imperatively

  • in Python.

  • You kind of need to relearn a whole separate set

  • of control flow operators

  • if you want to do any kind of control flow

  • inside your computational graph using TensorFlow.

  • So at least for me, I find that kind of confusing,

  • a little bit hard to wrap my head around sometimes,

  • and I kind of like that using PyTorch dynamic graphs,

  • you can just use your favorite imperative programming

  • constructs and it all works just fine.

  • By the way, there actually is some very new library

  • called TensorFlow Fold which is another one of these

  • layers on top of TensorFlow that lets you implement

  • dynamic graphs, you kind of write your own code

  • using TensorFlow Fold that looks kind of like a dynamic

  • graph operation and then TensorFlow Fold does some magic

  • for you and somehow implements that in terms of the

  • static TensorFlow graphs.

  • This is a super new paper that's being presented

  • at ICLR this week in France.

  • So I haven't had the chance to like dive in and play

  • with this yet.

  • But my initial impression was that it does add some

  • amount of dynamic graphs to TensorFlow but it is still

  • a bit more awkward to work with than the sort of native

  • dynamic graphs you have in PyTorch.

  • So then, I thought it might be nice to motivate

  • like why would we care about dynamic graphs in general?

  • So one option is recurrent networks.

  • So you can see that for something like image captioning

  • we use a recurrent network which operates over

  • sequences of different lengths.

  • In this case, the sentence that we want to generate

  • as a caption is a sequence and that sequence can vary

  • depending on our input data.

  • So now you can see that we have this dynamism in the thing

  • where depending on the size of the sentence,

  • our computational graph might need to have more

  • or fewer elements.

  • So that's one kind of common application of dynamic graphs.

  • For those of you who took CS224N last quarter,

  • you saw this idea of recursive networks

  • where sometimes in natural language processing

  • you might, for example, compute a parse tree

  • of a sentence and then you want to have a neural

  • network kind of operate recursively up this parse tree.

  • So having a neural network that kind of works,

  • it's not just a sequential sequence of layers,

  • but instead it's kind of working over some graph

  • or tree structure instead where now each data point

  • might have a different graph or tree structure

  • so the structure of the computational graph

  • then kind of mirrors the structure of the input data.

  • And it could vary from data point to data point.

  • So this type of thing seems kind of complicated and

  • hairy to implement using TensorFlow,

  • but in PyTorch you can just kind of use

  • like normal Python control flow and it'll work out

  • just fine.

  • Another bit of more researchy application is this really

  • cool idea that I like called neural module networks

  • for visual question answering.

  • So here the idea is that we want to ask some questions

  • about images where we maybe input this image

  • of cats and dogs, there's some question,

  • what color is the cat, and then internally the system

  • can read the question and that has these different

  • specialized neural network modules for performing

  • operations like asking for colors and finding cats.

  • And then depending on the text of the question,

  • it can compile this custom architecture for answering

  • the question.

  • And now if we asked a different question,

  • like are there more cats than dogs?

  • Now we have maybe the same basic set of modules

  • for doing things like finding cats and dogs and counting,

  • but they're arranged in a different order.

  • So we get this dynamism again where different data points

  • might give rise to different computational graphs.

  • But this is a bit more of a researchy thing

  • and maybe not so mainstream right now.

  • But as kind of a bigger point, I think that there's

  • a lot of cool, creative applications that people

  • could do with dynamic computational graphs

  • and maybe there aren't so many right now,

  • just because it's been so painful to work with them.

  • So I think that there's a lot of opportunity

  • for doing cool, creative things with

  • dynamic computational graphs.

  • And maybe if you come up with cool ideas,

  • we'll feature it in lecture next year.

  • So I wanted to talk very briefly about Caffe

  • which is this framework from Berkeley.

  • Caffe is somewhat different from the other

  • deep learning frameworks in that in many cases

  • you can actually train networks without writing

  • any code yourself.

  • You kind of just call into these pre-existing binaries,

  • set up some configuration files and in many cases

  • you can train on data without writing any of your own code.

  • So, maybe first, you convert your data

  • into some format like HDF5 or LMDB and there exists

  • some scripts inside Caffe that can just convert like

  • folders of images and text files into these formats for you.

  • You need to define, now instead of writing code

  • to define the structure of your computational graph,

  • instead you edit some text file called a prototxt

  • which sets up the structure of the computational graph.

  • Here the structure is that we read from some input

  • HDF5 file, we perform some inner product,

  • we compute some loss and the whole structure

  • of the graph is set up in this text file.

  • One kind of downside here is that these files

  • can get really ugly for very large networks.

  • So for something like the 152 layer ResNet model,

  • which by the way was trained in Caffe originally,

  • then this prototxt file ends up almost 7000 lines long.

  • So people are not writing these by hand.

  • People will sometimes write Python scripts

  • to generate these prototxt files.

  • [laughter]

  • Then you're kind of in the realm of rolling your own

  • computational graph abstraction.

  • That's probably not a good idea, but I've seen that before.

  • Then, rather than having some optimizer object,

  • instead there's some solver, you define some solver things

  • inside another prototxt.

  • This defines your learning rate,

  • your optimization algorithm and whatnot.

  • And then once you do all these things,

  • you can just run the Caffe binary with the train command

  • and it all happens magically.

  • Caffe has a model zoo with a bunch of pretrained models,

  • that's pretty useful.

  • Caffe has a Python interface but it's not super

  • well documented.

  • You kind of need to read the source code of the python

  • interface to see what it can do,

  • so that's kind of annoying.

  • But it does work.

  • So, kind of my general thing about Caffe is that it's

  • maybe good for feed forward models,

  • it's maybe good for production scenarios,

  • because it doesn't depend on Python.

  • But probably for research these days, I've seen Caffe

  • being used maybe a little bit less.

  • Although I think it is still pretty commonly used

  • in industry again for production.

  • I promise one slide, one or two slides on Caffe 2.

  • So Caffe 2 is the successor to Caffe which is from Facebook.

  • It's super new, it was only released a week ago.

  • [laughter]

  • So I really haven't had the time to form a super

  • educated opinion about Caffe 2 yet,

  • but it uses static graphs kind of similar to TensorFlow.

  • Kind of like Caffe 1, the core is written in C++

  • and they have some Python interface.

  • The difference is that now you no longer need to

  • write your own Python scripts to generate prototxt files.

  • You can kind of define your computational graph structure

  • all in Python, with an API that looks

  • kind of like TensorFlow.

  • But then you can spit out, you can serialize this

  • computational graph structure to a prototxt file.

  • And then once your model is trained and whatnot,

  • then we get this benefit that we talked about of static

  • graphs where you can, you don't need the original

  • training code now in order to deploy a trained model.

  • So one interesting thing is that you've seen Google

  • maybe has one major deep learning framework,

  • which is TensorFlow, where Facebook has these two,

  • PyTorch and Caffe 2.

  • So these are kind of different philosophies.

  • Google's kind of trying to build one framework to rule

  • them all that maybe works for every possible scenario

  • for deep learning.

  • This is kind of nice because it consolidates all efforts

  • onto one framework.

  • It means you only need to learn one thing

  • and it'll work across many different scenarios

  • including like distributed systems, production,

  • deployment, mobile, research, everything.

  • Only need to learn one framework to do all these things.

  • Whereas Facebook is taking a bit of a different approach.

  • Where PyTorch is really more specialized,

  • more geared towards research so in terms of writing

  • research code and quickly iterating on your ideas,

  • that's super easy in PyTorch, but for things like

  • running in production, running on mobile devices,

  • PyTorch doesn't have a lot of great support.

  • Instead, Caffe 2 is kind of geared toward those more

  • production oriented use cases.

  • So my kind of general study, my general, overall advice

  • about like which framework to use for which problems

  • is kind of the following:

  • I think TensorFlow is a pretty safe bet for just about

  • any project that you want to start new, right?

  • Because it is sort of one framework to rule them all,

  • it can be used for just about any circumstance.

  • However, you probably need to pair it with a

  • higher level wrapper and if you want dynamic graphs,

  • you're maybe out of luck.

  • Some of the code ends up looking a little bit uglier

  • in my opinion, but maybe that's kind of a cosmetic detail

  • and it doesn't really matter that much.

  • I personally think PyTorch is really great for research.

  • If you're focused on just writing research code,

  • I think PyTorch is a great choice.

  • But it's a bit newer, has less community support,

  • less code out there, so it could be a bit of an adventure.

  • If you want more of a well trodden path, TensorFlow

  • might be a better choice.

  • If you're interested in production deployment,

  • you should probably look at Caffe, Caffe 2 or TensorFlow.

  • And if you're really focused on mobile deployment,

  • I think TensorFlow and Caffe 2 both have some built in

  • support for that.

  • So it's kind of unfortunate, there's not just like

  • one global best framework, it kind of depends

  • on what you're actually trying to do,

  • what applications you anticipate, but these are kind of

  • my general advice on those things.

  • So next time we'll talk about some case studies

  • about various CNN architectures.

