
  • [MUSIC PLAYING]

  • ALEXANDRE PASSOS: Hello, my name is Alex,

  • and I work on TensorFlow.

  • I am here today to tell you all a little bit about how

  • you can use TensorFlow to do deep learning research more

  • effectively.

  • What we're going to do today is we're

  • going to take a little tour of a few TensorFlow features that

  • show you how controllable, flexible, and composable

  • TensorFlow is.

  • We'll take a quick look at those features, some old

  • and some new.

  • And these are by no means all the features

  • that are useful for research.

  • But these features let you accelerate

  • your research using TensorFlow in ways

  • that perhaps you're not aware of.

  • And I want to start by helping you control how TensorFlow

  • represents state.

  • If you've used TensorFlow before,

  • and I am sure you have at this point,

  • you know that a lot of our libraries

  • use TF variables to represent state,

  • like your model parameters.

  • And for example, a Keras dense layer

  • has one kernel matrix and an optional bias

  • vector stored in it.

  • And these parameters are updated when you train your model.

  • And part of the whole point of training models

  • is so that we find out what value those parameters should

  • have had in the first place.

  • And if you're making your own layers library,

  • you can control absolutely everything about how

  • that state is represented.

  • But you can also crack open the black box

  • and control how state is represented,

  • even inside the libraries that we give you.

  • So for example, we're going to use this little running example

  • of what if I wanted to re-parametrize a Keras

  • layer so it does some computation to generate

  • the kernel matrix, say to save space

  • or to get the correct inductive bias.

  • The way to do this is to use tf.variable_creator_scope.

  • It is a tool we have that lets you take control of the state

  • creation process in TensorFlow.

  • It's a context manager, and all variables created under it

  • go through a function you specify.

  • And this function can choose to do nothing.

  • It can delegate.

  • Or it can modify how variables are created.

  • Under the hood, this is what a tf.distribute.Strategy scope

  • uses.

  • So it's the same tool that we use

  • to build TensorFlow that we make available to you,

  • so you can extend it.

  • And here, if I wanted to do this re-parametrization of the Keras

  • layer, it's actually pretty simple.

  • First, I define what type I want to use to store those things.

  • Here, I'm using this factorized variable type,

  • which is a tf.Module.

  • tf.Module is a very convenient type.

  • You can have variables as members,

  • and we can track them automatically

  • for you and all sorts of nice things.

  • And once we define this type, it's

  • really just a left half and right half.

  • I can tell TensorFlow how to use

  • objects of this type as part of TensorFlow computations.

  • And what we do here is we do a matrix multiplication

  • of the left component and the right component.

  • And now that I know how to use this object, I can create it.

  • And this is all that I need to make

  • my own little variable_creator_scope.

  • In this case, I want to peek at the shape.

  • And if I'm not creating a matrix,

  • just delegate to whatever TensorFlow

  • would have done, normally.

  • And if I am creating a matrix, instead

  • of creating a single matrix, I'm going

  • to create this factorized variable that

  • has the left half and the right half.

  • And finally, I now get to just use it.

  • And here, I create a little Keras layer.

  • I apply it.

  • And I can check that it is indeed using

  • my factorized representation.

  • This gives you a lot of power.

  • Because now, you can take large libraries of code

  • that you did not write and do dependency injection

  • to change how they behave.

  • Probably if you're going to do this at scale,

  • you might want to implement your own layer

  • so you can have full control.

  • But it's also very valuable for you

  • to be able to extend the ones that we provide you.

  • So use tf.variable_creator_scope to control the state.
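
As a rough sketch of what this looks like in code (the class name FactorizedVariable, the rank, and the initializers below are illustrative, not the exact ones from the talk's slides):

    import tensorflow as tf

    # A tf.Module that stores a kernel matrix as two low-rank factors.
    class FactorizedVariable(tf.Module):
        def __init__(self, shape, rank=4, dtype=tf.float32):
            self.left = tf.Variable(tf.random.normal([shape[0], rank], dtype=dtype))
            self.right = tf.Variable(tf.random.normal([rank, shape[1]], dtype=dtype))

    # Tell TensorFlow how to use this object inside a computation:
    # multiply the two factors back together whenever a tensor is needed.
    def to_tensor(value, dtype=None, name=None, as_ref=False):
        return tf.matmul(value.left, value.right)

    tf.register_tensor_conversion_function(FactorizedVariable, to_tensor)

    # The creator function: intercept 2-D (matrix) variables and replace them
    # with the factorized representation; delegate everything else unchanged.
    def factorized_creator(next_creator, **kwargs):
        init = kwargs["initial_value"]
        init = init() if callable(init) else init
        if len(init.shape) != 2:
            kwargs["initial_value"] = init
            return next_creator(**kwargs)
        return FactorizedVariable(init.shape)

    # All variables created under this scope go through factorized_creator.
    with tf.variable_creator_scope(factorized_creator):
        layer = tf.keras.layers.Dense(64, use_bias=False)
        out = layer(tf.ones([8, 64]))

    print(type(layer.kernel))  # FactorizedVariable rather than a plain tf.Variable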

  • A big part of TensorFlow and why we

  • use these libraries to do research at all,

  • as opposed to just writing plain Python code,

  • is that deep learning is really dependent on very

  • fast computation.

  • And one thing that we're making more and more

  • easy to use in TensorFlow is our underlying compiler, XLA, which

  • we've always used for TPUs.

  • But now, we're making it easier for you to use

  • for CPUs and GPUs, as well.

  • And the way we're doing this is using tf.function with

  • the experimental_compile=True annotation.

  • What this means is if you mark a function as a function

  • that you want to compile, we will compile it,

  • or we'll raise an error.

  • So you can trust the code you write

  • inside a block is going to run as quickly as if you had

  • handwritten your own fused TensorFlow kernel for CPUs,

  • and a fused CUDA kernel for GPUs, and all the machinery, yourself.

  • But you get to write high level, fast, Python TensorFlow code.

  • One example where you might easily

  • find yourself writing your own little custom kernel

  • is if you want to do research on activation functions, which

  • is something that people want to do.

  • This activation function is a terrible one,

  • but activation functions tend to look a little like this.

  • They have a bunch of nonlinear operations

  • and a bunch of element-wise things.

  • But in general, they apply lots of

  • little element-wise operations to each element of your vector.

  • And these things, if you try to run them

  • in the normal TensorFlow interpreter,

  • they're going to be rather slow, because they're

  • going to do a new memory allocation and a copy of things

  • around for every single one of these little operations.

  • Whereas if you were to make a single, fused kernel,

  • you would just write a single thing for each coordinate that

  • does the exponentiation, and logarithm, and addition,

  • and all the things like that.

  • But what we can see here is that if I take this function,

  • and I wrap it with experimental_compile=True,

  • and I benchmark running a compiled version versus running

  • a non-compiled version, on this tiny benchmark,

  • I can already see a 25% speedup.

  • And it's even better than this, because we

  • see speedups of this sort of magnitude or larger,

  • even on fairly large models, including BERT.

  • Because in large models, we can fuse more computation

  • into the linear operations, and your reductions,

  • and things like that.

  • And this can get you compounding wins.

  • So try using experimental_compile=True

  • for automatic compilation in TensorFlow.

  • You should be able to apply it to small pieces of code

  • and replace what you'd normally have to do with fused kernels.

  • So you know what type of research code a lot

  • of people rely on that has lots of very small element-wise

  • operations and would greatly benefit from the fusion

  • powers of a compiler--

  • I think it's optimizers.

  • And a nice thing about doing your optimizer research

  • in TensorFlow is that Keras makes it very easy

  • for you to implement your own stochastic-gradient-descent-style

  • optimizer.

  • You can make a class that subclasses

  • the tf.keras optimizer and overrides three methods.

  • You can define your initialization,

  • where you set up your learning rate or whatever,

  • in your init.

  • You can create any accumulator variables, like your momentum,

  • or higher order powers of gradients, or anything else

  • you need, in create slots.

  • And you can define how to apply this optimizer

  • update to a single variable.

  • Once you've defined those three things,

  • you have everything TensorFlow needs

  • to be able to run your custom optimizer.

  • And normally, TensorFlow optimizers

  • are written with hand-fused kernels, which

  • can make the code very complicated to read,

  • but ensures that they run very quickly.

  • What I'm going to show here is an example

  • of a very simple optimizer-- again, not

  • a particularly good one.

  • This is a weird variation that has

  • some momentum and some higher order powers,

  • but it doesn't train very well.

  • However, it has the same sorts of operations that you

  • would have on a real optimizer.

  • And I can just write them as regular TensorFlow operations

  • in my model.

  • And by just adding this line with experimental_compile=True,

  • I can get it to run just as fast as a hand-fused kernel.

  • And the benchmarks are written here.

  • It was over a 2x speedup.

  • So this can really matter when you're

  • doing a lot of research that looks like this.
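
A minimal sketch of that pattern, assuming the OptimizerV2-style tf.keras base class from the TF 2.x of that era (the toy update rule here is mine, not the one from the talk, and a real optimizer would also need the sparse-update and config methods):

    import tensorflow as tf

    class ToyOptimizer(tf.keras.optimizers.Optimizer):
        """A deliberately silly optimizer, just to show the three methods."""

        def __init__(self, learning_rate=0.01, decay_rate=0.9, name="ToyOptimizer", **kwargs):
            # 1. Initialization: register hyperparameters.
            super().__init__(name, **kwargs)
            self._set_hyper("learning_rate", learning_rate)
            self._set_hyper("decay_rate", decay_rate)

        def _create_slots(self, var_list):
            # 2. One accumulator ("slot") per trainable variable.
            for var in var_list:
                self.add_slot(var, "accumulator")

        @tf.function(experimental_compile=True)  # fuse the whole update with XLA
        def _resource_apply_dense(self, grad, var, apply_state=None):
            # 3. The per-variable update, written as plain TensorFlow ops.
            lr = tf.cast(self._get_hyper("learning_rate"), var.dtype)
            decay = tf.cast(self._get_hyper("decay_rate"), var.dtype)
            acc = self.get_slot(var, "accumulator")
            acc.assign(decay * acc + (1 - decay) * grad * grad)
            return var.assign_sub(lr * grad * tf.math.rsqrt(acc + 1e-7))

    # Usage: pass it to Keras like any other optimizer.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer=ToyOptimizer(), loss="mse")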

  • So Keras optimizers and compilation

  • let you experiment really fast with fairly intricate things,

  • and I hope you will use this to accelerate your research.

  • The next thing I want to talk about is vectorization.

  • It's, again, super important for performance.

  • I'm sure you've heard, at this point,

  • that Moore's Law is over, and we're no longer

  • going to get a free lunch in terms

  • of processes getting faster.

  • The way we're making our machine learning models faster

  • is by doing more and more things in parallel.

  • And this is great, because we get to unlock

  • the potential of GPUs and TPUs.

  • This is also a little scary, because now,

  • even though we know what we want to do to a single, little data

  • point, we have to write these batched operations, which

  • can be fairly complicated.

  • In TensorFlow, we've recently been developing

  • automatic vectorization for you,

  • where you can write the element-wise code that you want

  • to write and get the performance of the batched computation

  • that you want.

  • So the working example I'm going to use here is Jacobians.

  • If you're familiar with TensorFlow's gradient tape,

  • you know that tape.gradient computes

  • the gradient of a scalar, not the gradient

  • of a vector-valued or matrix-valued function.

  • And if you want the Jacobian of a vector-valued

  • or matrix-valued function, you can just

  • call tape.gradient many, many times.

  • And here, I have a very, very simple function

  • that is just the exponentiation of the square of a matrix.

  • And I want to compute the Jacobian.

  • And I do this by writing this double

  • for loop, where for every row, for every column,

  • I compute the gradient with respect to the row and column

  • output, and then stack the results together

  • to get my higher-order Jacobian tensor.

  • This is fine.

  • This has always worked.

  • However, you can replace these explicit loops with

  • tf.vectorized_map.

  • And one, you get a small readability win.

  • Because now we're saying that, yes, you're

  • just applying this operation everywhere.

  • But also, you get a very big performance win.

  • And this version that uses tf.vectorized_map is

  • substantially faster than the version that doesn't use it.
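
Here is a sketch of both versions, using the running example of exponentiating the square of a matrix element-wise (the one-hot trick for picking out a single output gradient is my way of writing the per-element gradient, not necessarily the exact code from the slide):

    import tensorflow as tf

    x = tf.random.normal([5, 5])

    with tf.GradientTape(persistent=True) as tape:
        tape.watch(x)
        y = tf.exp(tf.square(x))      # element-wise exp of the square

    n = y.shape.num_elements()        # 25 output elements

    # d y_flat[k] / d x, obtained by backpropagating a one-hot output gradient.
    def grad_of_element(k):
        one_hot = tf.reshape(tf.one_hot(k, n), y.shape)
        return tape.gradient(y, x, output_gradients=one_hot)

    # Manual version: one tape.gradient call per output element, then stack.
    manual = tf.stack([grad_of_element(k) for k in range(n)])
    manual_jacobian = tf.reshape(manual, y.shape + x.shape)        # [5, 5, 5, 5]

    # Vectorized version: the same per-element function, but tf.vectorized_map
    # turns the 25 separate gradient computations into one batched computation.
    vectorized = tf.vectorized_map(grad_of_element, tf.range(n))
    vectorized_jacobian = tf.reshape(vectorized, y.shape + x.shape)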

  • But of course, you don't want to have

  • to write this all the time, which

  • is why, really, for Jacobians, we implemented it directly

  • in the gradient tape.

  • And you can call tape.jacobian to get the Jacobian computed

  • for you.

  • And if you do this, it's over 10 times faster on this example

  • than doing the manual loop yourself because we can

  • do the automatic vectorization.
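
Which, continuing the same sketch, is just:

    with tf.GradientTape() as tape:
        tape.watch(x)
        y = tf.exp(tf.square(x))

    # One call: the tape vectorizes the per-element gradients for you.
    jacobian = tape.jacobian(y, x)    # shape [5, 5, 5, 5]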

  • But the reason why I opened this black box

  • and showed you the previous slide

  • is so you can know how to implement something that is not

  • a Jacobian but is like a Jacobian yourself--

  • and how you can use TensorFlow's automatic vectorization

  • capabilities together with the other tools you

  • have in your research to make you more productive.

  • So remember to use automatic vectorization,

  • so you can write short code that actually runs really fast.

  • And let us add the batch dimensions for you.

  • And here is another interesting performance point.

  • Because with TensorFlow, we have always

  • had the big, rectangular array or hyper-array, the tensor,

  • as the core data structure.

  • And tensors are great.

  • In the world we live in today,

  • where we need to leverage as much parallelism

  • as we can to make our models go fast, operations on tensors

  • tend to be naturally highly parallel by default.

  • It's a very intuitive API to program the capabilities

  • of these supercomputers we have today, with many GPUs

  • and TPUs wired together.

  • And as long as you can stay within this tensor box,

  • you are happy.

  • You get peak performance.

  • And everything's great.

  • However, as deep learning becomes

  • more and more successful, and as we

  • want to do research on more and more different types of data,

  • we start to want to work with things that don't really

  • look like these big, rectangular arrays--

  • a structure that is ragged and has a different shape.

  • And in TensorFlow, we've been recently working

  • really hard at adding native support to ragged data.

  • So here's an example.

  • Pretend it's 10 years ago and you have a bunch of sentences.

  • They all have different lengths.

  • And you want to turn them into embedding so you can feed them

  • into a neural network.

  • So what you want to do here is you're

  • going to start with all the words in that sentence.

  • You're going to look up their index in your vocabulary table.

  • Then you're going to use the index to look up

  • a row in an embedding table.

  • And finally, you want to average the embeddings of all

  • the words in a sentence to get an embedding for each sentence,

  • which you can then use in the rest of your model.

  • And even though we're working with ragged data

  • here, because all the sentences have different lengths, if you

  • think about the underlying operations that we're

  • doing here, most of them don't actually

  • have to care about this raggedness.

  • So we can make this run very efficiently

  • by decomposing this representation

  • into two things--

  • a tensor that concatenates across the ragged dimension

  • and a separate tensor that tells you

  • how to find the individual ragged elements in there.

  • And once you have this representation,

  • it's very easy and efficient to do all the computations

  • that we wanted to do to solve the task

  • from the previous slide.

  • You have always been able to do this manually in TensorFlow.

  • We've always had the features and capabilities for you

  • to do this.

  • Now, with tf.RaggedTensor, we're taking over

  • the management of this from you and just giving you an object,

  • a ragged tensor, that looks like a tensor.

  • It can be manipulated like a tensor,

  • but is represented like this.

  • And so it has ragged shapes and can

  • represent much more flexible data structures

  • than you could otherwise.

  • So let's go over a little bit of a code example, here.

  • Here is my data, same one from the previous slides.

  • It's just a Python list.

  • And I can take this Python list and turn it

  • into a ragged tensor by using tf.ragged.constant.

  • And the right thing is going to happen.

  • TensorFlow is going to automatically concatenate

  • across the ragged dimension and keep this array of indices

  • under the hood.

  • Then I can define my vocabulary table and do my lookup.

  • And here, I'm showing you how to do your lookup or any operation

  • on a ragged tensor where that operation hasn't actually

  • been rewritten to support raggedness.

  • You can always use tf.ragged.map_flat_values

  • to access the underlying values of your ragged tensor,

  • and apply operations to them.

  • Also, once we get to the embedding matrix,

  • many of the TensorFlow core operations

  • have been adapted to work with ragged tensors.

  • So in this case, if you want to do a tf.gather

  • to find out the correct rows of the embedding

  • matrix for each word, you can just

  • apply your tf.gather on the ragged tensor,

  • and the right thing will happen.

  • And similarly, if you want to reduce and average out

  • the ragged dimension, it's very easy to do.

  • You can just use the standard tf.reduce_mean.

  • And the nice thing is that, at this point, because we've

  • reduced out the ragged dimension,

  • we have no ragged dimension.

  • And we just have a dense tensor that

  • has the original shape you expected to have.
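
Putting that walkthrough together in code (the sentences, vocabulary table, and embedding sizes here are made up for illustration; the talk's exact data isn't shown):

    import tensorflow as tf

    # Sentences of different lengths, as a plain Python list of lists.
    sentences = [["what", "makes", "a", "good", "embedding"],
                 ["deep", "learning"],
                 ["tensorflow", "is", "flexible", "and", "composable", "too"]]

    # Concatenates across the ragged dimension and keeps the row splits for you.
    words = tf.ragged.constant(sentences)

    # An illustrative vocabulary table with out-of-vocabulary hash buckets.
    vocab = tf.lookup.StaticVocabularyTable(
        tf.lookup.KeyValueTensorInitializer(
            keys=tf.constant(["deep", "learning", "tensorflow", "embedding"]),
            values=tf.constant([0, 1, 2, 3], dtype=tf.int64)),
        num_oov_buckets=16)

    # The lookup op doesn't know about raggedness, so map it over the flat values.
    word_ids = tf.ragged.map_flat_values(vocab.lookup, words)

    # tf.gather has been adapted to ragged tensors: gather one embedding row per word.
    embedding_matrix = tf.random.normal([20, 8])   # 4 vocab words + 16 OOV buckets
    word_embeddings = tf.gather(embedding_matrix, word_ids)

    # Reduce out the ragged dimension: one dense [3, 8] embedding per sentence.
    sentence_embeddings = tf.reduce_mean(word_embeddings, axis=1)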

  • And I think this is really important, because now, it's

  • much easier, much more intuitive and affordable for you

  • to work with data that doesn't necessarily

  • look like the big, rectangular data

  • that TensorFlow is optimized for.

  • And yet, it lets you get most of the performance

  • that you'd get with the big, rectangular data.

  • It's a win-win situation, and I'm really looking forward

  • to seeing what interesting applications you all

  • are going to work on that use and exploit

  • this notion of raggedness.

  • So please, play with tf.ragged.

  • Try it out.

  • It's very exciting.

  • So next up, we're going to go over

  • a particular, interesting example of research

  • done with TensorFlow.

  • And Akshay here, who is a PhD student at Stanford University,

  • is going to come and tell us all about convex optimization

  • layers in TensorFlow.

  • Thank you.

  • [MUSIC PLAYING]
