[MUSIC PLAYING] ALEXANDRE PASSOS: Hello, my name is Alex, and I work on TensorFlow. I am here today to tell you a little bit about how you can use TensorFlow to do deep learning research more effectively. What we're going to do today is take a little tour of a few TensorFlow features that show you how controllable, flexible, and composable TensorFlow is. We'll take a quick look at those features, some old and some new. These are by no means all the features that are useful for research, but they let you accelerate your research using TensorFlow in ways that perhaps you're not aware of.

And I want to start by helping you control how TensorFlow represents state. If you've used TensorFlow before, and I am sure you have at this point, you know that a lot of our libraries use tf.Variables to represent state, like your model parameters. For example, a Keras dense layer has one kernel matrix and an optional bias vector stored in it. These parameters are updated when you train your model, and part of the whole point of training models is to find out what value those parameters should have had in the first place. If you're making your own layers library, you can control absolutely everything about how that state is represented. But you can also crack open the black box and control how state is represented, even inside the libraries that we give you.

So we're going to use this little running example: what if I wanted to re-parametrize a Keras layer so it does some computation to generate the kernel matrix, say to save space or to get the correct inductive bias? The way to do this is to use tf.variable_creator_scope. It is a tool we have that lets you take control of the state creation process in TensorFlow. It's a context manager, and all variables created under it go through a function you specify. This function can choose to do nothing, it can delegate, or it can modify how variables are created. Under the hood, this is how DistributionStrategy's scope is usually implemented. So it's the same tool that we use to build TensorFlow, made available to you so you can extend it.

And here, if I wanted to do this re-parametrization of the Keras layer, it's actually pretty simple. First, I define what type I want to use to store those things. Here, I'm using this FactorizedVariable type, which is a tf.Module. tf.Module is a very convenient type: you can have variables as members, and we track them automatically for you, and all sorts of nice things. And once we define this type, it's really just a left half and a right half. I can tell TensorFlow how to use objects of this type as part of TensorFlow computations, and what we do here is a matrix multiplication of the left component and the right component. Now that TensorFlow knows how to use this object, I can create it. And this is all I need to make my own little variable creator scope. In this case, I want to peek at the shape, and if I'm not creating a matrix, just delegate to whatever TensorFlow would have done normally. If I am creating a matrix, instead of creating a single matrix, I create this factorized variable that has the left half and the right half. And finally, I get to just use it: I create a little Keras layer, I apply it, and I can check that it is indeed using my factorized representation. This gives you a lot of power.
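The exact code from the slides isn't reproduced in this transcript, but a minimal sketch of the pattern described above might look like the following. The FactorizedVariable name follows the talk; the rank of 2, the random-normal initializers, and the shape check that skips small matrices are illustrative choices of mine, not the talk's exact code:

```python
import tensorflow as tf

# A factorized stand-in for a kernel matrix: it stores two small factors
# instead of the full [rows, cols] matrix. tf.Module tracks the variables.
class FactorizedVariable(tf.Module):
  def __init__(self, left, right):
    self.left = left    # shape [rows, rank]
    self.right = right  # shape [rank, cols]

# Tell TensorFlow how to use this object inside ordinary ops: whenever a
# tensor is needed, multiply the two factors back together.
tf.register_tensor_conversion_function(
    FactorizedVariable,
    lambda value, *args, **kwargs: tf.matmul(value.left, value.right))

RANK = 2  # arbitrary choice for this sketch

def factorized_creator(next_creator, **kwargs):
  # Peek at the shape of the variable being created. Keras usually passes the
  # initial value as a callable, so handle both cases.
  init = kwargs["initial_value"]
  init_value = init() if callable(init) else tf.convert_to_tensor(init)
  shape = init_value.shape
  # Only intercept large-enough matrices; everything else (biases, and the
  # small factor variables created below, which re-enter this scope) is
  # delegated to whatever TensorFlow would normally do.
  if len(shape) != 2 or min(shape) <= RANK:
    return next_creator(**kwargs)
  return FactorizedVariable(
      tf.Variable(tf.random.normal([shape[0], RANK])),
      tf.Variable(tf.random.normal([RANK, shape[1]])))

with tf.variable_creator_scope(factorized_creator):
  layer = tf.keras.layers.Dense(10)
  outputs = layer(tf.zeros([8, 64]))  # building the layer inside the scope

print(type(layer.kernel))  # FactorizedVariable rather than a plain tf.Variable
```

The min(shape) check is just a simple way I chose to keep the two factor variables, which are themselves created under the scope, from being factorized again.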
The power here is that you can now take large libraries of code that you did not write and do dependency injection to change how they behave. If you're going to do this at scale, you probably want to implement your own layer so you can have full control, but it's also very valuable to be able to extend the ones that we provide you. So use tf.variable_creator_scope to control the state.

A big part of TensorFlow, and why we use these libraries to do research at all as opposed to just writing plain Python code, is that deep learning is really dependent on very fast computation. And one thing that we're making easier and easier to use in TensorFlow is our underlying compiler, XLA, which we've always used for TPUs. Now we're making it easier for you to use for CPUs and GPUs as well. The way we're doing this is tf.function with the experimental_compile=True annotation. What this means is that if you mark a function as one you want compiled, we will compile it, or we'll raise an error. So you can trust that the code you write inside that block is going to run as quickly as if you had handwritten your own fused TensorFlow kernel for CPUs or a fused CUDA kernel, and all the machinery, yourself. But you get to write high-level, fast, Python TensorFlow code.

One example where you might easily find yourself writing your own little custom kernel is research on activation functions, which is something people want to do. Activation functions tend to look a little like this one (this is a terrible one): a bunch of nonlinear operations and a bunch of element-wise things. In general, they apply lots of little element-wise operations to each element of your vector. If you try to run them in the normal TensorFlow interpreter, they're going to be rather slow, because they do a new memory allocation and copy things around for every single one of these little operations. Whereas if you were to write a single fused kernel, you would just write a single thing for each coordinate that does the exponentiation, and the logarithm, and the addition, and all the things like that. But what we can see here is that if I take this function, wrap it with experimental_compile=True, and benchmark a compiled version against a non-compiled version, on this tiny benchmark I can already see a 25% speedup. And it's even better than this, because we see speedups of this magnitude or larger even on fairly large models, including BERT, because in large models we can fuse more computation into the linear operations, and your reductions, and things like that. And this can get you compounding wins. So try using experimental_compile=True for automatic compilation in TensorFlow. You should be able to apply it to small pieces of code and replace fused kernels that you'd otherwise have to write by hand.
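The talk's actual activation function isn't in the transcript, so here is a made-up one with the same flavor, lots of small element-wise operations, to show roughly how the wrapping and a timing comparison might look. The function itself, the tensor size, and the iteration count are illustrative, not from the slides:

```python
import timeit
import tensorflow as tf

def weird_activation(x):
  # A deliberately silly activation: lots of little element-wise ops
  # (exp, log, abs, tanh, square, multiplies, adds) that XLA can fuse.
  return tf.math.log1p(tf.exp(-tf.abs(x))) * tf.tanh(x) + 0.1 * tf.square(x)

uncompiled = tf.function(weird_activation)
compiled = tf.function(weird_activation, experimental_compile=True)

x = tf.random.normal([1000, 1000])

# Warm up both so tracing and XLA compilation don't count against the timing.
uncompiled(x)
compiled(x)

# .numpy() forces the result so we time the actual computation, not dispatch.
print("uncompiled:", timeit.timeit(lambda: uncompiled(x).numpy(), number=100))
print("compiled:  ", timeit.timeit(lambda: compiled(x).numpy(), number=100))
```

In newer TensorFlow releases the same flag is spelled jit_compile=True, but the idea is the same.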
So, you know what type of research code a lot of people rely on that has lots of very small element-wise operations and would greatly benefit from the fusion powers of a compiler? I think it's optimizers. And a nice thing about doing your optimizer research in TensorFlow is that Keras makes it very easy to implement your own stochastic-gradient-style optimizer. You make a class that subclasses the tf.keras optimizer base class and override three methods. You define your initialization, where you compute your learning rate or whatever, in your __init__. You create any accumulator variables, like your momentum, or higher-order powers of gradients, or anything else you need, by creating slots. And you define how to apply the optimizer update to a single variable. Once you've defined those three things, you have everything TensorFlow needs to run your custom optimizer.

Normally, TensorFlow optimizers are written with hand-fused kernels, which can make the code very complicated to read, but ensures that they run very quickly. What I'm showing here is an example of a very simple optimizer, again, not a particularly good one. It's a weird variation that has some momentum and some higher-order powers, but it doesn't train very well. However, it has the same sorts of operations that you would have in a real optimizer, and I can just write them as regular TensorFlow operations in my model. And by just adding this one line with experimental_compile=True, I can get it to run just as fast as a hand-fused kernel. The benchmarks are shown here; it was over a 2x speedup. So this can really matter when you're doing a lot of research that looks like this. So Keras optimizers and compilation let you experiment really fast with fairly intricate things, and I hope you will use this to accelerate your research.

The next thing I want to talk about is vectorization. It's, again, super important for performance. I'm sure you've heard at this point that Moore's Law is over, and we're no longer going to get a free lunch in terms of processors getting faster. The way we're making our machine learning models faster is by doing more and more things in parallel. This is great, because we get to unlock the potential of GPUs and TPUs. It's also a little scary, because now, even though we know what we want to do to a single little data point, we have to write these batched operations, which can be fairly complicated. In TensorFlow, we've recently been developing automatic vectorization for you, where you can write the element-wise code that you want to write and get the performance of the batched computation that you want.

The working example I'm going to use here is Jacobians. If you're familiar with TensorFlow's gradient tape, you know that tape.gradient computes the gradient of a scalar, not the gradient of a vector-valued or matrix-valued function. If you want the Jacobian of a vector-valued or matrix-valued function, you can just call tape.gradient many, many times. Here, I have a very, very simple function that is just the exponential of the square of a matrix, and I want to compute its Jacobian. I do this by writing a double for loop, where for every row and every column, I compute the gradient of the output at that row and column, and then stack the results together to get my higher-order Jacobian tensor. This is fine; this has always worked. However, you can replace these explicit loops with tf.vectorized_map. One, you get a small readability win, because now we're saying that, yes, you're just applying this operation everywhere. But you also get a very big performance win: the version that uses tf.vectorized_map is substantially faster than the version that doesn't use it. But of course, you don't want to have to write this all the time, which is why, for Jacobians, we implemented it directly in the gradient tape, and you can call tape.jacobian to get the Jacobian computed for you.
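Here is a rough sketch of the three approaches just described, using the toy function from the talk (the exponential of the square). The shapes, the one-hot "selector" passed through output_gradients to pick out one output element at a time, and the variable names are my own plumbing choices, not the exact code from the slides:

```python
import tensorflow as tf

def f(x):
  # The toy function from the talk: the exponential of the square.
  return tf.exp(tf.square(x))

x = tf.random.normal([10, 10])

# Record the forward pass once on a persistent tape.
with tf.GradientTape(persistent=True) as tape:
  tape.watch(x)
  y = f(x)
  # Keep the individual output elements on the tape so each one can be
  # differentiated separately in the manual version below.
  elements = [y[i, j] for i in range(10) for j in range(10)]

# 1) Manual Jacobian: one tape.gradient call per output element, then stack.
manual = tf.stack([tape.gradient(e, x) for e in elements])
manual = tf.reshape(manual, [10, 10, 10, 10])

# 2) The same per-element computation, written once and handed to
#    tf.vectorized_map. Each "element" is a one-hot selector that picks out
#    one output via output_gradients (a vector-Jacobian product).
def grad_of_one_output(selector):
  return tape.gradient(y, x, output_gradients=tf.reshape(selector, [10, 10]))

vectorized = tf.vectorized_map(grad_of_one_output, tf.eye(100))
vectorized = tf.reshape(vectorized, [10, 10, 10, 10])

# 3) Or just ask the tape for the whole thing; this uses the same
#    vectorization machinery under the hood.
builtin = tape.jacobian(y, x)  # shape [10, 10, 10, 10]
```

In this sketch all three produce the same [10, 10, 10, 10] tensor; the point is that the per-element code is written once and the batching is added for you.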
And if you do this, it's over 10 times faster on this example than doing the manual loop yourself, because we can do the automatic vectorization. But the reason I opened this black box and showed you the previous slide is so you know how to implement something yourself that is not a Jacobian but is Jacobian-like, and how you can use TensorFlow's automatic vectorization capabilities together with the other tools you have in your research to make you more productive. So remember to use automatic vectorization, so you can write short code that actually runs really fast, and let us add the batch dimensions for you.

And here is another interesting performance point. With TensorFlow, we have always had the big, rectangular array, or hyper-array, the tensor, as the core data structure. And tensors are great. In the world we live in today, where we need to leverage as much parallelism as we can to make our models go fast, operations on tensors tend to be naturally highly parallel by default. It's a very intuitive API for programming the capabilities of these supercomputers we have today, with many GPUs and TPUs wired together. And as long as you can stay within this tensor box, you are happy, you get peak performance, and everything's great. However, as deep learning becomes more and more successful, and as we want to do research on more and more different types of data, we start to want to work with things that don't really look like these big, rectangular arrays: structures that are ragged and have a different shape. And in TensorFlow, we've recently been working really hard at adding native support for ragged data.

So here's an example. Pretend it's 10 years ago and you have a bunch of sentences. They all have different lengths, and you want to turn them into embeddings so you can feed them into a neural network. What you want to do here is start with all the words in each sentence. You look up their indices in your vocabulary table. Then you use those indices to look up rows in an embedding table. And finally, you average the embeddings of all the words in a sentence to get an embedding for each sentence, which you can then use in the rest of your model. And even though we're working with ragged data here, because all the sentences have different lengths, if you think about the underlying operations that we're doing, most of them don't actually have to care about this raggedness. So we can make this run very efficiently by decomposing this representation into two things: a tensor that concatenates across the ragged dimension and a separate tensor that tells you how to find the individual ragged elements in there. And once you have this representation, it's very easy and efficient to do all the computations that we wanted to do to solve the task from the previous slide.

You have always been able to do this manually in TensorFlow; we've always had the features and capabilities for you to do this. Now, with tf.RaggedTensor, we're taking over the management of this from you and just giving you an object, a ragged tensor, that looks like a tensor and can be manipulated like a tensor, but is represented like this. And so it has ragged shapes and can represent much more flexible data structures than you could otherwise. So let's go over a little bit of a code example here. Here is my data, same as on the previous slides. It's just a Python list.
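The transcript doesn't include the actual sentences or tables from the slides, so here is an illustrative stand-in for the whole walkthrough that follows. The example sentences, the tf.lookup.StaticHashTable used as the vocabulary, and the random embedding matrix are placeholder choices of mine:

```python
import tensorflow as tf

# Illustrative data standing in for the slide: sentences of different lengths.
sentences = [["deep", "learning", "is", "fun"],
             ["tensorflow", "is", "fun"],
             ["hello", "world"]]

ragged_words = tf.ragged.constant(sentences)            # shape [3, None]

# A toy vocabulary table; unknown words map to id 0 for simplicity here.
vocab = ["hello", "world", "deep", "learning", "is", "fun", "tensorflow"]
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        vocab, tf.range(len(vocab), dtype=tf.int64)),
    default_value=0)

# The lookup op doesn't know about raggedness, so apply it to the flat values.
word_ids = tf.ragged.map_flat_values(table.lookup, ragged_words)

# tf.gather has been adapted to ragged tensors: gather embedding rows directly.
embedding_matrix = tf.random.normal([len(vocab), 8])
word_embeddings = tf.gather(embedding_matrix, word_ids)  # shape [3, None, 8]

# Reducing over the ragged (word) dimension gives back a plain dense tensor.
sentence_embeddings = tf.reduce_mean(word_embeddings, axis=1)  # shape [3, 8]
```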
And I can take this Python list and turn it into a ragged tensor by using tf.ragged.constant, and the right thing is going to happen: TensorFlow automatically concatenates across the ragged dimension and keeps the array of indices under the hood. Then I can define my vocabulary table and do my lookup. And here, I'm showing you how to do your lookup, or any operation on a ragged tensor where that operation hasn't actually been rewritten to support raggedness: you can always use tf.ragged.map_flat_values to access the underlying values of your ragged tensor and apply operations to them. Once we get to the embedding matrix, many of the TensorFlow core operations have been adapted to work with ragged tensors. So in this case, if you want to do a tf.gather to find the correct rows of the embedding matrix for each word, you can just apply tf.gather to the ragged tensor, and the right thing will happen. And similarly, if you want to reduce and average out the ragged dimension, it's very easy to do: you can just use the standard tf.reduce_mean. And the nice thing is that at this point, because we've reduced out the ragged dimension, we have no ragged dimension left; we just have a dense tensor with the shape you expected to have. And I think this is really important, because now it's much easier and much more intuitive for you to work with data that doesn't necessarily look like the big, rectangular data that TensorFlow is optimized for, and yet it lets you get most of the performance that you'd get with the big, rectangular data. It's a win-win situation, and I'm really looking forward to seeing what interesting applications you all are going to work on that use and exploit this notion of raggedness. So please, play with tf.ragged. Try it out. It's very exciting.

So next up, we're going to go over a particular, interesting example of research done with TensorFlow. And Akshay here, who is a PhD student at Stanford University, is going to come and tell us all about convex optimization layers in TensorFlow. Thank you. [MUSIC PLAYING]