ALEX PASSOS: I'm Alex Passos, and I'm here again to talk about functions, not sessions. tf.function is the new way of using graphs in TensorFlow in TF2. All the material I'm going to cover here, the design and the motivation, is mostly described in one of the RFCs in the TensorFlow community GitHub repo. So if you go to github.com/tensorflow/community and look in the rfcs directory, you will see an RFC with exactly this title, where we go over a bunch of the motivation and a bunch of the high-level design. Here I'm mostly going to focus on some nitty-gritty details of the motivation and more details about the implementation -- things that, if you're working on TensorFlow and using functions to do something, or you're curious about function internals, I hope will at least point you to the right places to start reading the code to understand what's happening. I'm mostly going to focus today on the high-level Python side of things. There's another training session later -- I think the title is going to be "eager execution runtime" -- that will focus more on the C++ side of things.

So I think to understand functions, it helps to understand where we're coming from, which is the session.run world of TensorFlow 1. In TF1, when TensorFlow was originally designed, it was designed as a C++ runtime first, and only later came a Python API. And as far as a C++ runtime goes, the API of graphs and sessions is pretty reasonable. You build a graph with some function that the runtime does not care about, and then you connect to the runtime by opening a session. This connection is important because the runtime can be local, can be distributed -- there are all sorts of in-between things. And to actually run computation, you just call session.run. Because you have a graph, you give it the names of your inputs, the names of your outputs, the names of particular nodes that you want to run, and the runtime will go do its thing and return the results to you as normal C++ arrays that you can use to manipulate your data.

So this is very nice and convenient if you're writing in C++ and programming at this level. You generally write the code that looks like this once, and you spend your entire life as a TensorFlow developer writing the little part that I abstracted out, called BuildMyGraph. And I think it's an understatement to say that manually writing protocol buffers is very awkward. So we very, very quickly decided this is not a good way to go and built an API around it. The first version of the API was very explicit: you created a graph, and then every time you created an op, you passed the graph as an argument. This is still fine, because it's very explicit that you're building a graph, so you can keep the mental model that you're building a graph that you're then going to give to a runtime to execute. This is not really idiomatic Python code, though, and it's also very easy to see how to make it idiomatic Python code: you just stick the graph in a global context manager and add a bunch of operator overloads and things like that. You end up with code that looks like what TensorFlow code looks like today, which is a little unfortunate, because by reading that code you can't really tell whether an object is a tensor -- and hence only has a value during an execution of a session, and is this deferred quantity that has a name and might have some known properties about its shape, but not all of them -- or whether it's just a normal Python object or a NumPy array.
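As a minimal, hypothetical sketch of that ambiguity (written with the TF1-style API via tf.compat.v1), the two expressions below look the same but mean very different things:

    import numpy as np
    import tensorflow.compat.v1 as tf  # TF1-style graphs and sessions
    tf.disable_eager_execution()

    a = np.ones([2, 2])
    b = a * 2.0 + 1.0        # a NumPy array: it has a value right now

    x = tf.ones([2, 2])
    y = x * 2.0 + 1.0        # a symbolic Tensor: no value until session.run
    with tf.Session() as sess:
        print(sess.run(y))   # only here does y get a concrete value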
And this creates a lot of confusion and, I think, leads to a very unnatural programming model. The session.run thing also has a granularity problem, which is that, in the way it was originally built, the stuff that you pass to session.run is a quantum of all the stuff you want to execute. And around it is this very rigid boundary where you keep stuff in the host memory of your client program, give it to session.run, and then get results back into the host memory of your client program.

One example that I think illustrates why this is not ideal is a reinforcement learning agent implemented over a recurrent neural network. In that scenario, your agent is going to run a loop where it reads an observation from your environment, which is some arbitrary code that runs on your host, and it has some state. The state is initialized to zero. The agent looks at the observation, runs it through the neural network, and that neural network spits out a new state and an action for the agent to perform in the environment. You take that action, bring it to client memory, give it to the C++ code for the environment, your Atari game or whatever. That runs for a while and then gives you back a new observation. You want to ship this new observation and the RNN state back to the RNN. But if your RNN is running on another device, say a GPU, there was really no reason for you to ship your RNN state back to your client and then from the client back to the device. So the boundary for stuff you want to run here is not really working: the boundary for stuff you want to run is not the same as the boundary for stuff that wants to live on a device or wants to live on the host.

And this gets even more complicated once you put automatic differentiation into the story, because TensorFlow uses a symbolic representation of your computation that we call a graph, and we do automatic differentiation on this symbolic representation. So now the graph not only has to be a quantum for stuff you want to run, it also has to be a quantum for stuff you differentiate. If you stick with this reinforcement learning agent example, a popular thing that people used to do -- before we had substantially better deep reinforcement learning algorithms -- is policy gradients. And the simplest policy gradient algorithm is called REINFORCE. What it amounts to doing is: it runs your agent for m time steps, you get the probability of the agent taking the actions it actually took, and you take the gradient of that probability, multiply it by the reward your agent got, and apply this to the weights. So now not only do you want to avoid transferring the RNN state back and forth between the host and your accelerator, but you also want to backprop through a number of steps that might not even be known before you start your computation.

Another issue is that session.run asks for too much information every single time. What most training loops or inference loops in TensorFlow look like is not a single call to session.run, but a bunch of calls to session.run in a loop. And in all those calls, you're running the same ops, you're fetching the same tensors, and you're feeding the same symbolic tensors slightly different numerical values.
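A minimal sketch of that repeated feed/fetch pattern (hypothetical model, TF1-style API via tf.compat.v1):

    import numpy as np
    import tensorflow.compat.v1 as tf
    tf.disable_eager_execution()

    x = tf.placeholder(tf.float32, shape=[None, 10])
    w = tf.Variable(tf.random_normal([10, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(100):
            batch = np.random.rand(32, 10).astype(np.float32)
            # Same fetches and same feed keys every iteration; only the
            # numerical values change, but session.run revalidates each call.
            sess.run([train_op, loss], feed_dict={x: batch})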
And because the session.run API doesn't know that you're going to be calling those things in a loop where some things don't change and some things do change, it has to re-perform a bunch of validation every time. So we put a cache in front of that validation, and computing the cache key becomes a performance problem. Derek had the idea of separating the stuff that changes from the stuff that doesn't change into the session.make_callable API, where you call it once with the stuff that doesn't change, and you get back a function that you call with just the stuff that changes. So now all the validation that you're performing n times is of the stuff that changes, and the validation that you're performing only once is of the stuff that stays the same. This is not just a performance win, it's also kind of a usability win, because just by looking at the call in your code, you know what is fixed and what varies.

And finally, the last very awkward thing about session.run is that graph pruning is a very complicated model to program to when you're writing in an imperative host programming language. For example, I have my first function there, where I create a variable, assign a value to it, increment it a little bit, and then return something that uses the variable times some constant. If you just write code like this, because I didn't look at the return value of the assign_add, that assignment will never happen. There's no way to make that assignment happen in TensorFlow, because you created a tensor, threw it away, and did not keep a reference to it so that you could session.run it later. And you might think, well, that's crazy -- why don't you just keep those references under the hood and do something magical to fix it? The problem is that it's very easy for you as a user to rely on the fact that this pruning is going to be performed, to try to encapsulate your code a little better. A design pattern I've seen a lot is that when you have some structure -- for example, my fn2 there has a reinforcement learning environment, and that env object is some complicated Python thing that knows how to build a bunch of graphs -- you can encapsulate that in a single function in your code that returns how to get the current observation, how to apply an action, and how to reset that environment. So your code is now very concise, but in practice, you have a function that returns three things, and you never want those three things to run together; you always want at most one of them to run at any point in time. So this is a little frustrating, because we've kind of locked ourselves out of being able to fix this problem.

TensorFlow has a few partial solutions to this problem. I think the most comprehensive partial solution to the problems with session.run is called partial_run. But it's inherently limited, because it requires you to have a fully unrolled graph, it does not work with arbitrary control flow, and it requires a complicated dance of specifying everything you're likely to fetch in the future, then the things you're going to fetch now, and keeping and passing tensor handles around. It's very, very easy to make mistakes when you're doing it.
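To make the pruning gotcha above concrete, here is a minimal hypothetical sketch (TF1-style API via tf.compat.v1) of the first function described:

    import tensorflow.compat.v1 as tf
    tf.disable_eager_execution()

    def fn1():
        v = tf.Variable(0.0)
        v.assign_add(1.0)   # return value thrown away: this op gets pruned
        return v * 3.0      # only the subgraph needed for this is run

    out = fn1()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(out))   # prints 0.0 -- the assign_add never ran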
Plus, what happens is that you as a user often write a Python function; TensorFlow then runs that function to create a graph; then we take the graph, validate it, prune it, and do a bunch of transformations. And we hope that what we got out to run is exactly the nodes that you had intended to run in that Python function in the first place. But because we have all these steps in the middle, it's very easy to drop things and confuse things.

So all these usability problems are inherent, I think, in coupling this session.run API with a host programming language that tries to make your code look very imperative, like native Python code. And the way we break this and solve these problems is with tf.function. So what are the core ideas of tf.function? One is that your function's inputs and outputs live on devices; they don't have to live on the host. Another is that a function is differentiable, and a function is an execution unit, but it's not forced to be the whole execution unit or the whole differentiable thing. You should be able to differentiate across many calls in an execution, and you should be able to make an execution unit out of many functions. This way you get to break that single-quantum-of-work requirement of session.run and write your programs in a more idiomatic way.

AUDIENCE: Could I clarify the first point? I assume device also includes CPU device?

ALEX PASSOS: Yes, it also includes CPU.

AUDIENCE: --host memory.

ALEX PASSOS: It lives on host--

AUDIENCE: [INAUDIBLE].

ALEX PASSOS: No. It lives in host memory, but it's not immediately accessible to your client program. The act of running a function does not require that its inputs are visible to the client program, and does not immediately make its outputs visible to the client program. To make the outputs visible, you need to run an operation, and to ship the inputs into the runtime, you need to run an operation. So we put the boundaries not at every call, but at the points where you have the data and where you get the data.

Another fundamental property of tf.function is that the notion of what should run is not a property of control edges or graph pruning, but a property of the stuff that happens in the body of the function. So while we trace the Python code to build the function graph, any stateful operation that ends up in there must run. And it's up to the TensorFlow runtime to run those operations in an order that is indistinguishable, as far as the user is concerned, from running them in program order. And finally, every bit of state that outlives the function call should be an argument -- either an explicit argument passed by the user from Python, or an argument that's implicitly captured, like a closure capture, that's just passed to the runtime. By making all the state that outlives the function call an argument, we get to enforce property three without too much complexity.

And incidentally, once you've looked at those requirements, you kind of see why we have eager execution. Because once you have the ability to run functions like that, really, every single operation should act like a function if you just run it, which is why I think eager execution is important as a mental model, even if in practice you might want almost all of your code to live inside a graph for performance or for the ability to deploy to lighter runtimes and things like that. So once you do go that way, there is a problem, because you now can take an arbitrary piece of TensorFlow code and run it eagerly or run it inside a tf.function.
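A minimal sketch of that point (hypothetical names): the same Python function run eagerly, op by op, and then wrapped in tf.function so it runs as a graph.

    import tensorflow as tf

    def dense(x, w):
        return tf.nn.relu(tf.matmul(x, w))

    x = tf.random.normal([4, 8])
    w = tf.random.normal([8, 2])

    eager_out = dense(x, w)        # runs eagerly, op by op

    dense_fn = tf.function(dense)  # traces a graph on the first call
    graph_out = dense_fn(x, w)     # runs as a single graph function

    # Both should produce the same values.
    print(float(tf.reduce_max(tf.abs(eager_out - graph_out))))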
And this means that any semantic difference between those two modes is going to break the abstraction barrier and can cause a lot of lost productivity, which is why we want to do things like Autograph and automatic control dependencies, and have TensorFlow work as hard as it can to reduce those differences, make them explicit, and raise errors as soon as we can, so that we don't lock ourselves into a bad state.

And the really big caveat -- the easiest place for you to make mistakes -- is when it comes to variables, because variables in TF1, as you should know if you've watched the previous video, behave very differently from how you'd expect variables to behave in an eager language. So this is one of those places where naively writing code can very easily get you in trouble, even with code that looks very reasonable. My favorite short example is this: a function that creates a variable and returns an operation that uses the value of the variable. If you run this in TF1, or in graph mode, what will happen is you run this Python code once, and then you call session.run on the result many times. Moreover, as a side effect of running this code, that variable is going to be added to some global collection that will allow you to modify its value later, even though it's completely scoped inside this function. So in TF1, you run this function, it uses a single variable, you get to modify its value, and then you call session.run on the result and get different numbers out. But in eager mode, every time you run this function, we create a new variable, do a matrix multiplication, and throw the variable away. So if you have code like this, your code is going to visibly see the differences between TensorFlow v1 and TensorFlow v2. So we just disallow this.

There were a few reasonable options we could have chosen for tf.function. One is to say that tf.function should follow the eager semantics for variables: when you have code like this, we insert a create-variable op, insert a matmul, insert a destroy-variable op, and then return the value of the matmul. We could also have chosen for tf.function to follow v1 graph semantics for variables, and every time you create a variable, we reuse it based on the name or something like that. These are very easy options to implement. A third option, also very easy to implement, is to just disallow creating variables in tf.function. These are all reasonable options, relatively straightforward to implement, and not what we chose to do. What we chose to do is a compromise that allows us to turn more code into tf.functions, while avoiding allowing code that would behave differently in eager and in graph mode, and while avoiding breaking the expectations of code that was written with TF1 in mind and then got wrapped in tf.function so it works inside TF2. The compromise we adopted is this: if you just try to create a variable inside a tf.function, that's disallowed and raises an exception. However, if you guard your code so that the variable is only created the first time the function is called, we allow it. And the reason we allow this is that if you run a function like the one at the bottom eagerly, it will create a variable a single time and use it many times, and that has exactly the same semantics as calling session.run many times on the result of the function at the top.
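A minimal sketch of that guarded pattern (hypothetical names): the variable is only created on the first call, so eager execution and tf.function agree on what happens.

    import tensorflow as tf

    class Dense(object):
        def __init__(self):
            self.w = None

        @tf.function
        def __call__(self, x):
            if self.w is None:
                # Only happens on the first call/trace, so tf.function allows it.
                self.w = tf.Variable(tf.random.normal([int(x.shape[-1]), 2]))
            return tf.matmul(x, self.w)

    layer = Dense()
    x = tf.ones([3, 8])
    print(layer(x).shape)   # creates the variable
    print(layer(x).shape)   # reuses it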
So by writing your code in a way that promises you will respect the semantics -- that you'll act in a way that doesn't see the difference in semantics between eager and graph -- we allow it. Right now we have a relatively big hammer to detect these failures, but I expect that over time we'll make this a little more precise and start allowing more and more code. For example, an issue now is that if you create a tf.function and you pass something like an optimizer as an argument to the function, we trace it once. The first time you use an optimizer, the optimizer might create some variables for the internal state of Adam or Adagrad or momentum or something like that. But now if you pass a different optimizer, this different optimizer might try to create its own variables. And this is, again, perfectly safe code, because you're passing the optimizer as an argument, which means we're retracing the function, so the variable creation happens during tracing. But currently, as of the recording of this video, we raise an exception. It's an exception we could probably stop raising if we were a little more precise, and I'd like to do that at some point. The idea is that I want to expand the scope of code that is allowed to create variables in tf.function, as long as it still only encompasses code that behaves the same in eager mode and in TF1, just because this way there's less room for mistakes. Once there is no more TF1 code out there, I think we can flip the flag and allow all variable creation inside tf.function with fully eager semantics. But that's going to take a while.

AUDIENCE: How do you detect whether the code is in a disallowed format?

ALEX PASSOS: Ah, I actually have a slide on that later. But essentially, what we do is we run it twice. We run it once and see if you've created any variables. If you haven't created any variables, you're safe. If you have created some variables, we retrace your Python code, see if it creates new variables again, and raise an exception if it does. And we also set it up so that any time we ever need to retrace your tf.function, if you create variables on a subsequent call, we'll raise an error.

Another issue is that Python is a very dynamic language, while TensorFlow graphs are very static. TensorFlow graphs are not as static as XLA HLO graphs, but they're still very static when it comes to the types of things and the number of outputs of an operation. And sometimes TensorFlow graphs are also static when it comes to shapes, in that if you run the same graph-building code with input tensors that have slightly different shapes, we will generate different graphs. The more we know about the shapes, the more we can specialize to generate a faster graph that knows more information statically, or can do some assertions and validation statically instead of having to do them at runtime. So tf.function has a choice: either trace a function once and raise an error if we think you're calling it with arguments that are incompatible with the arguments we used to trace, or accept that we'll likely need to trace the function many times and set a policy for how we do this. We chose to go with option two. And the policy mostly looks like this: we use tf.nest, the nest library, to unpack your inputs. Once we've unpacked your inputs, we split them into Python objects and tensors. We replace the tensors with TensorSpecs -- tf.TensorSpec in the public API -- each of which just has a shape, a dtype, and a name. And then we reassemble the whole thing into a structure and use that as a key into a dictionary.
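A small sketch of how that cache key behaves from the user's side (hypothetical function): the Python print only fires when a new trace happens.

    import tensorflow as tf

    @tf.function
    def square(x):
        print("tracing with", x)   # Python side effect: runs only while tracing
        return x * x

    square(tf.constant(2.0))         # traces: new (shape, dtype)
    square(tf.constant(3.0))         # no retrace: same shape and dtype
    square(tf.constant([1.0, 2.0]))  # retraces: different shape
    square(2)                        # Python value is part of the key: traces
    square(3)                        # different Python value: traces again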
AUDIENCE: What's the name here?

ALEX PASSOS: If you are in eager mode, there is no such thing as a name. But if you're building a graph, we do look at the name of the tensor in graph mode, just to try to preserve a little more information. So we use this whole structure as a dictionary key. We actually can't quite use the exact structure: we replace lists with tuples and dictionaries with lists of pairs, and a few other things, just to make sure Python doesn't yell at us for trying to put unhashable things in a dictionary. But the idea, from a mile away, is that for anything you pass to a tf.function that is a Python object, if you change it in a way that a Python dictionary would notice, that will trigger a tf.function retracing. If it's a tensor, though, we explicitly do not key on the value of the tensor; we key only on its shape and type, and retrace only when the shape and type change. This is a little controversial, because there are some types of Python values, specifically scalars and NumPy arrays, that you might want to treat as tensors. And this is a decision we might want to revisit at some point, because it leads to a lot of problems for our users. But for now, we're conservative and retrace when you change the identity of Python values.

As I was saying, there are some downsides to this choice, and the two biggest ones are these. The first is that we do too much retracing, as I just mentioned. The second is that shapes are hard. Specifically, the more static shapes we have, the more efficient we can make our graphs. And in practice, due to the hardware we use and the way we write things, you want your graphs to mostly have static shapes in them so that we can be as performant as possible. However, it's often very convenient to have things like a dynamic batch size or a dynamic sequence length, and those things might not even incur very large performance penalties. On GPUs, for example, dynamic batch sizes and static batch sizes tend to consume about the same amount of time -- not necessarily on TPUs. However, if we try to relax the shapes as you call the function, retracing your code with a shape that has partially unknown dimensions might cause some graph-building code to explode. So we have a few ways for you to control this now, and we'll try to refine the default policy to make it better. But you can always choose the extremes. We give you essentially three knobs to control tracing. If you have a tf.function that you built over a Python function and you want to force it to retrace, you can always build another tf.function object -- two separate tf.function objects share no state. This is a cheap way to force a retrace. It gets you around the limitation on creating variables, it gets you around the shapes, and things like that. You also have the flip side of this: to prevent retraces, you have two options. One is to call get_concrete_function on the tf.function object; you pass it a signature, and you get back a function that you can call with that signature, and it will specialize on the particular properties of that signature. Or you can pass a signature when you define your tf.function. That also works.
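As a sketch of those two retrace-prevention knobs (hypothetical functions): one pins the signature at definition time, the other asks for a single concrete specialization.

    import tensorflow as tf

    @tf.function(input_signature=[tf.TensorSpec(shape=[None, 3], dtype=tf.float32)])
    def scale(x):
        return 2.0 * x

    scale(tf.ones([4, 3]))    # matches the signature; never retraces
    # scale(tf.ones([4, 5]))  # would raise: incompatible with the signature

    @tf.function
    def add(a, b):
        return a + b

    # Pin down one specialization explicitly instead of relying on retracing.
    concrete_add = add.get_concrete_function(
        tf.TensorSpec([None], tf.float32), tf.TensorSpec([None], tf.float32))
    print(concrete_add(tf.constant([1.0]), tf.constant([2.0])))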
And finally, you have an experimental knob, whose behavior is likely to change in the future, that if you set it to true will try to relax shapes for you. So if you know you're likely to have shapes with a dynamic batch size or a dynamic sequence length, and you're in one of the cases where running your graph-building code with partially unknown shapes is perfectly fine, then you can set this to true and enjoy fewer retracings. We might need to add more knobs, but I think this is mostly fine, and we'll iterate on and refine the existing tracing policy to make it better.

AUDIENCE: [INAUDIBLE] tracing happen when you call the function the first time, or every time?

ALEX PASSOS: We try to trace as little as we can. So the first time you call it, we clearly have to trace; there is no alternative, because you don't have a graph. The second time you call it, if you call it with the same Python objects and with tensors of compatible shapes and types, we will not retrace. But if you change the shapes and types of the tensors, then we're likely to have to retrace.

AUDIENCE: Question. So the trace cache key -- does it include the global variables accessed by the function, or just [INAUDIBLE]?

ALEX PASSOS: We do not put the variables accessed by the function in the cache key, because I don't know how we would check this without running the Python code.

AUDIENCE: So which means it may miss a change in the type of [INAUDIBLE] we accessed [INAUDIBLE]?

ALEX PASSOS: Yes. Any kind of reliance on global Python state that is not an argument to the tf.function might lead to breakage. Python is a funny language, because you can actually check this: you can take a Python function and get the transitive closure of all the modules and objects it has access to. The problem is that this ends up being thousands and thousands of symbols, so we can't feasibly check whether the value of any one of those has changed between function executions. So this is kind of best effort, and again, a little bit of a caveat. If you have global state that you want the tf.function to depend on, put that state in a tf.Variable, because if you change the value of a tf.Variable, the tf.function will see it.

AUDIENCE: Another question -- so the actual tracing is [INAUDIBLE] by running that function. Certain functions have side effects, say [INAUDIBLE] file. Can this be executed twice?

ALEX PASSOS: Again, your Python code is only going to get executed during tracing. So if there are side effects that you care about, you should make those side effects TF side effects. Don't use Python's file writing, use TF's file writing. Don't use Python's file reading, use TF's file reading. Don't use Python's random number generation, use TF random number generators. In general, making everything a TF thing is the way to make it work reliably in tf.function. And part of this is due to Autograph. I'm not going to talk about Autograph here -- there will be a separate training talk on it, because it's full of very interesting and cool little bits.

AUDIENCE: I don't get how using the [INAUDIBLE] version of these APIs [INAUDIBLE] side effects [INAUDIBLE].

ALEX PASSOS: If you use the TF version of this API -- if there's a TF op to write to your file or to generate a random number -- and you run it in graph mode, we don't do anything; we just create a symbolic expression that, when evaluated, will have the side effect.
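A small sketch of that distinction (hypothetical function): the Python print is a trace-time side effect, while tf.print becomes a node in the graph and runs on every call.

    import tensorflow as tf

    @tf.function
    def f(x):
        print("python print: runs only while tracing")
        tf.print("tf.print: a graph op, runs on every call")
        return x + 1

    f(tf.constant(1))   # both messages appear (trace + execution)
    f(tf.constant(2))   # only the tf.print message appears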
AUDIENCE: So it actually doesn't execute it?

ALEX PASSOS: It doesn't execute it in graph mode. In eager mode, it will execute it. But in graph mode, it just builds a graph that, when executed, will have the desired side effect.

AUDIENCE: So just to [INAUDIBLE] decide to use Python [INAUDIBLE] it's undefined behavior, essentially.

ALEX PASSOS: Yes, it's undefined behavior. And if you want to define it, you need to control how often you trace the function. You can also choose to force Python things to happen at function execution time using tf.py_function or tf.numpy_function, by explicitly delineating the Python code that you want to be dynamic. This has some limitations, though, because we're not very good at shipping Python code from one host to another host. So in general, models that rely on py_function or numpy_function are not serializable and do not run well in distributed settings.

So how do we make this work in practice? Here I want to give you a walkthrough of interesting pieces of the code -- some screenshots, some bits lightly rewritten for readability -- so you know what to look for if you want to understand or change the behavior of tf.function. The first structure that I think is particularly interesting is the FuncGraph. FuncGraph is a subclass of the TensorFlow Graph that overrides a lot of interesting behavior. It's where the code to do automatic control dependencies lives, it's where the code to do Autograph lives, and it's also where we do closure capturing. Closure capturing is maybe the most interesting part, because in normal TensorFlow graphs, if you try to use a value from outside the graph, you immediately get an error. But with functions, as with functions in most programming languages, you expect to be able to use values from the defining context. So FuncGraph has some logic to do that. It has two capturing modes, capturing by value and capturing by reference. By default, we capture by reference, but you can turn capturing by value on. The way we do this is that when you try to create an operation in the graph, we look at the inputs and capture them if we have to. This is done by creating a placeholder with the same shape and dtype as the tensor you're trying to capture, and storing inside the FuncGraph a map from the tensor that we captured to the placeholder we created, so that later, when we call the function, we feed in that tensor as the value for the placeholder. We do this for every external value: constants, eager values, graph values. The way this is set up, capturing a value is visible to the gradient code, so you can differentiate a function with respect to its implicit variable captures. There are a lot of subtle little issues there to get right. But the general idea is that we create placeholders of the proper shape and type, and at function call time we pass the original argument as the value for the placeholder. And the nice thing about this is that at the point where you try to pass the original argument as the placeholder's value, if that is happening inside another FuncGraph and the original argument does not belong to that graph, that will recursively trigger another capture.
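Concretely, a minimal sketch of what capturing by reference looks like from the user's side (hypothetical names):

    import tensorflow as tf

    v = tf.Variable(2.0)      # lives outside the function
    c = tf.constant(3.0)

    @tf.function
    def scale(x):
        # v and c are captured from the defining context; inside the graph
        # they become placeholder-like inputs fed at call time.
        return v * x + c

    print(scale(tf.constant(1.0)))   # 5.0
    v.assign(10.0)
    print(scale(tf.constant(1.0)))   # 13.0 -- captured by reference

    with tf.GradientTape() as tape:
        y = scale(tf.constant(4.0))
    print(tape.gradient(y, v))       # 4.0 -- gradients flow to captured variables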
And this way, value capture propagates properly through functions that call functions that call functions, and all of them correctly handle differentiation and a few other things. So it should be mostly seamless.

Another thing that lives there is the core code to take your Python function and build a FuncGraph from it. As you can see, it has many, many options; there are all sorts of ways you can control this. You can override shapes, you can capture by value, you can add automatic control dependencies, you can do Autograph, you can pass a signature. But this is the general workhorse that we use every time we're tracing the Python code you pass in to create a tf.function. One of the particularly important things it does is automatic control dependencies. What this is trying to do is enforce that, inside a tf.function, program order is execution order as far as the TensorFlow runtime is concerned. We also try to do this in a way that does not harm performance. If you remember last week's video on resources and variables, we're moving to a model in TensorFlow where all stateful ops manipulate an explicitly named resource. And what that lets us do is this: the first version of the automatic control dependencies code was this. Now it's a lot more complicated, because it tries to handle ops that are stateful but do not have an explicitly declared resource, and it tries to handle control flow v1, so it's far messier. But essentially, all you do is iterate over all the ops in the graph and look at every input of each op. If an input is a resource, you add a control edge from the last op that used this resource to this op, and you update a map to make this work. And finally, at the bottom of the function, you return, for every resource, the last op that used it, so that we can make those operations control outputs of the function. These control outputs are important because we can then tell the TensorFlow runtime not to accidentally prune the side effects that we intend to happen. And if you have operations in a function that no output depends on and that no side effect depends on, we know that we can prune those. This also means that, as we move to a model where TensorFlow is more compiled instead of interpreted, the fact that these control dependencies enforce a linear order makes it easier to take TensorFlow code and turn it into some intermediate representation that will feed into a compiler, which you might hear about in the MLIR talk. So this just clarifies the semantics and removes a lot of weird behaviors you can get in TF1. What this assumes is simply that we want to execute code in program order, and that's a far, far easier thing to guarantee than the complicated pruning and partitioning process we have to rely on in TF1.

The FuncGraph, though, is a Python-only construct. To actually turn a FuncGraph into something we can execute, there's a thing in the C API that takes a TF_Graph and generates a TF_Function out of it. It, again, has a bunch of options so you can control your inputs and your outputs, uniquify names, et cetera. But if you're looking for where we turn a TensorFlow graph into a function, this is the entry point you want to look at in our C API.
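A small sketch of what automatic control dependencies buy you in practice (hypothetical names): unlike the TF1 pruning example earlier, the assignments whose return values are discarded still run, in program order.

    import tensorflow as tf

    v = tf.Variable(0.0)

    @tf.function
    def update_and_read():
        v.assign_add(1.0)   # return value discarded, but it's stateful: it must run
        v.assign_add(1.0)   # ordered after the first assign via a control edge
        return v * 3.0      # reads the value after both assignments

    print(update_and_read())   # 6.0
    print(update_and_read())   # 12.0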
However, now we need to call those functions we've created. Technically, you can call a function just like you would call any op. Once you've registered a function with a graph's function library or with the eager context, you can just use TFE_Execute or TF_NewOperation to define a function call, just like you would define any operation execution. Under the hood, this is going to use an op named StatefulPartitionedCall. So if you look at the source code for that operation, you will see the place where we partition the function graph across multiple devices, in case you want to run your function over multiple GPUs or TPU cores. It's also where we run a lot of [INAUDIBLE], which is -- I hope there will be a separate training talk about it -- this cool thing that runs all sorts of graph optimizations and backend-specific rewrites and things like that. And the place in the Python code where we take you trying to call a function and turn it into this partitioned call is a class called _EagerDefinedFunction. So if you search for that in the TensorFlow source code, you can read how exactly we set up the function call: correctly handling things that were captured, correctly handling gradients, and so on.

Differentiating functions, then, is built on top of the EagerDefinedFunction. The idea is that if you have a function you want to call, and that function is differentiable, we need to generate three things under the hood. One is what I call the inference version, which is just a thing that runs the function and returns the results. That's what you're going to call if you're not trying to differentiate through it. But because we do reverse-mode automatic differentiation in TensorFlow by default, to differentiate a function we need to run a backward version of it, and the backward version might need to use intermediate values from the forward pass. So what we do is generate a clone of your inference function that returns all the intermediate values the gradient code is likely to need, and then we make another concrete function, which is the backward thing. The interesting thing is that the forward and the inference versions are just defined functions, things you can simply call, but the backward thing is a concrete function, so it also has the differentiation code, which means you get to differentiate the gradient of a gradient of a gradient of a function, because it recurses in that direction. So we get a closed system for automatic differentiation, which is really important, because as machine learning research moves forward, more and more algorithms end up relying on limited forms of higher-order differentiation.

One thing you might be thinking about right now is that a particularly important feature for a lot of people who rely on reverse-mode autodiff is to not keep all the intermediate state alive between the forward and backward pass. You could make this a feature of tf.function, but I think it should be implemented separately, built on top of tf.custom_gradient, since this is just another customization of the differentiation code. And it's very easy to do this generically; it does not depend on tf.function.
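As a sketch of that recompute-the-intermediates idea, here is how tf.recompute_grad can be used (hypothetical computation); it wraps a function so its intermediates are recomputed during the backward pass instead of being kept alive:

    import tensorflow as tf

    def block(x):
        # Some computation whose intermediates we'd rather recompute
        # in the backward pass than keep in memory.
        for _ in range(4):
            x = tf.nn.relu(tf.matmul(x, tf.ones([64, 64])))
        return x

    block_cheap = tf.recompute_grad(block)   # same math, less saved state

    x = tf.random.normal([8, 64])
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = tf.reduce_sum(block_cheap(x))
    print(tape.gradient(y, x).shape)         # (8, 64)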
This is being added to the v2 API now, even though it has existed in [INAUDIBLE] for a while. So the rematerialization, recompute-gradients thing is not going to go away, but it's completely orthogonal to tf.function, which makes me a little happier, because smaller pieces that compose tend to be nice.

On top of this differentiation code, we have the code that goes from an abstract function to a concrete function, which is the thing that does the function cache. And here we start getting into a little bit of cruft, because for a while the tf.function code base was in contrib eager [INAUDIBLE], and so it had this class, function.Function, that did the cache-key logic. The difference between the contrib eager version and the tf.function we have today is that we've fixed the story around variables to be much better behaved. So if you look in our code base, you'll see that the function.py file has a Function class which does the cache-key logic. The cache key is mostly implemented in C now: there's a C API called TFE_Py_EncodeArg that takes a bunch of arguments, replaces tensors with TensorSpecs, and uses this to form the dictionary key. We did this in C because it's a little faster than doing it in Python, and it gives us a little more control over how we handle lists and dictionaries and things like that. If you're interested, that's where you'd want to look.

And finally, the last bit of the pile of complexity we've been looking at is how we do variable lifting and initialization. This is handled in another class, confusingly also named Function, but in a different file, def_function.py. Internally, it calls the other Function class, so at least we have proper layering, but we do need to clean up the naming of these things a little bit. The variable lifting is a little tricky, so I'll give you a quick walkthrough of how it works. First, we define our own type of variable, which is a subclass of the normal tf.Variable. The idea is that when you create this variable, it inserts into the graph a control flow conditional: if the variable is not initialized, initialize it. This way, because the graph at variable creation time has this conditional, by the time you get to use the variable, you've already passed it. So if you run this graph, you're guaranteed to initialize the variable only once, and you always see its value initialized. But this is a little sad, because if you've used TensorFlow for a while, you know that conditionals are somewhat expensive. So what we do is the thing I said earlier: we trace your code twice. First, we trace under a scope that captures all the variables that are created -- these are the variable-capturing scopes I went over a little bit in last week's video. The second time, we trace under a scope where, if you try to create a variable, we raise an error. And this is how we enforce the policy of only letting you create variables once. Now that we have these two versions of the function, one that we call the stateful fn and one that we call the stateless fn, we can put a cond in there: if all the variables are initialized, we call the one that does not do any of the complicated computation, while if any of the variables is not initialized, we have to, for safety, call the function that has all the complicated bits. But now that we've built this monstrous graph that has your function inside of it twice -- once with conditionals, once without, and the whole thing inside a conditional -- ideally we would never want to execute it, because conditionals, again, are expensive.
And they have this particular property, in the current version of the TensorFlow runtime, where you pay for the nodes that don't get executed. So what we do on top of that is try to lift the initialization code out. We look at every variable and call this lift_to_graph thingy, where we try to copy all the initializers of all the variables into a separate graph. And this copy is set up in a way that raises an exception we control if we ever find a variable whose initializer depends on the value of another variable, or depends on a function argument, or depends on something we can't cleanly isolate. So for the common case, where your variables' initializers are all independent of each other, we don't actually run any of those complicated graphs; we just run the simple graph, because we can run the lifted initializers once. It's only if we cannot lift the initialization graph that stuff breaks. And this lift_to_graph thing is actually a pretty nice internal TensorFlow library that you can use to manipulate graphs. You give it some tensors, a graph, and some sources, and it will walk the graph backwards, copy everything needed to compute those tensors into your target graph, and return a map from every source tensor to the target tensor of the copies it did, so you can use this map to run things in the target graph as if they were in the source graph.

So that's mostly it for the Python-level runtime code for tf.function. I'm not going to talk about the TensorFlow runtime today; that's for another meeting, because we're mostly out of time. Any remaining questions before we go away?

AUDIENCE: One question on performance. Let's say we have a TF graph. It's got multiple [INAUDIBLE] so I can run different subgraphs, and they have a lot of overlap. Now if I compare that to a set of independent functions, things like the size of the whole model will go up. [INAUDIBLE]

ALEX PASSOS: So in principle, the size of the whole model could go up. But if you were to convert this to [INAUDIBLE] tf.function, hopefully you'd convert it to functions that call each other. And as with normal programming-language code, as soon as you have functions that call each other, you can control the complexity of your code, especially if you have repeated calls to the same function, and you can end up with a smaller size. So as far as size goes, in practice, using tf.function often leads to smaller graphs, because you often end up calling the same function multiple times. At execution time, we can inline those functions for performance; I think right now we essentially always inline tf.functions. So if you are really calling the power set of the nodes in your graph, then you would see an explosion. But in most of the models I've seen, we can avoid this. As far as I can tell, the performance overhead of using tf.function now comes from the fact that if you're creating variables, you need to trace your code at least twice and generate all those extra conditionals. And I think that with a little more engineering we can make this happen only when it's strictly necessary instead of always, and make the fallback path optional. If we have no more questions, then I think we're good. Thank you. [APPLAUSE]