ALEXANDRE PASSOS: Hi. My name is Alex. And I'm here to tell you today about resources and variants. And really this is a talk about state in TensorFlow, and stuff that got accidentally represented as state in TensorFlow for far too long. So what is state? I would love to be able to stand here, or rather sit here, and tell you that an operation is stateful if either executing it has a side effect, or its output depends on something other than the value of its input. But this is not what TensorFlow means by stateful. Sadly, TensorFlow goes by the [INAUDIBLE] notion that the meaning of a word is defined by its usage. So state in TensorFlow is defined by this one bit that gets flipped and means all sorts of very interesting things. So, for example, this slide is wrong. tf.print is stateful. It has a side effect. Yay. tf.data.Dataset.from_tensor_slices has no side effects, because the dataset operations are value types, and they're stateless. And yet, that kernel is marked as stateful, because one of the effects of marking something as stateful in TensorFlow is that it disables constant folding, and constant folding can be buggy with datasets. Iterators, on the other hand, are stateful. This might lead you to think that there is some meaning to this. But there are also some things in TensorFlow that could go either way. So to differentiate while loops, we have stacks, so that when you're doing the forward pass of the loop, you push things onto a stack, and when you're doing the backward pass, you pop things from the stack, so you can look at intermediate activations and stuff. And those things were stateful in tf V1, but they're stateless in tf V2. Tensor lists, which you can use to aggregate stuff from many iterations of a loop into a single value, or do the reverse, are also stateful in tf V1 and stateless in tf V2. AUDIENCE: Is that because we didn't invent the stateless way until later? ALEXANDRE PASSOS: Because we did not invent the stateless way until later. Yes. So I want to spend the rest of the talk talking about how statefulness is represented in tf V1, some of the problems with that, how we're fixing those problems in tf V2, and how we can deal with state, and also with things that are not necessarily easily representable with dense tensors. So how is statefulness represented? In one of two ways. The most obvious way is that if you go into the TensorFlow source code, and you find where ops are registered, you will see this bit, SetIsStateful. And the definition of state in TensorFlow is that op defs that have this bit set are stateful. And all sorts of places in the runtime are going to look for that bit and behave differently if that bit is set. And people set the bit because they want any one of those behaviors. And this is something we need to clean up. And I think we might have a chance to clean this up with the MLIR dialect of TensorFlow, which is going to have finer-grained bits. But until then, we're stuck with this one bit that means too many things. So among other things, what does this bit mean? It means that TensorFlow will not do constant folding. This includes the two or three separate systems in TensorFlow that do constant folding. All of them know how to bypass stateful operations. Similarly, there are at least two different places in TensorFlow that do common subexpression elimination.
And they refuse to do common subexpression elimination of stateful operations, which is very good, because if you were to do that, and you have a neural network with many layers, and your layers are initialized from a random op, all of the layers with the same shape would be initialized with exactly the same random values. AUDIENCE: And all your prints would potentially be collapsed into a single print. ALEXANDRE PASSOS: Only prints of the identical string would be collapsed into a single print, because otherwise we would have enough information to disambiguate those. But statefulness also means some things that are not very obvious at all, like the op kernel instances that the runtime uses to represent the computation to run are reused across sessions for stateful ops that have the same name. And there is also a somewhat long tail of obscure behavior changes, like parallel_for behaves slightly differently for stateful operations. And people are known to set the stateful bit for any one of these reasons and more. The other way of representing state in tf, which we're trying to get rid of in tf V2, is the notion of a ref tensor. And going back to the variable op, it is this thing here, where you can say that a tensor is either of a dtype, or of a ref of that dtype. And the reason why we did that is that it's very convenient in many cases to be able to keep information in the runtime that persists across calls to session.run. Specifically, the variables-- if you had to write your code like this, where on every session.run you'd feed your variables and then you'd fetch them back, and you were doing some kind of distributed training, you would have so many network round trips and so much extra latency that it would be completely impractical. So the idea of the variable op, which is the thing that motivated the ref tensor, is that it's like a constant, but mutable. And if you dig through the runtime, you'll find this piece of code, which I think is the most concise representation I could find of how we represent the distinction between a ref tensor and a non-ref tensor. This is what the input to an op kernel looks like. And it's essentially a manually implemented absl one-of, where it's either a manually constructed tensor-- and the constructor isn't run there, just so we don't try to initialize it in case we're not going to need it-- or the pair of a pointer to a tensor and a pointer to a mutex. And if you've ever programmed in C++, you should be terrified right now, because you see a pointer, and you see no comment about who owns this pointer, and what the lifetime of that pointer is. And a good third of the issues with ref variables come from the fact that it's been impossible or very hard to retrofit into the system a coherent notion of ownership of this pointer that's going to be memory safe. But that's not all. The way the ref variables work is that you have a graph that looks like this. You have this variable node whose output is a tensor that can change, and you can feed it to an operation that mutates it, like assign, or you can feed it to an operation that does not mutate it, like identity. If you feed it to an operation that does not mutate it, like identity, the TensorFlow runtime will silently cast that Tensor* to a Tensor. So it makes another tensor object that aliases the buffer pointed to by that tensor, and just keeps going. So the reason why I like this graph is that it's short. It's simple.
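Here is a minimal sketch of that kind of graph, assuming the TF1-style API reachable through tf.compat.v1, with a ref variable forced via use_resource=False (the variable name, the assigned value, and the added constant are illustrative, not from the talk):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# A ref variable, an assign that mutates it, and an identity that reads it.
v = tf.compat.v1.get_variable("v", initializer=1.0, use_resource=False)
assign = tf.compat.v1.assign(v, 2.0)
read = tf.identity(v)  # the ref tensor is silently cast to a plain tensor here
with tf.control_dependencies([assign]):
    out = read + 1.0   # control dependency from the assign to the add

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    print(sess.run(out))  # before or after the assign? see the discussion below
```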
If you look at every single gradient update that we use for training, it kind of looks like this. But it's also kind of tricky. So we have, I don't know, like 20, 30 people in the room now. Can I get a show of hands on who thinks that the result of the print is the value after the assign? No one. AUDIENCE: What do you mean? The print? ALEXANDRE PASSOS: So this graph, it has an add that takes as input the identity of the variable, and some constant. And it prints. And it has a control dependency from an assign that mutates the value of the variable to the add. So how many people think this is enough to ensure that add will see the value of the variable after the assignment? OK. About five or six. AUDIENCE: Yes. ALEXANDRE PASSOS: How many people think that add will see the value of the variable before the assignment? About two or three hands. How many people think this is a segmentation fault? [LAUGHTER] No one. And how many people think it depends on things that are not written in the graph? AUDIENCE: 100. ALEXANDRE PASSOS: OK. So all of you have been bitten by this, because I got like 15 hands now. This is completely non-deterministic, and it depends on all sorts of runtime properties. For example, if everything is on the same device, and the assign does not change the shape of the variable, because of the way we do aliasing inside the TensorFlow executor, print will print the value after the assignment. However, if the add is on a different device from the variable, then most likely there will be an RPC, and add will sometimes see the value after, sometimes see the value before the assignment. There is one case where add is guaranteed to see the value before the assignment, which is if the assignment changes the shape of the variable. Because if the assignment changes the shape of the variable, due to intricate details of the implementation of Tensor and TensorBuffer, we do not change your existing tensor buffer. We just allocate a new one. And by the time the identity runs, it has already aliased the old tensor buffer. And you can get a seg fault here, as you might have guessed, if we're talking about string dtypes. Because add is defined for string types. And if you have two separate threads that are reading and writing to a string in C++, you're very likely to get a seg fault or some other weird behavior. So this is pretty complicated and kind of unworkable in the long term. You need to know about all sorts of things that are not well documented, and that rely on specific details of the implementation that are not guaranteed to stay stable. And if you were to try to design something like a compiler for TensorFlow, this would be really hard to make work. So we're not doing this anymore. And I'm going to spend the rest of this talk, hopefully, telling you about how we're fixing this and imposing some order onto the situation in tf2. And the interesting thing is that internally, the way variables have almost always been represented in TensorFlow-- since the first open source release, I guess-- is that the state has been stored in this resource manager object, which has a create, a lookup, and a delete method. And these can return some arbitrary type. We use some RTTI magic to make sure that the code is type safe at runtime. We even implement RTTI on compilers that do not have RTTI to make this work. RTTI, sorry, is a C++ thing for runtime type identification. And this is a perfectly reasonable API if you wanted to represent state outside of the graph.
So the idea that we had in tf2 is, let's use this to represent the state, essentially, as operations in the graph. So there are still some issues with the resource manager, like it's scoped to device objects. And device objects have a weird lifetime. Sometimes they outlive a session. Sometimes they do not outlive a session. And it's slightly different with eager execution. And this can be very surprising in some cases, both when you're doing parameter server training, and when you're not, and you accidentally find yourself doing parameter server training, and unintentionally sharing parameters between two models that are not supposed to. But overall, it's a reasonable API. So what we did is we created a tensor dtype, just like string, or int, or float, that represents the information you need to look something up in the resource manager. And we call this, creatively, DT_RESOURCE. The reason why this is a tensor is that this is just another value. So you can pipe it through a graph. You can stack things together. You can select them dynamically if you want. Or you can just use them statically. It's just a scalar most of the time. You can have non-scalar resource tensors, but most of the interesting operations just want scalars. And then you can use this tensor to manipulate a resource. So internally it's, again, just the information you need to make the lookup in the resource manager minimally type safe-- so the device, a container, a name. The container was this idea that we had originally that you would be able to run many separate models on the same parameter server and provide some kind of isolation, where you could reset the variables in one model, but not the variables in the other model. The way this was implemented made this very hard to use. And I know very few people who rely on this now. So these days, it's mostly just [INAUDIBLE]. But otherwise, it has a name and some information to validate that you're looking up the object of the right type. But that's all there is to it. And resources are special-cased in a couple of places in the runtime-- not as many as the stateful bit. And one of them is that if you create an op that specifically manipulates-- either takes or returns-- a tensor of a resource dtype, we mark it as stateful, because we assume that if you're asking for a key to something in a resource manager, you're probably going to monkey around with it. And this at least removes the redundancy, because otherwise you would have all these ops that would take resources, modify state in a resource manager, not be marked as stateful, and you would have to wait until they got accidentally constant folded together to see something break. And the second one is that the placer will always co-locate operations that manipulate a resource with the device where the resource is. And this is because you can't really modify a structure that's in another computer without running code on the other computer. But mostly, resource handles are safe in the runtime. And the interesting thing is that now our graph that was very hard to read looks like this. You have this VarHandleOp that represents the resource handle, the key. And you can pass that key to your assignment. You can pass that key to your read operations, et cetera. And now I'm pretty sure everybody should agree with me that this graph, as written, has to return the value of the variable after the assignment. Otherwise, it's a bug. And this is true. There is no weird non-determinism.
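For contrast, a minimal sketch of the resource-variable version, assuming the TF 2.x API, where tf.Variable is a resource variable backed by a VarHandleOp (the values are illustrative):

```python
import tensorflow as tf

v = tf.Variable(1.0)  # resource variable: the handle is an explicit tensor

@tf.function
def assign_then_read():
    assign = v.assign(2.0)
    with tf.control_dependencies([assign]):
        # Reads through the handle are guaranteed to see the assignment above.
        return v.read_value() + 1.0

print(assign_then_read().numpy())  # 3.0
```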
It doesn't matter whether the shape changes or doesn't change, what dtype you're dealing with, or what device things are on. Also, there is no way to make this seg fault, I believe. So it's substantially nicer. There are still some subtle things in here. One of them is ResourceGather. It's an operation that you would think, why would I need this? Because what it does is effectively what read plus gather do, but it does it in a single op. And the reason why we have this is that, if you think about it, if this graph is to keep forever the meaning of always reading the variable after the assign-- and if you had flipped that control dependency between read and assign, of always reading the variable before the assign-- you might have to make a copy to ensure that that memory is preserved. And if you have a very large vector of embeddings, making copies of it can be very expensive. And we would like to provide good performance. So really this resource thing is more a specification of the meaning of a graph that has these operations, and less the specific details of how they're implemented. It's possible to have many valid implementations of this, and they're going to have different performance characteristics. So, for example, if we lower our graphs to XLA for compilation, XLA can take a cluster of ops that has a bunch of reads and writes to variables, look at the state of the variables before the cluster, figure out what the state of the variables should be after the cluster, and rewrite it to be a bunch of reads, some stateless computation, and then a bunch of assigns. And this correctly preserves the semantics of these operations. And it's a perfectly valid way to do this. We don't always run XLA, though. And if you start thinking about this, there are two relatively straightforward ways you could implement variables. And they have pretty strong performance trade-offs. A very obvious implementation is copy-on-write, where we would copy the buffer for a variable every time we write to it. Another one is copy-on-read, where the read operation is going to copy, and then the assign operation is just always going to mutate. The interesting thing is that with copy-on-write, if all you're doing is your standard SGD training, where you read a bunch of variables in the beginning, do a bunch of forward and backward computation, and then you write to the bunch of variables, you can do this with zero copies. Because by the time you're writing to the variables, there are no outstanding reads left. So yay. Similarly, if you have embeddings, and you are sparsely reading a few rows from your variable in arbitrary, random order, and then later on you're going to sparsely write to those rows, we can do this with no copies if we have copy-on-read. I mean, no extra copies, since the reading would have to copy anyway, because it's reading in an unstructured way that we couldn't represent with strides or something like that. So which one do we choose? Effectively, we chose both. And we did this by storing a bit on variables and having variables always start in copy-on-write mode. And as soon as you do any sparse operation on a variable, we grab an exclusive lock, make any copies that we need, and put it in copy-on-read mode. This works reasonably well for both the you-only-use-this-variable-in-dense-operations case and the you-only-use-this-variable-for-embeddings case. It's not necessarily generally the best idea.
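A short sketch of the two access patterns this policy is trying to serve, assuming the TF 2.x variable API (the shapes, the learning rate, and the scatter_sub update are illustrative):

```python
import tensorflow as tf

w = tf.Variable(tf.zeros([4, 3]))

# Dense pattern: read the whole variable, compute, write the whole variable back.
# Copy-on-write can serve this with no extra copies.
grad = tf.ones([4, 3])
w.assign(w.read_value() - 0.1 * grad)

# Sparse (embedding) pattern: gather a few rows, later scatter updates to those rows.
# Copy-on-read can serve this with no extra copies.
rows = tf.constant([0, 2])
picked = tf.gather(w, rows)  # reads just those rows (the fused ResourceGather path)
w.scatter_sub(tf.IndexedSlices(0.1 * tf.ones([2, 3]), rows))
```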
So I expect this copy-on-write/copy-on-read policy might have to change and become more refined over time. But again, this is just an implementation detail. And this does not affect the correctness of the programs that are running on TensorFlow. So I think it's a big improvement of-- AUDIENCE: Can I clarify? ALEXANDRE PASSOS: Yes. AUDIENCE: So I thought when we read something, effectively it makes a copy. It seems like this copy is specifically in the context of [INAUDIBLE]. ALEXANDRE PASSOS: It pretends to make a copy. So the definition of a read is this: an operation that looks at the output of a read is guaranteed to see the effect of every operation that had an edge pointing into the read, and not see the effect of any operation that had an edge pointing out of the read. You can implement this by making a copy on read. You can also implement this by making a copy on write. You can also implement this in more complicated ways that might never make a copy. AUDIENCE: So our default copy-on-write looks at the reference count, and if it's one, just updates in place. And our default read operation just increments the reference count. ALEXANDRE PASSOS: Yes. AUDIENCE: The default copy-on-write implementation. ALEXANDRE PASSOS: The copy-on-write semantics do that. And I assume we're going to eventually switch to more complicated policies. For example, we could look at the graph, and then decide what policy we're going to use to write the variables in this graph. Or we could let users configure this. There are many options here, but ideally, we should be able to implement all of them without requiring that users change the graph structure to get better performance or to get correctness of their behavior. And this is what's important about this, because this means that we get to fundamentally and dramatically change the back end, like use a compiler, and not have to worry about preserving bug compatibility-- like what happens if you alias the output of identity onto another variable, or something like that. So far I've mostly focused on how the runtime treats variables. But the same fundamental pattern of a handle tensor and operations that read and write to it is used for all sorts of other bits of runtime state in TensorFlow. This includes the dataset iterators, FIFOQueues, HashTables, and a few more things that I have forgotten. AUDIENCE: Are mutexes resources? ALEXANDRE PASSOS: Mutexes, they're a resource. But they also have a variant that represents the mutex lock object. So it's a slightly funner situation. But as far as the resource part of the mutex is concerned, it's, again, a mutable resource tensor that has a handle. It has operations to modify it. So this is nice. And this is just essentially what the runtime looks like. And if you have this picture in your head, you should be able to mostly predict the behavior of TensorFlow programs that manipulate state. One other bit of TensorFlow is shape inference. I'm sure if you've looked at TensorFlow op registrations, you've seen annotations like this where we set a shape function. The result of shape inference is not persisted in the graph. It's ephemeral. It's produced every time we create a graph, or while we're importing a graph. But this is very, very useful to ensure not only that we know how to interpret the graph correctly and that the graph is valid.
But this is very helpful during the graph building process, where user code can inspect the inferred shapes of nodes, and make different decisions as to whether things can be dynamic or static in the graph. And if all the resources are scalars, this would make it hard to do shape inference on stateful operations that manipulate resources. So we did kind of a hack that should be improved, and added a side channel to the shape inference process, this output handle shapes and types, which can store an arbitrary list of shape and dtype objects. And different resources and variants are going to assign different semantics to this. Operations like cast that do not affect the shape just pass the shapes and dtypes through, and then operations that are aware of what the resource handles are doing are going to look at this and assign meaning to them. So variables just store a single shape and dtype there, for the value of the variable. Tensor lists store a shape and dtype there, for the shape and dtype of the elements in the tensor list. Iterators store the shapes and dtypes of all the tensors that you're going to get when you call get next on the iterator, so that we can properly do shape inference on those graphs. So now that you mostly have a reasonable picture of what resources look like in the runtime, I'd like to pop the stack and talk a little bit about the Python side. This is going to mostly focus on variables, because I think there are a few interesting things in there that will, again, generalize to other bits of the runtime. The first one is that if you've used TensorFlow before, you know that variables act like tensors. You can pass them to operations. You can use the operators on them. And part of this reason is historical. I think the first implementation of Variable in TensorFlow was literally just the return value of the Variable op. And that happened to be a tensor of reference dtype. Later we felt the need to replace that with a class. So we worked somewhat hard to make that class behave exactly like a tensor. And sometimes library writers downstream from TensorFlow want to have their own types that behave like tensors, or behave like variables. So how do you do this? And I strongly believe this is all you need. The first thing to do is you need to make your type convertible to a tensor. So there is a tf.register_tensor_conversion_function that takes the type and a function to convert that type to a tensor. In the case of a variable, it just reads the value of the variable. Easy. There are some special cases in there to deal with reference types that are no longer needed, thankfully. Another thing that you need to do is register your type as a dense-tensor-like type, which means that implicit stacking-- just putting many instances of that type in a list-- will work, by silently reading them and then calling stack. Then you need to overload all the operators. And if you look, there's this method, overload all operators, in the tf.Variable class that has an implementation for this, which steals all the operator overloads from Tensor. And there is a rule in TensorFlow that session.run is not allowed to add nodes to the graph. This can catch all sorts of terrifying bugs. So it's good that we have this rule. But if you want to be able to fetch the value of a thing, then you need to implement this _as_graph_element method, which session.run pokes at to see if it is there, and which is supposed to return a pre-existing tensor.
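As a concrete illustration of that first step, here is a minimal sketch, assuming TF 2.x and the public tf.register_tensor_conversion_function hook; the Box type is made up for the example, and the dense-tensor-like registration, the operator overloading, and _as_graph_element use internal hooks that are not shown:

```python
import tensorflow as tf

class Box:
    """A toy wrapper holding a value that should act like a tensor."""
    def __init__(self, value):
        self._value = value

def _box_to_tensor(value, dtype=None, name=None, as_ref=False):
    del as_ref  # ref conversion is a legacy concern; ignored here
    return tf.convert_to_tensor(value._value, dtype=dtype, name=name)

tf.register_tensor_conversion_function(Box, _box_to_tensor)

b = Box([1.0, 2.0, 3.0])
print(tf.reduce_sum(b))  # Box is implicitly converted to a tensor: 6.0
```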
And so, for _as_graph_element, variables have to record a tensor that is going to be the result of reading them. By storing it there, you can use session.run to fetch them. There is also one more tricky bit about the Python implementation of variables that you might need to know, which is that with ref variables, because they can just convert to a ref tensor, the following works: you can take the return value of an assignment operation, call another assignment operation on it, and do that as many times as you want, because assignment operations chain. And with resource variables, clearly, the assignment operations don't have a return value. Because if you were to return something like the handle, the handle is useless. It's the same as the input. No point in returning that. If we were to return the value of reading the variable, now that's an operation that might potentially be very expensive. And you'd like to not read it unless you're going to need to read it. So we added this notion of an unread variable, which is a class such that if you have a control dependency on it, you just have a control dependency on an assignment operation. But if you try to read its value, it's guaranteed to read the value after that assignment operation. And because this acts like a variable, we can use it to make the chained assignment work, and a few other things. So if you see unread variables in your graph, you should know that this is the kind of thing you're dealing with. But if you've been paying attention, you've seen that the core set of operations for a variable does not self-initialize. And this is by design. A lot of the early use cases of TensorFlow were optimized for shared parameter server training. And in that case, when you have multiple parameter servers, and multiple workers all talking to each other, you might want to initialize variables from scratch. You might want to load them from a checkpoint. And depending on your training policies, you might want to do different things. So the graph is agnostic as to how you do those things. The runtime is agnostic as to how you do those things. And the execution, like session.run, gets to set the policy. This is very important, because we had to change the policy many, many times until we finally made it mostly bug-free in Estimator. But with tf2, as we're not necessarily saying that the default way to use TensorFlow is shared parameter server training, we went for ergonomics over safety. So in tf V2, mostly, variables are initialized on creation. In eager execution, this is very easy to do, because as soon as you execute the op that creates a variable, we initialize it for you. In tf.function, it can be a little trickier, because the initializer for a variable might be defined inside a function. And there are a few ways to handle this. And I'm going to go into detail on this in the tf.function talk. Similarly, variable sharing is a complicated issue. If you're doing shared parameter server training, you would like all the workers that connect to the same parameter server to see the same variable, so they can see each other's writes to those variables. And the way we did this was to say that variables are shared by name. So in tf V1, variable names are load-bearing. If you change, or edit, or modify the names of variables, you dramatically change the behavior of the program. This is in all cases a questionable decision, because variable names look very harmless when you read code. So in tf2, we chose to make names non-load-bearing.
Internally we're still using the runtime that assumes a load-bearing name, but we always use a UID to hide that fact. And if you want to have shared names for parameter server training, you can, because you can control that detail in the runtime. But the Python API no longer makes that straightforward. And now you might be asking, well, how would I be able to change the details of how variables are implemented? Another thing that we're adding in tf V2 is this notion of a variable creator that lets you control how variables are created. And Variable has a metaclass, so that when you call tf.Variable you might not actually get an instance of Variable. You might get an instance of some subclass of Variable that defines some specific behaviors. In tf V1, by default, you get a ref variable. In tf V2, by default, you get a resource variable. But in other contexts, you might get other instances. The metaclass code itself is not particularly interesting. It's just that you should probably know this exists if you're dealing with variables in Python. So, for instance, tf.function uses its own subclass of Variable, which behaves slightly differently from the V1 graph resource variables when it comes to initialization, so that it can capture initializers and things like that. And it's nice that we can keep that code encapsulated within the tf.function package, and not push its complexity out to the same Variable class that is used everywhere. Similarly, tf.distribute might need to create replica variables or mirrored variables with complicated read and write modes. And that complexity can be mostly centralized in the tf.distribute package instead of being spread out all over TensorFlow. So when you're inside a distribution strategy scope and you create a variable, your distribution strategy is probably setting up a variable creator that's going to do the right thing for you. And this is very important on TPUs, and in mirrored strategies, and stuff. So it's good that we have this kind of flexibility; there's a small sketch of the creator mechanism below. But just like how creation is configurable, deletion can be a little tricky. So a nice side effect of having load-bearing names for variables in tf V1 is that it encourages you to have very few of them, and to think very carefully about what each of them is called. So the set of variables throughout the lifetime of a TensorFlow program was mostly fixed, which meant that deleting variables was mostly not a big deal. And you could get away with very big, broad hammers for deleting variables, like session.reset. But in tf V2 it is very easy, with eager execution and functions, to create a lot of variables. And you can create temporary variables. So we do need to clean up after ourselves, or we're going to have memory leaks. And you'd think that since this is Python, you should be able to just override __del__ to get variables to clean up after themselves. But it's not that simple. It turns out that if you override __del__ on an object, that object can become part of a reference cycle-- and if you've ever looked at the implementation of tf.Variable, you'll see it has tens of members, so any one of them could point to something, that could point to something, that could point back to that variable. And if anything with a __del__ is part of a reference cycle, that entire cycle becomes uncollectable, and we have leaked that memory forever.
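Going back to variable creators for a moment, here is a minimal sketch of the creator mechanism, assuming the public tf.variable_creator_scope API in TF 2.x (the logging creator is made up for the example):

```python
import tensorflow as tf

created_names = []

def logging_creator(next_creator, **kwargs):
    # Inspect or rewrite the construction kwargs, then defer to the next
    # creator in the chain (ultimately, the default Variable class).
    created_names.append(kwargs.get("name"))
    return next_creator(**kwargs)

with tf.variable_creator_scope(logging_creator):
    v = tf.Variable(1.0, name="inside_scope")

print(created_names)     # ['inside_scope']
print(type(v).__name__)  # whatever Variable subclass the active creators chose
```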
However, for the reference cycle problem there is an easy workaround, which is that if you make an object that is guaranteed to only have one or two data members that cannot possibly be part of a reference cycle, you can override __del__ on that object, and then take the complicated object that might be part of a cycle, and store a pointer from that expensive object to the small, cheap object that knows how to do the cleanup. This does not make the cycle uncollectable, and still guarantees that the cleanup happens when the first object goes out of scope. Now, the worst that can happen is that a reference cycle means that your garbage collection is not immediate. It's just delayed until whenever the Python garbage collector decides to run. But that still guarantees correctness and a lack of leaks, even though it might be a little surprising that if you use sufficiently complicated objects, your GPU memory might take a while to be cleaned up. And you might need to use Python's gc module to force it to clean up after itself. And this pattern of making a deleter object is used everywhere in the TensorFlow code base where we have resources and we need to override __del__, just to ensure that we have orderly cleanup. So that's essentially all you need to know about resources to effectively use them in TensorFlow. And now I'd like to move on to talk about variants. And I put those two things together, because for the longest time there was a conflation of uses between resources and variants. Because resources were the easiest way to just hook arbitrary C++ code into the TensorFlow runtime. But it turned out that a lot of the things that we were using resources to do were better served not by arbitrary C++ code, but by stateless operations on immutable values. And why would you want that? Mostly because stateless things on immutable values are much easier to compile. And they're also much easier to differentiate through. And differentiation is something we really care about. So [INAUDIBLE] had the idea of making a separate dtype, variant, for immutable arbitrary C++ stuff. Its implementation is very, very similar to something like absl::any and other dynamic types in C++, with a few bells and whistles to integrate better with the tf ecosystem. So a canonical example of variants is the tensor list ops, which are used under the hood to implement stacks and tensor arrays in TensorFlow V2. But also they are one of the original motivating factors. And they look like this. You can have an op that makes an empty tensor list. Then you can have another op that takes a list and a value, and spits out a new list that represents the concatenation of those things. And then you have an op that takes a list, and spits out a slightly shorter list and the value that was removed from the list. And you can inspect those values and manipulate them. And the fun thing about these is that because these are all immutable, you can easily define their gradients. And if you think about it, the gradient of push is pop. The gradient of pop is push. The gradient of set item is get item. It mirrors very nicely. So you get code that's efficiently differentiable up to higher orders. And internally, the tensor list structure can be very simple. It's just an std::vector of tensors and some metadata about shapes and dtypes. We need these encode and decode methods so that we can serialize and deserialize lists in case we need to send them across devices.
Though specific variants can choose to not implement those encode and decode methods and throw errors instead. And if you've been following this, though, and you saw the previous slide where I had an std::vector, and you saw the slide before that where the ops would take one list and return a new one, you might have been terrified that this had automatically made every single recurrent neural network O(N^2). But the TensorFlow runtime has this nice optimization where a kernel is allowed to ask the runtime if anyone else is ever going to use one of its input tensors again. And if the answer to that question is no, the kernel can go and mutate that tensor. So this, incidentally, is how tensor lists work. And in the normal use cases, like when you're using them for stacks, after you've pushed something onto a stack, there are no more references outstanding to the previous value of the stack, before the push. So we can just reuse its memory and append, and get exactly the same O(N) performance that you would expect to get from the stateful version. However, we're doing this with stateless operations, so we get to differentiate through this code. And if you do end up holding an extra reference to something that you want to mutate, or apply a mutating op to later, the system will silently do a copy behind the scenes to ensure the correct behavior. And this is also good, because we, again, managed to decouple the behavior from the implementation. So we can take operations that have exactly this meaning and give them to a compiler. And the compiler might be able to [INAUDIBLE] that copy if it can prove that it happens at some point in time. Or use a different internal representation for these tensors. Yes. AUDIENCE: And this copy is just the copy of a vector of tensors, and not the tensor buffers themselves? ALEXANDRE PASSOS: The tensor buffers themselves never need to be copied, because that's a separate level. But again, even if you just copy the vector of tensors, you can still see that show up in some profiles. So one more thing you need to do if you want to define your own variant dtype and have it work seamlessly with automatic differentiation is you need to tell TensorFlow how to add two of these, and how to make a zeros_like, because these are operations that autodiff needs to do all the time. It's not obvious why autodiff needs to make zeros, and I'm happy to talk about this some other time. It has something to do with differentiating operations that have multiple outputs, and doing that in a single bit of code that doesn't have to be aware that some of those outputs might not have been used, so they do not have upstream gradients. So essentially, this is it. This should be all you need to know to understand how state and how arbitrary C++ stuff is represented in TensorFlow. There are many other variant dtypes other than the tensor list. That is just one. It was one of the first ones. And it's one that showcases all the little bits in there, which is why I chose to talk about it. Similarly, there are many other resource types other than the variable one. But variable is by far the most complicated. So if you understand how that works, you should understand all the others. Happy to take questions now. But if you're watching this on YouTube and you're not in this room, you can email your questions to developers.tensorflow.org, where we have discussions about TensorFlow internals.
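Here is a minimal sketch of those list ops in action through the public tf.TensorArray wrapper, which in TF 2.x is backed by the stateless TensorList ops (assuming eager execution; the numbers are illustrative). Each write returns a new array value rather than mutating in place, and gradients flow straight through:

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])
with tf.GradientTape() as tape:
    tape.watch(x)
    ta = tf.TensorArray(tf.float32, size=0, dynamic_size=True)
    for i in range(3):
        # write() returns a new TensorArray value instead of mutating in place
        ta = ta.write(i, x[i] * float(i + 1))
    y = tf.reduce_sum(ta.stack())

print(tape.gradient(y, x))  # [1. 2. 3.]
```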
AUDIENCE: I have one question. So could you talk a little bit about the decision to have this one catch-all type versus making it easy to add new dtypes? ALEXANDRE PASSOS: The one catch-all type for resource? For variant? AUDIENCE: For variant. Yeah. ALEXANDRE PASSOS: Ah. AUDIENCE: [INAUDIBLE] ALEXANDRE PASSOS: Yeah. That's a questionable decision. I think it mostly comes from the fact that originally TensorFlow did not make it very easy to add new dtypes. There are all sorts of enumerations and specializations that have to happen on a per-type basis. So having a hook that lets you easily add a type without any changes to the runtime was considered important. I don't necessarily think that this is the end stage. And maybe at some point in the future we should stop representing lists as a variant, and start representing them as a list dtype, which would allow the runtime to specialize for them in a better way. AUDIENCE: So dtype would become a string instead of an int. ALEXANDRE PASSOS: In the full case, where the dtype has become a string instead of an int, we'd have to stop having switches on dtypes everywhere in our code base. But it might make sense to add lists as one of the ints. But, again, each dtype we add will dramatically increase the TensorFlow binary size, because we need to register all sorts of kernels for all dtypes, even if we don't have to. There are a few unintentional side effects. It makes sense to specialize to a small set of things, like fast, dense buffers of numbers, because that's most of our expensive computation. AUDIENCE: What have been some of the more common pitfalls that you've seen that people had? Like buggy or racy code initially. And as they've gone to resource variables, they need to either restructure their code, or work with the bugs that [INAUDIBLE]. ALEXANDRE PASSOS: There are many, many, many, many, many, many bugs. AUDIENCE: Global step is one. ALEXANDRE PASSOS: Yeah. Sorry? AUDIENCE: Global step is one. ALEXANDRE PASSOS: Yeah. The one that most people see is that-- if you go back to this guy? Yeah. _as_graph_element. One unintended consequence of this is that because session.run is not allowed to create a new operation in the graph, reading the variable has to pre-create a tensor that reads its value, which means that if you fetch a variable on the same session.run step as you have done some mutating operation, and the variable's a resource variable, you're guaranteed to see the value before the mutation. The read will happen before the assign, just because they don't have any control dependencies. Now, not guaranteed. You're almost guaranteed. Because they don't have any control dependencies either way, you get non-deterministic behavior. But the read is cheap and runs fast, and it has no dependencies on it, while usually you have to compute something to get to the assign. With ref variables, because of that aliasing behavior, you're fairly likely to see the value after the assignment, under the assumption that everything was on the same device and stuff. So you have all sorts of unit tests that people write that get confused. This, I think, is the only regression we've had. If you look at bugs from variables, there are many. You'll see sanitizers, like thread sanitizer and address sanitizer, fire on the TensorFlow runtime often, due to those race conditions involving variables. My favorite one is a combination of control flow V1 and ref variables, because control flow V1, for conditionals, doesn't create a single predicate. It creates many switch operations.
And if the input to those switch operations is a ref variable, and one of the branches assigns to the variable, then half of your switch operations are going to execute one branch, and the other half are going to execute the other branch of the conditional. And with that TensorFlow [INAUDIBLE], this can lead to bizarre, undefined behaviors. This is a very fun one. And another problem is with optimizations that you might want to apply. For example, Grappler likes to rewrite things like tensor plus zero to tensor, because sometimes that zero might have been added there by some complicated graph that it just managed to constant fold and prove is a zero. And due to the implementation details of ref variables, plus zero is guaranteed to copy a variable. So if you wanted to copy a variable so that you could have its value before a write, to compare it with the value after the write and find out by how much it changed, and Grappler rewrites your plus zero to just the value of the variable, now your program has been broken. So you have all these very subtle interactions between things that you would think are harmless. So you see a few workaround patterns in TensorFlow a lot. You see people putting lots of identity tensors on different devices to force a send and receive, to force a copy. You also have the gradient code. It has this fun fact, where if you're backpropping through a deep neural network, once you've computed the gradient with respect to a variable, you can do two things. You can update the value of the variable. Or you can compute the gradient with respect to the input layer. Well, computing the gradient with respect to the input layer is a matrix multiplication, or a transposed convolution, between the value of the variable and the upstream gradients. So if you've already mutated the value of the variable, you're now computing the wrong gradients. And so this leaked into the gradient code, which has a gate_gradients argument, due to ref variables, so that it protects the backprop from being affected by assignments to the variables. This has side effects, which means things like we lose performance. For the topmost layer of a neural network, you only need to compute the gradient with respect to the variables, not with respect to the inputs. But because of the gating code, we have to force the computation with respect to the inputs, so that we can guarantee that, if there was a layer before it, we would not have updated that variable before we had seen the gradient with respect to those inputs. It also does not allow us to overlap variable updates very well with the gradient computation. I can keep going. There is a lot of code in the send/receive scheduling that tries to prevent complicated deadlocks that can happen when you have variable assignments, and also complicated cases where you tend to not send the value of the variable that you thought you were sending. AUDIENCE: So I guess a follow-on from this would be that in a world with only resource variables, does this present opportunities to remove some behavior, or things that people were kind of relying on? ALEXANDRE PASSOS: Yeah. There's a lot of code that we will be able to delete if we no longer have to support ref variables. And a lot of this code is complicated, and buggy, and very hard to maintain. AUDIENCE: Let me ask the reverse of that question. Do you know of anybody who is actually relying on ref variable behavior? ALEXANDRE PASSOS: Yes.
AUDIENCE: So that issue that I told you about, the plus one-- there's this thing called the global step in Estimator that is incremented on every training step, and is read on every training step. And every Estimator user has a bunch of hooks that rely on checking the value of the global step after every training step. So everybody who is doing Estimator training in the single-machine case is effectively relying on being able to read the global step after it, right, by just separately fetching it. AUDIENCE: And they don't care necessarily if the value is out of sync between the different reads? AUDIENCE: The ref behavior ends up being that they get the value after the assignment. ALEXANDRE PASSOS: Because it's an int variable, it gets force-placed on the CPU. It has all these silent requirements that conspire to allow people to rely on this. So our own code relies on this quite a bit. In practice, it's not a big deal, because most Estimator users are doing distributed training. And when you do distributed training, your variables are on other devices, and you no longer have this guarantee that you will always read exactly the value after the assignment. So all the hooks have to be robust to not reading that. But the unit tests for the hooks rely on the fact that they all run on the same device. That is a big one. I have seen some cases where you might rely on the fact that you can do both snapshots and sparse writes to a variable efficiently, in the ref variable case, with race conditions. If you're implementing some neural computational memory thingy, you might want that behavior. And that's one of the cases where I think we might need to just implement a separate policy for how to do your variables to make it work. AUDIENCE: So why not use variants for literally everything? Just get rid of all the other dtypes? ALEXANDRE PASSOS: Because we can specialize the runtime to the other dtypes to make it faster. If you have variants, you have to do more runtime dynamism to figure out-- like if you want to add two floats, now you have to check at runtime that they are floats. And you need to rewrap them into the variant thing, which has one extra pointer dereference. AUDIENCE: And you also get less [INAUDIBLE]. ALEXANDRE PASSOS: Yeah. AUDIENCE: Yeah. While [INAUDIBLE], you mean? AUDIENCE: Yeah. Well, it might not have been two floats. ALEXANDRE PASSOS: A float and an int. You wouldn't know. AUDIENCE: Or a float and a string. ALEXANDRE PASSOS: Which is, incidentally, one of the reasons why you want to move lists out of variants, so we can get better type checking for them. So that you don't accidentally add, like, a list and a mutex together or something like that. AUDIENCE: But would that be just a list, or a list plus what the types of the elements would be? ALEXANDRE PASSOS: It would be very interesting if we could extend the notion of type in a TensorFlow graph to include a little more information than just an enum. I think that's a separate project. And I don't know if anyone is working on it. AUDIENCE: But would that checking only be the first step, though? ALEXANDRE PASSOS: Yes. AUDIENCE: That's not too bad. AUDIENCE: So you mentioned having more types bloats the binary. Is that just int and float? Or 32 and 64, all of these? ALEXANDRE PASSOS: All of these. And incidentally, we also don't have very good coverage of our types. So as of the recording of this talk, a lot of the uint types only work inside XLA.
And they sometimes work outside of XLA if you go for the ops that don't actually look at the types, like identity. But if you try to do operations on them, most of the kernels are not registered. And there's no good reason for that, other than binary size and legacy. It's just that you end up with lots of holes when you add more dtypes, and those take some time to patch. OK. I think this is about as much time as we have. So thank you very much. [APPLAUSE]