  • ALEXANDRE PASSOS: Hi.

  • My name is Alex.

  • And I'm here to tell you today about resources and variants.

  • And really this is a talk about state in TensorFlow

  • and stuff that got accidentally represented in state

  • in TensorFlow for far too long.

  • So what is state?

  • I would love to be able to stand here, or rather

  • sit here, and tell you that an operation is stateful

  • if either executing it has a side effect,

  • or if its output depends on something

  • other than the value of its input.

  • But this is not what TensorFlow means by stateful.

  • Sadly, TensorFlow goes by the [INAUDIBLE] notion

  • that the meaning of a word is defined by its usage.

  • So state in TensorFlow is defined

  • by this one bit that gets flipped and means all sorts

  • of very interesting things.

  • So, for example, this slide is wrong.

  • tf.print is stateful.

  • It has a side effect.

  • Yay.

  • tf.data.Dataset.from_tensor_slices has no side effects,

  • because the data set operations are value types,

  • and they're stateless.

  • And yet, that kernel is marked as stateful.

  • Because one of the effects of marking something

  • as stateful in TensorFlow is that it

  • disables constant folding.

  • And constant folding can be buggy with data sets.

  • Iterators, on the other hand, are stateful.

  • This might lead you to think that there

  • is some meaning to this.

  • But there are also some things in TensorFlow

  • that could go either way.

  • So to differentiate while loops, we have stacks.

  • So that when you're doing the forward pass of the loop,

  • you push things into a stack.

  • And when you're doing the backward pass,

  • you pop things from the stack.

  • So you can look at intermediate activations and stuff.

  • And those things were stateful in tf V1,

  • but they're stateless in tf V2.

  • Tensor lists that you can use to aggregate stuff

  • from many iterations of a loop into a single view,

  • or do the reverse, they're also stateful in tf V1 and stateless

  • in tf V2.

  • AUDIENCE: Is that because we didn't invent the stateless way

  • until later?

  • ALEXANDRE PASSOS: Because we did not invent the stateless way

  • until later.

  • Yes.

  • So I want to spend the rest of the talk talking about how

  • statefulness is represented in tf V1, some of the problems

  • with that, how we're fixing those problems in tf V2,

  • and how we can deal with state, and also with things

  • that are not necessarily easily representable

  • with dense tensors.

  • So how is statefulness represented?

  • In one of two ways--

  • the most obvious way is that if you go look in the TensorFlow

  • source code, and you find where ops are registered,

  • you will see this bit.

  • SetIsStateful.

  • And the definition of state in TensorFlow

  • is that op defs that have this bit set are stateful.

  • And all sorts of places in the runtime

  • are going to look for that bit and behave differently

  • if that bit is set.

  • And people set the bit because they

  • want any of those behaviors.

  • And this is something we need to clean up.

  • And I think we might have a chance to clean this up

  • with the MLIR dialect of TensorFlow, which

  • is going to have finer-grained bits.

  • But until then, we're stuck with this one bit

  • that means too many different things.

  • So among other things, what does this bit mean?

  • It means that TensorFlow will not do constant folding.

  • This includes the two or three separate systems

  • in TensorFlow that do constant folding.

  • All of them know how to bypass stateful operations.

  • Similarly, there are at least two different places

  • in TensorFlow that do common sub expression elimination.

  • And they refuse to do common sub expression elimination

  • of stateful operations, which is very good, because if you were

  • to do that, and you have a neural network

  • with many layers, and your layers are initialized

  • from a random op, all of the layers with the same shape

  • would be initialized with exactly the same random values.
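
To make the constant-folding and CSE point concrete, here is a small sketch (my example, not from the talk) in TF2 eager mode. It contrasts stateful tf.random.uniform, whose two syntactically identical calls must not be merged, with stateless tf.random.stateless_uniform, which a CSE pass could legally collapse:

```python
import tensorflow as tf

# Stateful random ops: merging these two "identical" nodes would be wrong,
# since each call is supposed to produce fresh values.
a = tf.random.uniform([2])
b = tf.random.uniform([2])
print(tf.reduce_all(a == b).numpy())  # almost certainly False

# Stateless random ops: same inputs, same outputs, so a CSE pass could
# legally collapse them into one node without changing the result.
c = tf.random.stateless_uniform([2], seed=[1, 2])
d = tf.random.stateless_uniform([2], seed=[1, 2])
print(tf.reduce_all(c == d).numpy())  # True
```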

  • AUDIENCE: And all your prints would potentially

  • be collapsed into a single print.

  • ALEXANDRE PASSOS: Only prints of the identical string

  • would be collapsed into a single print.

  • Because otherwise we would have enough information

  • to disambiguate those.

  • But statefulness also means some things that

  • are not very obvious at all, like the op kernel

  • instances that the runtime uses to represent the computation

  • to run are reused across sessions for op kernels

  • that are for stateful ops that have the same name.

  • And there is also a somewhat long tail of obscure behavior

  • changes, like parallel for behaves slightly differently for stateful operations.

  • And people are known to set a stateful bit for any one

  • of these reasons and more.

  • The other way of representing state in tf

  • that we're trying to get rid of in tf V2

  • is the notion of a ref tensor.

  • And going back to the variable op,

  • it is this thing here, where you can

  • say that a tensor is either of a D type, or a ref of that D type.

  • And the reason why we did that is that it's very convenient in many cases to be able to keep information in the runtime that persists across calls to session.run.

  • Specifically, the variables-- if you

  • had to write your code like this, where every session.run

  • you'd feed your variables and then you'd fetch them back,

  • and you were doing some kind of distributed training,

  • you would have so many network round

  • trips and so much extra latency for this,

  • it would be completely impractical.

  • So the idea of the variable op, which

  • is the thing that motivated the ref tensor, is like a constant,

  • but mutable.

  • And if you dig through the runtime,

  • you'll find this piece of code, which

  • I think is the most concise representation I could find

  • of how we represent the distinction between a ref tensor and an ordinary tensor.

  • This is what the input to an op kernel looks like.

  • And it's essentially a manually implemented absl-style one-of, where it's either a manually constructed tensor-- using a ManualConstructor just so we don't initialize it in case we're not going to need it-- or the pair of a pointer to a tensor and a pointer to a mutex.

  • And if you've ever programmed in C++,

  • you should be terrified right now, because you see a pointer,

  • and you see no comment about who owns this pointer,

  • and what is the lifetime of that pointer?

  • And a good third of the issues of ref variables

  • come from the fact that it's been impossible or very

  • hard to retrofit into the system a coherent notion of ownership

  • of this pointer that's going to be memory safe.

  • But that's not all.

  • The way the ref variables work is

  • that you have a graph that looks like this.

  • You have this variable node whose output

  • is a tensor that can change, and you can feed it

  • to an operation that mutates it, like assign,

  • or you can feed it to an operation that does not

  • mutate it, like identity.

  • If you feed it to an operation that does not mutate it,

  • like identity, the TensorFlow runtime

  • will silently cast that Tensor* to a Tensor.

  • So it makes another tensor object that aliases the buffer pointed to by that tensor,

  • and just keep going.

  • So the reason why I like this graph is that it's short.

  • It's simple.

  • If you look at every single gradient update

  • that we use for training, it kind of looks like this.

  • But it's also kind of tricky.

  • So we have, I don't know, like 20, 30 people in the room now.

  • Can I get a show of hands on who thinks

  • that the result of the print is the value after the assign?

  • No one.

  • AUDIENCE: What do you mean?

  • The print?

  • ALEXANDRE PASSOS: So this graph, it

  • has an add that takes as input the identity of the variable,

  • and some constant.

  • And it prints.

  • And it has a control dependency

  • from an assign that mutates the value of the variable

  • to the add.

  • So how many people think this is enough to ensure

  • that add will see the value of the variable

  • after the assignment?

  • OK.

  • About five or six.

  • AUDIENCE: Yes.

  • ALEXANDRE PASSOS: How many people

  • think that add will see the value of the variable

  • before the assignment?

  • About two or three hands.

  • How many people think this is a segmentation fault?

  • [LAUGHTER]

  • No one.

  • And how many people think it depends on things that

  • are not written in the graph?

  • AUDIENCE: 100.

  • ALEXANDRE PASSOS: OK.

  • So all of you have been bitten by this, because I

  • got like 15 hands now.

  • It is completely non-deterministic, and it depends on all sorts of runtime properties.

  • For example, if everything is in the same device,

  • and the assign does not change the shape of the variable,

  • because of the way we do aliasing inside the TensorFlow

  • executor, print will print the value after the assignment.

  • However, if the add is in a different device

  • from the variable, then most likely there will be an RPC,

  • and add will sometimes see the value after,

  • sometimes see the value before the assignment.

  • There is one case where add is guaranteed

  • to see the value before the assignment, which

  • is if the assignment changes the shape of the variable.

  • Because if the assignment changes the shape of the variable, then due to intricate details of the implementation of Tensor and TensorBuffer, we do not change the existing tensor buffer.

  • We just allocate a new one.

  • And by the time the identity runs,

  • it has already aliased the old tensor buffer.

  • And you can get a seg fault here,

  • as you might have guessed, if we're

  • talking about string D types.

  • Because add is defined for string types.

  • And if you have two separate threads that are reading

  • and writing to a string in C++, you're very likely to get a seg

  • fault or some other weird behavior.
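
Here is a hedged TF1-style sketch of the kind of graph being polled about, using tf.compat.v1 and a legacy ref variable (use_resource=False); as described above, what the add actually sees depends on devices, shapes, and other runtime details:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# Legacy ref variable (the TF v1 behavior described above).
v = tf.compat.v1.get_variable("v", initializer=1.0, use_resource=False)
assign = tf.compat.v1.assign(v, 10.0)

with tf.control_dependencies([assign]):
    out = tf.identity(v) + 1.0  # which value of v does this see?

with tf.compat.v1.Session() as sess:
    sess.run(v.initializer)
    # With everything on one device and no shape change, this typically
    # prints 11.0; across devices, or if the assign changed the shape,
    # it may print 2.0 instead.
    print(sess.run(out))
```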

  • So this is pretty complicated and kind of

  • unworkable in the long term.

  • You need to know about all sorts of things that are not

  • well documented, and that rely on specific details

  • of the implementation that are not guaranteed to stay stable.

  • And if you were to try to design something

  • like a compiler for TensorFlow, this

  • would be really hard to make work.

  • So we're not doing this anymore.

  • And I'm going to spend the rest of this talk, hopefully,

  • telling you about how we're fixing this

  • and imposing some order onto the situation in tf2.

  • And the interesting thing is that internally, the way variables have always been represented in TensorFlow-- almost always, I guess, since the first open source release-- is that the state has been stored in this resource manager object, which has a create, a lookup, and a delete method.

  • And these can return some arbitrary type.

  • We use some RTTI magic to make sure that this code is type safe at runtime.

  • We even implement RTTI on compilers that do not

  • have RTTI to make this work.

  • RTTI, sorry, is a C++ thing for runtime type identification.

  • And this is a perfectly reasonable API

  • if you wanted to represent state outside of the graph.

  • So the idea that we had in tf2 is

  • let's use this to represent the state as essentially

  • operations in the graph.

  • So there are still some issues with the resource manager, like it's scoped to device objects.

  • And device objects have a weird lifetime.

  • Sometimes they outlive a session.

  • Sometimes they do not outlive a session.

  • And it's slightly different with eager execution.

  • And this can be very surprising in some cases, both

  • when you're doing parameter server training,

  • and when you're not, and you accidentally find yourself doing parameter server training, and unintentionally sharing parameters between two models that are not supposed to share.

  • But overall, it's a reasonable API.

  • So what we did is we created a tensor D type, just

  • like string, or int, or float, that

  • represents the information you need to look something up

  • in the resource manager.

  • And we call this, creatively, DT_RESOURCE.

  • The reason why this is a tensor is that this is just

  • another value.

  • So you can pipe it through a graph.

  • You can stack things together.

  • You can select them dynamically if you want.

  • Or you can just use them statically.

  • It's just a scalar most of the time.

  • Like you can have non-scalar DT_RESOURCE tensors, but most of the interesting operations just want scalars.

  • And then you can use this tensor to manipulate a resource.

  • So internally it's, again, just information

  • you need to make the lookup to the resource manager

  • minimally type safe-- so the device, a container, a name.

  • The container was this idea that we had originally that you

  • would be able to run many separate models

  • on the same parameter server and provide some kind of isolation

  • where you could reset the variables in one model,

  • but not the variables in the other model.

  • The way this was implemented made this very hard to use.

  • And I know very few people who rely on this now.

  • So these days, it's mostly just [INAUDIBLE].

  • But otherwise, it has a name and some information

  • to validate that you're looking up

  • the object of the right type.

  • But that's all there is to it.
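
In TF2 you can poke at this directly; a variable's handle is just a scalar tensor of D type resource that carries the device and name information described above (a small sketch, assuming eager mode):

```python
import tensorflow as tf

v = tf.Variable(tf.zeros([3, 4]), name="weights")

print(v.handle.dtype)  # <dtype: 'resource'>
print(v.handle.shape)  # () -- the handle itself is just a scalar
print(v.device)        # the device the resource lives on, e.g. ...CPU:0
print(v.name)          # e.g. weights:0 (not load-bearing in TF2)
```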

  • And resources are special-cased in a couple of places in the runtime-- not as many as the stateful bit.

  • And one of them is that if you create an op that specifically

  • manipulates-- either takes or returns a tensor of a resource

  • D type, we mark it as stateful, because we assume

  • that if you're asking for a key to something in a resource

  • manager, you're probably going to monkey around with it.

  • And this at least removes the redundancy, because otherwise

  • you would have all these ops that would take resources,

  • modify state in a resource manager,

  • not be marked as stateful, and you

  • would have to wait until they got accidentally

  • constant folded together to see something break.

  • And the second one is that the placer will always

  • co-locate operations that manipulate

  • a resource with the device the resource is on.

  • And this is because you can't really

  • modify a structure that's in another computer

  • without running code on the other computer.

  • But mostly, resource handles are safe in the runtime.

  • And the interesting thing is that now our graph

  • that was very hard to read looks like this.

  • You have this VarHandleOp that represents the resource handle, the key.

  • And you can pass that key to your assignment.

  • You can pass that key to your read operations, et cetera.

  • And now I'm pretty sure everybody should agree with me

  • that this graph, as written, has to return

  • the value of the variable after the assignment.

  • Otherwise, it's a bug.

  • And this is true.

  • There is no weird non determinism.

  • It doesn't matter whether the shape changes or doesn't

  • change, what D type you're dealing with,

  • what device things are on.

  • Also, there is no way to make this seg fault, I believe.

  • So it's substantially nicer.
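
A hedged sketch of the same pattern with today's Python API: tf.Variable wraps VarHandleOp under the hood, assignments go through ops that take the handle, and a read behind a control dependency is guaranteed to observe the assignment:

```python
import tensorflow as tf

v = tf.Variable(1.0)  # creates a VarHandleOp plus an initial assignment

@tf.function
def assign_then_read():
    # read_value=False makes assign_add return just the assignment op.
    assign = v.assign_add(2.0, read_value=False)
    with tf.control_dependencies([assign]):
        # This read must see the assign_add above; no aliasing tricks apply.
        return v.read_value()

print(assign_then_read())  # tf.Tensor(3.0, shape=(), dtype=float32)
```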

  • There's still some subtle things in here.

  • One of them is resource gather.

  • It's an operation that you would think, why would I need this?

  • Because what it does is it does effectively

  • what read plus gather do.

  • But it does it in the single op.

  • And the reason why we have this is

  • that if you think about this: if I really want to guarantee forever that this graph means always reading the variable after the assign-- and if you had flipped that control dependency between read and assign, you would now always be reading the variable before the assign-- then you might have to make a copy to ensure that this memory is preserved.

  • And if you have a very large vector of embeddings,

  • making copies of it can be very expensive.

  • And we would like to provide good performance.
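
The fused sparse read described above is exposed on resource variables as sparse_read, which lowers to a ResourceGather-style kernel; a quick hedged sketch:

```python
import tensorflow as tf

embeddings = tf.Variable(tf.random.normal([1000, 64]))

# Reads only the requested rows; no full dense snapshot of the variable
# needs to be materialized first.
rows = embeddings.sparse_read([3, 17, 256])
print(rows.shape)  # (3, 64)

# Same meaning, spelled as an explicit read followed by a gather.
rows_too = tf.gather(embeddings.read_value(), [3, 17, 256])
```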

  • So really this resource thing is more

  • a specification of the meaning of a graph that

  • has these operations and less the specific details of how

  • they're implemented.

  • It's possible to have many valid implementations of this,

  • and they're going to have different performance

  • characteristics.

  • So, for example, if we lower our graphs to XLA for compilation,

  • XLA can take a cluster of ops that have a bunch of reads

  • and writes to variables, look at the state of the variables

  • before they're clustered, figure out what the state of variables

  • should be after the cluster, and rewrite

  • it to be a bunch of reads, some stateless computation, and then

  • a bunch of assigns.

  • And this correctly preserves the semantics of these operations.

  • And it's a perfectly valid way to do this.

  • We don't always run XLA, though.

  • And if you start thinking about this,

  • there are two relatively straightforward ways

  • you could implement variables.

  • And they have pretty strong performance trade offs.

  • A very obvious implementation is copy

  • on write, where you would copy the buffer for a variable

  • every time we write to it.

  • Another one is copy on read, where the read operation is

  • going to copy.

  • And then the assign operation is just always going to mutate.

  • The interesting thing is that copy on write,

  • if all you're doing is your standard SGD training where

  • you read a bunch of variables in the beginning,

  • do a bunch of forward and backward computation,

  • and then you write to the bunch of variables,

  • you can do this with zero copies.

  • Because by the time you're writing to the variables,

  • there are no outstanding reads left.

  • So yay.

  • Similarly, if you have embeddings,

  • and you are sparsely reading a few rows from your variable

  • in arbitrary, random order, and then

  • later on you're going to sparsely write to those rows,

  • we can do this with no copies if we have copy on read.

  • I mean, no extra copies.

  • Since the reading would have to copy anyway,

  • because it's reading in an unstructured way

  • that we couldn't preserve, like strides or something like that.

  • So which one do we choose?

  • And effectively, we chose both.

  • And we did this by storing a bit on variables

  • and having variables always start in a copy on write mode.

  • And as soon as you do any sparse operation on a variable,

  • we grab an exclusive lock, and make any copies that we need,

  • and put it in copy-on-read mode.

  • This works reasonably well for both the "only use this variable in dense operations" case and the "only use this variable for embeddings" case.

  • It's not necessarily generally the best idea.

  • So I expect this policy might have to change and become

  • more refined over time.

  • But again, this is just an implementation detail.

  • And this does not affect the correctness of the programs

  • that are running on TensorFlow.
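
To make the trade-off concrete, here is a toy Python model of the two policies (this is illustrative only, not TensorFlow's actual implementation; all names are made up):

```python
class CopyOnWriteVar:
    """Toy model: reads alias the buffer; a write copies only if aliased."""

    def __init__(self, data):
        self._buf = list(data)
        self._aliased = False

    def read(self):
        self._aliased = True
        return self._buf            # dense read: no copy, just an alias

    def assign(self, data):
        if self._aliased:
            self._buf = list(data)  # someone may still hold the old buffer
            self._aliased = False
        else:
            self._buf[:] = data     # standard SGD case: zero copies


class CopyOnReadVar:
    """Toy model: reads copy what they touch; writes always mutate in place."""

    def __init__(self, data):
        self._buf = list(data)

    def sparse_read(self, rows):
        return [self._buf[r] for r in rows]  # the gathered rows are a copy

    def scatter_update(self, rows, values):
        for r, v in zip(rows, values):
            self._buf[r] = v                 # no whole-buffer copy needed
```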

  • So I think it's a big improvement of--

  • AUDIENCE: Can I clarify?

  • ALEXANDRE PASSOS: Yes.

  • AUDIENCE: So I thought when we read something,

  • effectively it makes a copy.

  • It seems like this copy is specifically

  • in the context of [INAUDIBLE].

  • ALEXANDRE PASSOS: It pretends to make a copy.

  • So the definition of a read is this: an operation that looks at the output of a read is guaranteed to see the effect of every operation that had an edge pointing to the read, and not see the effect of any operation that had an edge pointing from the read.

  • You can implement this by making a copy on read.

  • You can also implement this by making a copy on write.

  • You can also implement this in more complicated ways

  • that might never make a copy.

  • AUDIENCE: So our default copy on write

  • looks at the reference count, and if it's

  • one, just updates in place.

  • And our default read operation just increments the reference

  • count.

  • ALEXANDRE PASSOS: Yes.

  • AUDIENCE: The default copy on write implementation.

  • ALEXANDRE PASSOS: The copy on write semantics do that.

  • And I assume we're going to eventually switch

  • to more complicated policies.

  • For example, we could look at the graph,

  • and then decide what policy we're

  • going to use to write the variables on this graph.

  • Or we could let users configure this.

  • There are many options here, but ideally, we

  • should be able to implement all of them

  • without requiring that users change the graph

  • structure to get better performance

  • or to get correctness of their behavior.

  • And this is what's important about this, because this means

  • that we get to fundamentally and dramatically change

  • the back end, like use a compiler,

  • and not have to worry about preserving

  • bug compatibility, like what happens if you alias the output of identity on another variable,

  • or something like that.

  • So far I've mostly focused on how

  • the runtime treats variables.

  • But the same fundamental patterns

  • of a handle tensor and operations that read and write

  • to it is used in all sorts of other bits

  • of runtime state in TensorFlow.

  • This includes the dataset iterators, FIFOQueues,

  • HashTables, and a few more things that I have forgotten.

  • AUDIENCE: Are mutexes resources?

  • ALEXANDRE PASSOS: Mutexes, they're a resource. But they also have a variant that represents the mutex lock object. So it's a slightly funner situation. But as far as the resource part of the mutex is concerned, it's, again, a mutable resource tensor that has a handle.

  • It has operations to modify it.
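
For example, a mutable hash table follows exactly the same pattern as a variable: a scalar resource handle plus stateful ops that take it (a hedged sketch using tf.lookup.experimental):

```python
import tensorflow as tf

table = tf.lookup.experimental.MutableHashTable(
    key_dtype=tf.string, value_dtype=tf.int64, default_value=-1)

table.insert(tf.constant(["a", "b"]), tf.constant([1, 2], dtype=tf.int64))
print(table.lookup(tf.constant(["a", "z"])))  # [ 1 -1]

# Like a variable, the table is addressed through a DT_RESOURCE tensor.
print(table.resource_handle.dtype)  # <dtype: 'resource'>
```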

  • So this is nice.

  • And this is just essentially what the runtime looks like.

  • And if you have this picture in your head,

  • you should be able to mostly predict

  • the behavior of TensorFlow programs that manipulate state.

  • One other bit is that in TensorFlow, there's shape inference.

  • I'm sure if you've looked at TensorFlow op registrations,

  • you've seen annotations like this where we set the shape fn.

  • The result of shape inference is not persisted in the graph.

  • It's ephemeral.

  • It's produced every time we create a graph,

  • or while we're importing a graph.

  • But this is very, very useful to ensure not only

  • that we know how to interpret the graph correctly

  • and that the graph is valid.

  • But this is very helpful during the graph building process,

  • where user code can inspect the inferred shapes of nodes,

  • and make different decisions as to whether things can be

  • dynamic or static in the graph.

  • And if all the resources are scalars,

  • this would make it hard to do shape inference

  • on stateful operations that manipulate resources.

  • So we did kind of a hack that should be improved

  • and added a side channel to the shape inference process,

  • this output handle shapes and types

  • that can store an arbitrary list of shapes and D type objects.

  • And different resources and variants

  • are going to assign different semantics to this.

  • Operations like cast that do not affect the shape, just pass

  • the shapes and D types through, and then

  • operations that are aware of what the resource handles

  • are doing are going to look at this

  • and assign meaning to them.

  • So variables just store a single shape and D type for the value of the variable.

  • Tensor lists store the shape and D type of the elements in the tensor list.

  • Iterators store the shapes and D types of all the tensors that you're going to get when you call get next, so that we can properly do shape inference on those graphs.
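
A small hedged sketch of what this buys you during graph building: the handle is a scalar resource tensor, but the side-channel shape and D type information lets the read come out with the full static shape:

```python
import tensorflow as tf

v = tf.Variable(tf.zeros([3, 4]))

@tf.function
def read():
    value = v.read_value()
    # These print while the function is being traced (graph construction):
    print("handle shape:", v.handle.shape)  # () -- just a scalar resource
    print("read shape:  ", value.shape)     # (3, 4) -- via the side channel
    return value

read.get_concrete_function()  # trace once so the prints above run
```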

  • So now that you mostly have a reasonable picture of what

  • resources look like in the runtime,

  • I'd like to pop the stack and talk a little bit

  • about the Python side.

  • So this is going to mostly focus on variables, because I think

  • there are a few interesting things in there that

  • will, again, generalize through other bits in the runtime.

  • The first one is that if you've used TensorFlow before,

  • you know the variables act like tensors.

  • You can pass them to operations.

  • You can use the operators on them.

  • And part of this reason is historical.

  • I think the first implementation of Variable in TensorFlow

  • was literally just the return value of the Variable op.

  • And that happened to be a tensor of reference D type.

  • Later we felt the need to replace that with a class.

  • So we worked somewhat hard to make that class behave exactly

  • like a tensor.

  • And this is something that sometimes library writers

  • downstream from TensorFlow want to have their own types

  • that behave like tensors, or behave like variables.

  • So how do you do this?

  • And I strongly believe this is all you need.

  • First thing to do is you need to make your type

  • convertible to a tensor.

  • So there is a tf.register_tensor_conversion_function that takes the type and a function to convert that type to a tensor.

  • In the case of a variable, it just

  • reads the value of the variable.

  • Easy.

  • There are some special cases in there

  • to deal with reference types that are no longer needed,

  • thankfully.
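
A minimal sketch of the conversion hook, using a made-up Celsius type (the class and helper function are hypothetical; the registration API is the one described above):

```python
import tensorflow as tf

class Celsius:
    def __init__(self, degrees):
        self.degrees = degrees

def _celsius_to_tensor(value, dtype=None, name=None, as_ref=False):
    # as_ref only mattered for the old ref types; it is ignored here.
    return tf.convert_to_tensor(value.degrees, dtype=dtype, name=name)

tf.register_tensor_conversion_function(Celsius, _celsius_to_tensor)

print(tf.add(Celsius(20.0), 1.5))  # Celsius now silently converts to a tensor
```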

  • Another thing that you need to do is register your type as a dense-tensor-like type, which means that implicit stacking-- just putting many instances of that type in a list-- will work by silently reading them and then calling stack.

  • Then you need to overload all operators.

  • And if you look, there's this method, overload all operators, in the tf.Variable class, which has an implementation of this that will steal all the operator overloads from Tensor.

  • And there is a rule in TensorFlow

  • that session.run is not allowed to add nodes to the graph.

  • This can catch all sorts of terrifying bugs.

  • So it's good that we have this rule.

  • But if you want to be able to fetch the value of a thing,

  • then you need to implement this underscore as graph element

  • method, which session.run pokes to see

  • if it is there, which is supposed to return

  • a pre-existing tensor.

  • And so variables have to record a tensor that is going to be the result of reading them and store it there, so you can use session.run to fetch them.

  • There is also one more tricky bit

  • about the Python implementation of Variables

  • that you might need to know, which

  • is that in ref variables, because they can just convert

  • to a ref tensor, the following works: you can take the return

  • value of an assignment operation,

  • and call another assignment operation on it,

  • and do that as many times as you want,

  • because assignment operations chain.

  • And in resource variables, clearly,

  • the assignment operations, they don't have a return value.

  • Because if you were to return something like the handle,

  • the handle, it's useless.

  • It's the same as the input.

  • No point in returning that.

  • If we were to return the value of reading the variable,

  • now that's an operation that might potentially

  • be very expensive.

  • And you'd like to not read it unless you're

  • going to need to read it.

  • So we added this notion of unread variable, which

  • is a class that if you have a control dependency on it,

  • it just has a control dependency on an assignment operation.

  • But if you try to read this value,

  • it's guaranteed to read the value after that assignment

  • operation.

  • And because this acts like a variable,

  • we can use this to make the chained assignment

  • work and a few other things.

  • So if you see unread variables in your graph,

  • you should know that this is the kind of thing you're dealing with.

  • But if you've been paying attention,

  • you've seen that the core set of operations for a variable

  • does not self initialize.

  • And this is by design.

  • A lot of the early use cases of TensorFlow

  • were optimized for shared parameter server training.

  • And in that case, when you have multiple parameter servers,

  • and multiple workers all talking to each other,

  • you might want to initialize variables from scratch.

  • You might want to load them from a checkpoint.

  • And depending on your training policies,

  • you might want to do different things.

  • So the graph is agnostic as to how you do those things.

  • The runtime is agnostic how you do those things.

  • And the execution, like the session.run

  • gets to set up the policy.

  • This is very important, because we

  • have to change the policy many, many times until we finally

  • made it mostly bug free in estimator.

  • But as with tf2, as we're not necessarily

  • saying that the default way to use

  • TensorFlow is shared parameter server training,

  • we went for ergonomics over safety.

  • So in tf V2, mostly variables are initialized on creation.

  • In eager execution, this is very easy to do.

  • Because as soon as you execute the ops, you create a variable,

  • we initialize it for you.

  • In tf.function, it can be a little trickier,

  • because the initializer for a variable

  • might be defined inside a function.

  • And there are a few ways to handle this.

  • And I'm going to go into detail on this in the tf.function

  • talk.

  • Similarly, variables sharing is a complicated issue.

  • If you're doing shared parameter server training,

  • you would like all the workers that

  • connect to the same parameter server

  • to see the same variable so they can see each other's writes

  • to those variables.

  • And the way we did this was to say the variables

  • are shared by name.

  • So in tf V1, variable names are load bearing.

  • If you change, or edit, or modify the names of variables,

  • you dramatically change the behavior of the program.

  • This is a questionable decision in all cases,

  • because variable names, they look very

  • harmless when you read code.

  • So in tf2, we chose to make names non load bearing.

  • Internally we're still using the runtime that

  • assumes a load bearing name, but we always

  • use a UID to hide that fact.

  • And if you want to have shared names for parameter server

  • training, you can, because you can control

  • that detail in the runtime.

  • But the Python API no longer makes that straightforward.

  • And now you might be asking, well,

  • how would I be able to change how the details

  • and variables are implemented?

  • Another thing that we're adding in tf V2

  • is this notion of a variable creator

  • that lets you control how variables are created.

  • And so Variable has a metaclass, so that when you call tf.Variable you might not actually get an instance of Variable.

  • You might get an instance of some subclass of variable

  • that defines some specific behaviors.

  • In tf V1, by default, you get ref variable.

  • In tf V2, by default, you get resource variable.

  • But in other contexts, you might get other instances.

  • The meta class code itself is not particularly interesting.

  • It's just you should probably know

  • this exists if you're dealing with variables in Python.
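
A hedged sketch of the creator hook itself; logging_creator is a made-up example, but tf.variable_creator_scope is the public entry point:

```python
import tensorflow as tf

def logging_creator(next_creator, **kwargs):
    # Inspect or rewrite kwargs (initial_value, name, dtype, ...) here,
    # then delegate to the next creator in the chain.
    print("creating variable:", kwargs.get("name"))
    return next_creator(**kwargs)

with tf.variable_creator_scope(logging_creator):
    v = tf.Variable(1.0, name="my_var")  # prints: creating variable: my_var

print(type(v))  # whatever class the active creators decided to build
```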

  • So, for instance, tf.function uses its own subclass of variable, which behaves slightly differently from the V1 graph resource variables when it comes to initialization,

  • so that it can capture initializers and things

  • like that.

  • And it's nice that we can keep that code encapsulated

  • within the tf.function package, and not push its complexity out

  • to the same variable class that is used everywhere.

  • Similarly, tf.distribute might need

  • to create replica variables or mirrored

  • variables with complicated read and write modes.

  • And that complexity can be mostly centralized

  • in the tf.distribute package instead of being

  • spread out all over TensorFlow.

  • So when you're inside a distribution strategy scope

  • when you create a variable, your distribution strategy

  • is probably setting up a variable creator

  • that's going to do the right thing for you.

  • And this is very important in TPUs, and in mirrored,

  • and stuff.

  • So it's good that we have this kind of flexibility.

  • But just like how creation is configurable,

  • deletion can be a little tricky.

  • So a nice side effect of having load bearing names

  • for variables in tf V1 is that it encourages

  • you to have very few of them, and to think very carefully

  • about what each of them was called.

  • So the set of variables throughout the lifetime

  • of a TensorFlow program was mostly fixed,

  • which meant that deleting variables is mostly

  • not a big deal.

  • And you could get away with very broad, wide big hammers,

  • delete variables, like session.reset.

  • But in tf V2, it is very easy with eager execution and functions to create a lot of variables.

  • And you can create temporary variables.

  • So we do need to clean up after ourselves or we're going

  • to have memory leaks.

  • And you'd think that since this is Python,

  • you should be able to just override __del__ to get variables to clean up after themselves.

  • But it's not that simple.

  • It turns out that if you override __del__ on an object, that object can easily become part of a reference cycle-- and if you've ever looked at the implementation of tf.Variable, you'll see it has tens of members, so any one of them could point to something that could point to something that could point back to that variable.

  • And if anything with a __del__ is part of a reference cycle, that entire cycle becomes uncollectable, and we have leaked that memory forever.

  • However, there is an easy workaround,

  • which is that if you make an object that is guaranteed

  • to only have one or two data members that cannot possibly be

  • part of a reference cycle, you can override __del__ on that

  • object, and then take an object that's complicated that might

  • be a part of a cycle, and store a pointer from that expensive

  • object to the small, cheap object that knows how to do

  • the cleanup.

  • This does not make the cycle uncollectable,

  • and still guarantees that the clean up

  • happens when the first object goes out of scope.
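
A hedged sketch of this deleter pattern (the class names here are made up; TensorFlow's real variables use an internal deleter object, but the shape is the same):

```python
import tensorflow as tf

class _ResourceDeleter:
    """Tiny, cycle-free object whose only job is to free one resource."""
    __slots__ = ("_handle",)

    def __init__(self, handle):
        self._handle = handle

    def __del__(self):
        try:
            tf.raw_ops.DestroyResourceOp(resource=self._handle,
                                         ignore_lookup_error=True)
        except Exception:
            pass  # the runtime may already be shutting down


class FancyStatefulThing:
    """A complicated object with many members that might form cycles."""

    def __init__(self):
        self._handle = tf.raw_ops.VarHandleOp(
            dtype=tf.float32, shape=[], shared_name="fancy_thing")
        tf.raw_ops.AssignVariableOp(resource=self._handle,
                                    value=tf.constant(0.0))
        # No __del__ here; the tiny deleter object carries the cleanup.
        self._deleter = _ResourceDeleter(self._handle)
```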

  • Now, the worst that can happen is

  • that a reference cycle means that your garbage collection is

  • not immediate.

  • It's just delayed until whenever the Python garbage

  • collector decides to run.

  • But that still guarantees correctness and a lack

  • of leaks, even though it might be a little surprising

  • that if you use sufficiently complicated objects,

  • your GPU memory might take a while to be freed.

  • And you might need to use Python's gc module to force it to clean up after itself.

  • And this pattern of making a deleter object

  • is used everywhere in the TensorFlow code base

  • that we have resources and that we

  • need to override __del__, just to ensure

  • that we have orderly cleanup.

  • So that's essentially all you need

  • to know about resources to effectively use them

  • in TensorFlow.

  • And now I'd like to move on to talk about variants.

  • And I put those two things together,

  • because for the longest time there

  • was a conflation of uses between resources and variants.

  • Because resources were like the easiest way to just hook

  • arbitrary C++ code inside a TensorFlow runtime.

  • But it turned out that a lot of the things that we were doing

  • using resources to do were better served by not arbitrary

  • C++ code, but by stateless operations on immutable values.

  • And why would you want that?

  • Mostly because stateless things on immutable values

  • are much easier to compile.

  • And they're also much easier to differentiate through.

  • And differentiation is something we really care about.

  • So [INAUDIBLE] had the idea of making a separate D type

  • variant for immutable arbitrary C++ stuff.

  • Its implementation is very, very similar to something like absl::any, and other dynamic types in C++,

  • with a few bells and whistles to integrate better in the tf

  • ecosystem.

  • So a canonical example of variants is the tensor list ops, which are used under the hood to implement stacks and tensor arrays in TensorFlow V2.

  • But also they are one of the original motivating factors.

  • And they look like this.

  • You can have an op that makes an empty tensor list.

  • Then you can have another op that takes a list and a value,

  • and spits out a new list that represents the concatenation

  • of those things.

  • And then you have an op that takes a list,

  • and spits out a slightly shorter list and the value that was removed from the list.

  • And you can inspect those values and manipulate them.
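
A hedged sketch of that flow with the raw tensor list ops (these op names exist under tf.raw_ops; a real user would more likely go through tf.TensorArray or a while loop):

```python
import tensorflow as tf

# An empty list of scalar float32 elements, with no fixed maximum size.
l = tf.raw_ops.EmptyTensorList(element_shape=[], max_num_elements=-1,
                               element_dtype=tf.float32)
l = tf.raw_ops.TensorListPushBack(input_handle=l, tensor=tf.constant(1.0))
l = tf.raw_ops.TensorListPushBack(input_handle=l, tensor=tf.constant(2.0))

# Pop returns a *new*, shorter list plus the removed element; the inputs
# are values, so nothing is mutated in place.
l, top = tf.raw_ops.TensorListPopBack(input_handle=l, element_shape=[],
                                      element_dtype=tf.float32)
print(top)  # tf.Tensor(2.0, shape=(), dtype=float32)
```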

  • And the fun thing about these is that because these are all

  • immutable, you can easily define their gradients.

  • And if you think about it, the gradient of push is pop.

  • The gradient of pop is push.

  • The gradient of set item is get item.

  • It mirrors very nicely.

  • So you get code that's efficiently differentiable up

  • to higher orders.

  • And internally, the tensor list structure can be very simple.

  • It's just an std::vector of tensors

  • and some metadata about shapes and D types.

  • We need these methods, encode and decode,

  • so that we can serialize and deserialize lists in case

  • we need to send them across devices.

  • Though specific variants can choose

  • to not implement those methods and throw errors instead.

  • And if you've been following this,

  • though, and you saw the previous slide where I had an std::vector,

  • and you saw the slide before that where the ops would

  • take one and return a new one, you

  • might have been terrified that this had automatically

  • made every single recurrent neural network O of N squared.

  • But the TensorFlow runtime has this nice optimization

  • where a kernel is allowed to ask the runtime if anyone else is

  • ever going to use one of its input tensors again.

  • And if the answer to that question is no,

  • the kernel can go and mutate that tensor.

  • So this, incidentally, is how tensor lists work.

  • And in the normal use cases, like when

  • you're using them for stacks, after you've pushed something

  • into a stack, there are no more references

  • outstanding to the previous value of the unpushed stack.

  • So we can just reuse its memory and append,

  • and get exactly the same O of N performance

  • that you would expect to get from the stateful version.

  • However, we're doing this with stateless operations.

  • So we get to differentiate through this code.

  • And if you do end up holding an extra reference to something

  • that you want to mutate or apply a mutating op later,

  • the system will silently do a copy behind you

  • to ensure the correct behavior.

  • And this is also good, because we, again,

  • managed to decouple the behavior from the implementation.

  • So we can take operations that have exactly this meaning,

  • give them to a compiler.

  • And the compiler might be able to [INAUDIBLE] that

  • copy if it can prove that it happens at some point in time.

  • Or use a different internal representation

  • for these tensors.

  • Yes.

  • AUDIENCE: And this copy is just the copy

  • of a vector of tensors, and the tensor buffers themselves.

  • ALEXANDRE PASSOS: The tensor buffers

  • themselves never need to be copied, because that's

  • a separate level.

  • But again, even if you just copy the vector of tensors,

  • you can still see that show up in some profiles.

  • So one more thing you need to do if you

  • want to define your own variant D type

  • and have it work seamlessly with automatic differentiation

  • is you need to tell TensorFlow how to add two of these,

  • and how to make a zeros like, because these

  • are operations that auto diff needs to do all the time.

  • It's not obvious why auto diff needs to make zeros.

  • And happy to talk about this some other time.

  • It has something to do with differentiating operations that

  • have multiple outputs, and doing that

  • in a single bit of a code that doesn't have to be aware

  • that some of those outputs might not have been used so they

  • do not have upstream gradients.

  • So essentially, this is it.

  • This should be all you need to know to understand how state

  • and how arbitrary C++ stuff is represented in TensorFlow.

  • There are many other variant D types

  • other than the tensor list.

  • That is just one.

  • That was one of the first ones.

  • And it's one that showcases all the little bits

  • in there, which is why I chose to talk about it.

  • Similarly, there are many other resource D types

  • other than the variable one.

  • But variable is by far the most complicated.

  • So if you understand how that works,

  • you should understand all the others.

  • Happy to take questions now.

  • But if you're watching this on YouTube

  • and you're not in this room, you can email your questions

  • to developers.tensorflow.org, where we have discussions

  • about TensorFlow internals.

  • AUDIENCE: I have one question.

  • So could you talk a little bit about the decision

  • to have this one catch all type versus making

  • it easy to add new D types?

  • ALEXANDRE PASSOS: The one catch all type for resource?

  • For variant?

  • AUDIENCE: For variant.

  • Yeah.

  • ALEXANDRE PASSOS: Ah.

  • AUDIENCE: [INAUDIBLE]

  • ALEXANDRE PASSOS: Yeah.

  • That's a questionable decision.

  • I think it mostly comes from the fact

  • that originally TensorFlow did not make it

  • very easy to add new D types.

  • There are all sorts of enumerations

  • and specializations that have to happen on a per type basis.

  • So having a hook that lets you easily

  • add a type without any changes to the runtime

  • was considered important.

  • I don't necessarily think that this is the end stage.

  • And maybe at some point in the future

  • we should stop representing lists as a variance,

  • and start representing them as a list D type.

  • Which will allow the runtime to specialize to them in a better

  • way.

  • AUDIENCE: So D type would become a string instead of an int.

  • ALEXANDRE PASSOS: But in the full case,

  • where the D type has become a string instead of an int,

  • we'd have to stop having switches on D types

  • everywhere in our code base.

  • But it might make sense to add lists as one of the ints.

  • But, again, each D type string will dramatically

  • increase the TensorFlow binary size,

  • because we need to register all sorts of kernels

  • for all D types, even if we don't have to.

  • There are a few unintentional side effects.

  • It makes sense to specialize to a small set of things,

  • like fast, dense buffers of numbers,

  • because that's most of our expensive computation.

  • AUDIENCE: What have been some of the more common pitfalls

  • that you've seen that people had?

  • Like buggy or racy code initially.

  • And as they've gone to resource variables,

  • they need to either restructure their code,

  • or work with the bugs that [INAUDIBLE]..

  • ALEXANDRE PASSOS: There are many, many, many, many, many,

  • many bugs.

  • AUDIENCE: Global step is one.

  • ALEXANDRE PASSOS: Yeah.

  • Sorry?

  • AUDIENCE: Global step is one.

  • ALEXANDRE PASSOS: Yeah.

  • The one that most people see is that if you go back

  • to this guy?

  • Yeah.

  • _as_graph_element.

  • One unintended consequence to this

  • is that because session.run is not

  • allowed to create a new operation in the graph,

  • reading the variable has to pre-create a tensor that

  • reads its value, which means that if you fetch

  • a variable on the same session.run step

  • as you have done some mutating operation,

  • and the variable's a resource variable,

  • you're guaranteed to see the value before the mutation.

  • The read will happen before the assign,

  • just because they don't have any control dependencies.

  • Now, not guaranteed.

  • You're almost guaranteed.

  • Because they don't have any control

  • dependencies either way.

  • So you get non deterministic behavior.

  • But the read is cheap and runs fast,

  • and it has no dependencies on it.

  • But usually you have to compute something to get to the assign.

  • While with ref variables, because

  • of that aliasing behavior, you're

  • fairly likely to see the value after the assignment

  • under the assumptions that everything was

  • on the same device and stuff.

  • So you have all sorts of unit tests

  • that people write that get confusing.
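
A hedged tf.compat.v1 sketch of the regression being described: fetching a resource variable and a mutation in the same session.run usually shows you the pre-mutation value, because the cached read has no control dependency on the assign:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

v = tf.compat.v1.Variable(0.0, use_resource=True)
inc = v.assign_add(1.0)

with tf.compat.v1.Session() as sess:
    sess.run(v.initializer)
    # Typically prints [0.0, 1.0]: the fetched read races ahead of the
    # assign_add, since nothing orders them. With a ref variable you would
    # usually (but not dependably) have seen 1.0 for both.
    print(sess.run([v, inc]))
```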

  • This, I think, is the only regression we've had.

  • If you look at bugs from variables, there are many.

  • You'll see them in sanitizers, like thread sanitizer

  • and address sanitizer fire up on the TensorFlow runtime

  • often, due to those race conditions involving variables.

  • My favorite one is a combination of ControlFlow V1 and ref

  • variable, because ControlFlow V1 on the conditionals

  • doesn't create a single predicate.

  • It creates many switch operations.

  • And if the input to those switch operations is a reference variable, and one of the branches assigns to the variable, then half of your switch operations are going to execute one branch.

  • And the other half is going to execute the other branch

  • of the conditional.

  • And with that TensorFlow [INAUDIBLE],

  • this can lead to bizarre, undefined behaviors.

  • This is a very fun one.

  • And another problem is that you can have optimizations

  • that you can apply.

  • For example, Grappler likes to rewrite things

  • like tensor plus 0 to tensor.

  • Because sometimes that zero might have been added there

  • by some complicated graph, that graph

  • that I just managed to constant fold and prove

  • that it's a zero.

  • And due to the implementation details

  • of ref variables, plus is guaranteed to copy a variable.

  • So if you wanted to copy a variable so that you could have

  • its value before a write to compare it with the value

  • after a write and find out by how much it

  • changed, and Grappler rewrites your plus zero to just

  • keep the value of the variable,

  • now your program has been broken.

  • So you have all these very subtle interactions

  • between things that you would think are harmless.

  • So you see a few hacky patterns in TensorFlow a lot.

  • You see people putting lots of identity tensors

  • in different devices to force a send and receive,

  • to force a copy.

  • You also have the gradient code.

  • It has this fun fact, where if you're backpropping

  • for a deep neural network, once you've

  • computed the gradient with respect to a variable,

  • you can do two things.

  • You can update the value of the variable.

  • Or you can compute the gradient with respect

  • to the input layer.

  • Well, computing the gradient with respect to the input layer

  • is a matrix multiplication, or transpose convolution

  • between the value of the variable

  • and the upstream gradients.

  • So if you've already mutated the value of the variable,

  • you're now computing the wrong gradients.

  • And so this leaked into the gradient code,

  • which has a gate_gradients argument due to ref variables,

  • so that it protects the backprop from being affected

  • by assignments to the variables.

  • Which has side effects, which means things

  • like we lose performance.

  • Like the topmost layer of a neural network.

  • You only need to compute the gradient with respect

  • to the variables, not with respect to the inputs.

  • But because of the gating code, we

  • have to force the computation with respect to the inputs

  • so that you can guarantee that if there was a layer before it,

  • we would have updated that variable before we

  • had seen the gradient with respect to those inputs.

  • It also does not allow us to overlap variable updates very

  • well with their gradient computation.

  • I can keep going.

  • There is a lot of code in this send, receive

  • scheduling that tries to prevent complicated deadlocks that

  • can happen when you have variable assignments, and also

  • complicated cases where you tend to not

  • send the value of the variable that you thought

  • you were sending.

  • AUDIENCE: So I guess a follow on from this

  • would be that in a world with only resource variables,

  • does this present opportunities to remove some behavior,

  • or things that people were kind of relying on?

  • ALEXANDRE PASSOS: Yeah.

  • There's a lot of code that we will

  • be able to delete if we no longer have

  • to support ref variables.

  • And a lot of this code is complicated, and buggy,

  • and very hard to maintain.

  • AUDIENCE: Let me ask the reverse of that question.

  • Do you know of anybody who is actually relying

  • on ref variable behavior?

  • ALEXANDRE PASSOS: Yes.

  • AUDIENCE: So that issue that I told you about,

  • the plus 1, there's this thing called

  • the global step in Estimator that

  • is incremented on every training step,

  • and is read on every training step.

  • And every estimator user has a bunch

  • of hooks that rely on checking the value of the global step

  • after every training step.

  • So everybody who is doing estimator training

  • in a single machine case is effectively

  • relying that they can read the global step right after it's written,

  • by just separately fetching it.

  • AUDIENCE: And they don't care necessarily

  • if the value is out of sync between the different reads?

  • AUDIENCE: The ref behavior ends up

  • being that they get the value after the assignment.

  • ALEXANDRE PASSOS: Because it's an int variable,

  • it gets force-placed on the CPU.

  • It has all these silent requirements that conspire to allow people to rely on this.

  • So our own code relies on this quite a bit.

  • In practice, it's not a big deal,

  • because most Estimator users are doing distributed training.

  • And when you do distributed training, your variables are up on other devices, so you no longer have this guarantee

  • that you will always read exactly the value

  • after the assignment.

  • So all the hooks have to be robust to not reading that.

  • But the unit tests for the hooks rely on the fact

  • that they all run on the same device.

  • That is a big one.

  • I have seen some cases where you might rely on the fact

  • that you can do both snapshots and sparse

  • writes to a variable efficiently in the ref variable case

  • with race conditions.

  • If you're implementing some neural computational memory

  • thingy, you might want that behavior.

  • And that's one of the cases where

  • I think we might need to just implement

  • a separate policy for the how to do

  • your variables to make it work.

  • AUDIENCE: So why not use variants

  • for literally everything?

  • Just get rid of all the other D types?

  • ALEXANDRE PASSOS: Because we can specialize the runtime

  • to the other D types to make it faster.

  • If you have variants, you have to do more runtime dynamism

  • to figure out-- like if you want to add two floats, now

  • you have to check at runtime that they are floats.

  • And you need to rewrap them into the variant thing, which has

  • one extra pointer dereference.

  • AUDIENCE: And you get also less [INAUDIBLE]..

  • ALEXANDRE PASSOS: Yeah.

  • AUDIENCE: Yeah.

  • While [INAUDIBLE], you mean?

  • AUDIENCE: Yeah.

  • Well, it might not have been two floats.

  • ALEXANDRE PASSOS: A float and an int.

  • You wouldn't know.

  • AUDIENCE: Or a float and a string.

  • ALEXANDRE PASSOS: Which is, incidentally,

  • one of the reasons why you want to move lists out

  • of variants so we can get better type checking for them.

  • So that you don't accidentally add like a list and a mutex

  • together or something like that.

  • AUDIENCE: But would that be just a list,

  • or a list with a pair of what the types of elements would be?

  • ALEXANDRE PASSOS: It will be very interesting

  • if we could extend the notion of type in a TensorFlow graph

  • to include a little more information than just an enum.

  • I think that's a separate project.

  • And I don't know if anyone is working on it.

  • AUDIENCE: But would that checking only

  • be the first step, though?

  • ALEXANDRE PASSOS: Yes.

  • AUDIENCE: That's not too bad.

  • AUDIENCE: So you mentioned having more types bloats the binary.

  • Is that just int and float?

  • Or 32 and 64, all of these?

  • ALEXANDRE PASSOS: All of these.

  • And incidentally, we also don't have

  • very good coverage of our types.

  • So as of the recording of this talk, a lot of the uint types

  • only work inside XLA.

  • And they sometimes work outside of XLA

  • if you go for the ops that don't actually look at the types,

  • like identity.

  • But if you try to do operations on them, most of the kernels

  • are not registered.

  • And there's no good reason for that,

  • other than like binary size and legacy.

  • It's just you end up with lots of holes

  • when you write more D types that take some time to patch.

  • OK.

  • I think this is about as much time as we have.

  • So thank you very much.

  • [APPLAUSE]
