ALEXANDRE PASSOS: Hi.
My name is Alex.
And I'm here to tell you today about resources and variants.
And really this is a talk about state in TensorFlow
and stuff that got accidentally represented in state
in TensorFlow for far too long.
So what is state?
I would love to be able to stand here, or rather
sit here, and tell you that an operation is stateful
if either executing it has a side effect,
or if its output depends on something
other than the value of its input.
But this is not what TensorFlow means by stateful.
Sadly, TensorFlow goes by the [INAUDIBLE] notion
that the meaning of a word is defined by its usage.
So state in TensorFlow is defined
by this one bit that gets flipped and means all sorts
of very interesting things.
So, for example, this slide is wrong.
tf.print is stateful.
It has a side effect.
Yay.
tf.data.Dataset.from_tensor_slices has no side effects,
because the dataset operations are value types,
and they're stateless.
And yet, that kernel is marked as stateful.
Because one of the effects of marking something
as stateful in TensorFlow is that it
disables constant folding.
And constant folding can be buggy with data sets.
Iterators, on the other hand, are stateful.
This might lead you to think that there
is some meaning to this.
But there are also some things in TensorFlow
that could go either way.
So to differentiate while loops, we have stacks.
So that when you're doing the forward pass of the loop,
you push things into a stack.
And when you're doing the backward pass,
you pop things from the stack.
So you can look at intermediate activations and stuff.
And those things were stateful in tf V1,
but they're stateless in tf V2.
Tensor lists that you can use to aggregate stuff
from many iterations of a loop into a single view,
or do the reverse, they're also stateful in tf V1 and stateless
in tf V2.
AUDIENCE: Is that because we didn't invent the stateless way
until later?
ALEXANDRE PASSOS: Because we did not invent the stateless way
until later.
Yes.
So I want to spend the rest of the talk talking about how
statefulness is represented in tf V1, some of the problems
with that, how we're fixing those problems in tf V2,
and how we can deal with state, and also with things
that are not necessarily easily representable
with dense tensors.
So how is statefulness represented?
In one of two ways--
the most obvious way is that if you go into the TensorFlow
source code, and you find where ops are registered,
you will see this bit:
SetIsStateful.
And the definition of state in TensorFlow
is that op defs that have this bit set are stateful.
And all sorts of places in the runtime
are going to look for that bit and behave differently
if that bit is set.
And people set the bit because they
want any of those behaviors.
And this is something we need to clean up.
And I think we might have a chance to clean this up
with the MLIR dialect of TensorFlow, which
is going to have finer-grained bits.
But until then, we're stuck with this one bit
that means too many different things.
So among other things, what does this bit mean?
It means that TensorFlow will not do constant folding.
This includes the two or three separate systems
in TensorFlow that do constant folding.
All of them know how to bypass stateful operations.
Similarly, there are at least two different places
in TensorFlow that do common sub expression elimination.
And they refuse to do common sub expression elimination
of stateful operations, which is very good, because if you're
to do that, and you have a neural network
with many layers, and your layers are initialized
from a random op, all of the layers with the same shape
would be initialized with exactly the same random values.
AUDIENCE: And all your prints would potentially
be collapsed into a single print.
ALEXANDRE PASSOS: Only prints of the identical string
would be collapsed into a single print.
Because otherwise we would have enough information
to disambiguate those.
But statefulness also means some things that
are not very obvious at all, like the fact that the op kernel
instances the runtime uses to represent the computation
are reused across sessions for stateful ops
that have the same name.
And there is also a somewhat long tail of obscure behavior
changes, like parallel_for behaving
slightly differently for stateful operations.
And people are known to set a stateful bit for any one
of these reasons and more.
The other way of representing state in tf
that we're trying to get rid of in tf V2
is the notion of a ref tensor.
And going back to the variable op,
it is this thing here, where you can
say that a tensor is either of a dtype,
or of a ref of that dtype.
And the reason why we did that
is that it's very convenient in many cases
to be able to keep information in the runtime that persists
across calls to session.run.
Specifically, the variables-- if you
had to write your code like this, where every session.run
you'd feed your variables and then you'd fetch them back,
and you were doing some kind of distributed training,
you would have so many network round
trips and so much extra latency for this,
it would be completely impractical.
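For illustration, here is a rough sketch of that feed-and-fetch
anti-pattern, as hypothetical TF1-style code rather than the exact
example on the slide:

    import numpy as np
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    # stateless "training" step: the parameters are fed in and the
    # updated values are fetched back out on every session.run
    w_in = tf.placeholder(tf.float32, [1000])
    grad = tf.ones_like(w_in)          # stand-in for a real gradient
    new_w = w_in - 0.1 * grad

    with tf.Session() as sess:
        w = np.zeros([1000], np.float32)
        for _ in range(100):
            # every step round-trips the whole parameter vector
            w = sess.run(new_w, feed_dict={w_in: w})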
So the idea of the variable op, which
is the thing that motivated the ref tensor, is that it's
like a constant, but mutable.
And if you try to dig through the runtime,
you'll find this piece of code, which
I think is the most concise representation I could find
of how we represent the distinction between a ref
tensor and an ordinary tensor.
This is what the input to an op kernel looks like.
And it's essentially a manually implemented oneof,
where it's either a manually constructed tensor--
and the ManualConstructor is there
just so we don't try to initialize it
in case we're not going to need it--
or a pair of a pointer to a tensor
and a pointer to a mutex.
And if you've ever programmed in C++,
you should be terrified right now, because you see a pointer,
and you see no comment about who owns this pointer,
and what is the lifetime of that pointer?
And a good third of the issues of ref variables
come from the fact that it's been impossible or very
hard to retrofit into the system a coherent notion of ownership
of this pointer that's going to be memory safe.
But that's not all.
The way the ref variables work is
that you have a graph that looks like this.
You have this variable node whose output
is a tensor that can change, and you can feed it
to an operation that mutates it, like assign,
or you can feed it to an operation that does not
mutate it, like identity.
If you feed it to an operation that does not mutate it,
like identity, the TensorFlow runtime
will silently cast that tensor star to a tensor.
So it makes another tensor object that
aliases the buffer pointed to by that tensor,
and just keeps going.
So the reason why I like this graph is that it's short.
It's simple.
If you look at every single gradient update
that we use for training, it kind of looks like this.
But it's also kind of tricky.
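For reference, a rough TF1-style sketch of a graph like the one on
the slide, assuming ref variables; the actual slide may differ:

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    v = tf.Variable(1.0, use_resource=False)  # ref variable
    read = tf.identity(v)                     # silently aliases v's buffer
    assign = tf.assign(v, 2.0)
    with tf.control_dependencies([assign]):
        out = tf.add(read, 1.0)               # add runs after assign...

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # ...but whether it sees the value before or after the assign
        # depends on devices, shapes, and executor aliasing details
        print(sess.run(out))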
So we have, I don't know, like 20, 30 people in the room now.
Can I get a show of hands on who thinks
that the result of the print is the value after the assign?
No one.
AUDIENCE: What do you mean?
The print?
ALEXANDRE PASSOS: So this graph, it
has an add that takes as input the identity of the variable,
and some constant.
And it prints.
And it has a controlled dependency
from an assign that mutates the value of the variable
to the add.
So how many people think this is enough to ensure
that add will see the value of the variable
after the assignment?
OK.
About five or six.
AUDIENCE: Yes.
ALEXANDRE PASSOS: How many people
think that add will see the value of the variable
before the assignment?
About two or three hands.
How many people think this is a segmentation fault?
[LAUGHTER]
No one.
And how many people think it depends on things that
are not written in the graph?
AUDIENCE: 100.
ALEXANDRE PASSOS: OK.
So all of you have been bitten by this, because I
got like 15 hands now.
This is completely non-deterministic,
and it depends on all sorts of runtime properties.
For example, if everything is in the same device,
and the assign does not change the shape of the variable,
because of the way we do aliasing inside the TensorFlow
executor, print will print the value after the assignment.
However, if the add is in a different device
from the variable, then most likely there will be an RPC,
and add will sometimes see the value after,
sometimes see the value before the assignment.
There is one case where add is guaranteed
to see the value before the assignment, which
is if the assignment changes the shape of the variable.
Because if the assignment changes
the shape of the variable, due to intricate details
of the implementation of Tensor and TensorBuffer,
we do not change your existing tensor buffer.
We just allocate a new one.
And by the time the identity runs,
it has already aliased the old tensor buffer.
And you can get a seg fault here,
as you might have guessed, if we're
talking about string dtypes.
Because add is defined for string types.
And if you have two separate threads that are reading
and writing to a string in C++, you're very likely to get a seg
fault or some other weird behavior.
So this is pretty complicated and kind of
unworkable in the long term.
You need to know about all sorts of things that are not
well documented, and that rely on specific details
of the implementation that are not guaranteed to stay stable.
And if you were to try to design something
like a compiler for TensorFlow, this
would be really hard to make work.
So we're not doing this anymore.
And I'm going to spend the rest of this talk, hopefully,
telling you about how we're fixing this
and imposing some order onto the situation in tf2.
And the interesting thing is that internally, the way
variables have almost always been represented in TensorFlow--
since the first open source release, I guess--
is that the state has been stored in this resource manager
object, which has a Create, a Lookup, and a Delete method.
And these can return some arbitrary type.
We use some RTTI magic to make sure
that the lookups are type safe at runtime.
We even implement RTTI on compilers that do not
have RTTI to make this work.
RTTI, sorry, is a C++ thing for runtime type identification.
And this is a perfectly reasonable API
if you wanted to represent state outside of the graph.
So the idea that we had in tf2 is
let's use this to represent the state as essentially
operations in the graph.
So there are still some issues with the resource manager,
like it's scoped to device objects.
And device objects have a weird lifetime.
Sometimes they outlive a session.
Sometimes they do not outlive a session.
And it's slightly different with eager execution.
And this can be very surprising in some cases, both
when you're doing parameter server training
and when you're not, and you accidentally
find yourself doing parameter server training,
unintentionally sharing parameters between two models
that are not supposed to share.
But overall, it's a reasonable API.
So what we did is we created a tensor dtype, just
like string, or int, or float, that
represents the information you need to look something up
in the resource manager.
And we call this, creatively, DT_RESOURCE.
The reason why this is a tensor is that this is just
another value.
So you can pipe it through a graph.
You can stack things together.
You can select them dynamically if you want.
Or you can just use them statically.
It's just a scalar most of the time.
You can have non-scalar DT_RESOURCE tensors,
but most of the interesting operations just want scalars.
And then you can use this tensor to manipulate a resource.
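For example, a small TF2 sketch; ReadVariableOp is spelled through
tf.raw_ops here just to make the handle explicit:

    import tensorflow as tf

    v = tf.Variable([1.0, 2.0])
    h = v.handle                  # a scalar tensor of dtype DT_RESOURCE
    print(h.dtype)                # tf.resource
    # the handle is just a value you can pass to ops that use it
    val = tf.raw_ops.ReadVariableOp(resource=h, dtype=tf.float32)
    print(val.numpy())            # [1. 2.]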
So internally it's, again, just the information
you need to make the lookup into the resource manager
minimally type safe-- so the device, a container, a name.
The container was this idea that we had originally that you
would be able to run many separate models
on the same parameter server and provide some kind of isolation
where you could reset the variables in one model,
but not the variables in the other model.
The way this was implemented made this very hard to use.
And I know very few people who rely on this now.
So these days, it's mostly just [INAUDIBLE].
But otherwise, it has a name and some information
to validate that you're looking up
the object of the right type.
But that's all there is to it.
And resources are special-cased in a couple of
places in the runtime--
not as many as the stateful bit.
And one of them is that if you create an op that specifically
manipulates-- either takes or returns-- a tensor of the resource
dtype, we mark it as stateful, because we assume
that if you're asking for a key to something in a resource
manager, you're probably going to monkey around with it.
And this at least removes the redundancy, because otherwise
you would have all these ops that would take resources,
modify state in a resource manager,
not be marked as stateful, and you
would have to wait until they got accidentally
constant folded together to see something break.
And the second one is that the placer will always
co-locate operations that manipulate
a resource with the device the resource is on.
And this is because you can't really
modify a structure that's in another computer
without running code on the other computer.
But mostly, resource handles are safe in the runtime.
And the interesting thing is that now our graph
that was very hard to read looks like this.
You have this VarHandleOp that represents
the resource handle, the key.
And you can pass that key to your assignment.
You can pass that key to your read operations, et cetera.
And now I'm pretty sure everybody should agree with me
that this graph, as written, has to return
the value of the variable after the assignment.
Otherwise, it's a bug.
And this is true.
There is no weird non determinism.
It doesn't matter whether the shape changes or doesn't
change, what D type you're dealing with,
what device things are on.
Also, there is no way to make this seg fault, I believe.
So it's substantially nicer.
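A minimal TF2 sketch of that guarantee; inside tf.function the
control dependency between the assign and the read is added
automatically:

    import tensorflow as tf

    v = tf.Variable(1.0)

    @tf.function
    def step():
        v.assign(2.0)
        return v + 1.0     # the read is sequenced after the assign

    print(step().numpy())  # always 3.0: no shape/dtype/device caveats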
There are still some subtle things in here.
One of them is ResourceGather.
It's an operation that you would think, why would I need this?
Because what it does is effectively
what read plus gather do, but in a single op.
And the reason why we have it is
that, if you think about it, if I really
want to guarantee forever that this graph
means always reading
the variable after the assign-- and if you had flipped
that control dependency between read and assign,
it would mean always reading the variable before the assign--
you might have to make a copy to ensure
that this memory is preserved.
And if you have a very large vector of embeddings,
making copies of it can be very expensive.
And we would like to provide good performance.
So really this resource thing is more
a specification of the meaning of a graph that
has these operations and less the specific details of how
they're implemented.
It's possible to have many valid implementations of this,
and they're going to have different performance
characteristics.
So, for example, if we offload our graphs to XLA for compilation,
XLA can take a cluster of ops that have a bunch of reads
and writes to variables, look at the state of the variables
before they're clustered, figure out what the state of variables
should be after the cluster, and rewrite
it to be a bunch of reads, some stateless computation, and then
a bunch of assigns.
And this correctly preserves the semantics of these operations.
And it's a perfectly valid way to do this.
We don't always run XLA, though.
And if you start thinking about this,
there are two relatively straightforward ways
you could implement variables.
And they have pretty strong performance trade offs.
A very obvious implementation is copy
on write, where you would copy the buffer for a variable
every time we write to it.
Another one is copy on read, where the read operation is
going to copy.
And then the assign operation is just always going to mutate.
The interesting thing is that copy on write,
if all you're doing is your standard SGD training where
you read a bunch of variables in the beginning,
do a bunch of forward and backward computation,
and then you write to the bunch of variables,
you can do this with zero copies.
Because by the time you're writing to the variables,
there are no outstanding reads left.
So yay.
Similarly, if you have embeddings,
and you are sparsely reading a few rows from your variable
in arbitrary, random order, and then
later on you're going to sparsely write to those rows,
we can do this with no copies if we have copy on read.
I mean, no extra copies.
Since the read would have to copy anyway,
because it's reading in an unstructured way
that we couldn't represent with strides or something like that.
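To make the two access patterns concrete, here is a small TF2
sketch; sparse_read and scatter_update are the kinds of sparse ops
in question, and the mode-switching policy itself is an internal
detail:

    import tensorflow as tf

    emb = tf.Variable(tf.zeros([1000, 64]))

    # dense pattern: read everything, then write everything
    # (friendly to copy-on-write)
    value = emb.read_value()
    emb.assign(value + 1.0)

    # sparse pattern: read and write a few rows
    # (friendly to copy-on-read)
    rows = emb.sparse_read(tf.constant([3, 17]))
    emb.scatter_update(
        tf.IndexedSlices(values=rows * 2.0,
                         indices=tf.constant([3, 17])))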
So which one do we choose?
And effectively, we chose both.
And we did this by storing a bit on variables
and having variables always start in a copy on write mode.
And as soon as you do any sparse operation on a variable,
we grab an exclusive lock, make any copies that we need,
and put it in copy-on-read mode.
This works reasonably well both for the case where you only
use a variable in dense operations,
and for the case where you only use it for embeddings.
It's not necessarily generally the best idea.
So I expect this policy might have to change and become
more refined over time.
But again, this is just an implementation detail.
And this does not affect the correctness of the programs
that are running on TensorFlow.
So I think it's a big improvement of--
AUDIENCE: Can I clarify?
ALEXANDRE PASSOS: Yes.
AUDIENCE: So I thought when we read something,
effectively it makes a copy.
It seems like this copy is specifically
in the context of [INAUDIBLE].
ALEXANDRE PASSOS: It pretends to make a copy.
So the definition of a read is this:
an operation that looks at the output of a read
is guaranteed to see the effect of every operation that
had an edge pointing to the read, and not
to see the effect of any operation that had
an edge pointing from the read.
You can implement this by making a copy on read.
You can also implement this by making a copy on write.
You can also implement this in more complicated ways
that might never make a copy.
AUDIENCE: So our default copy on write
looks at the reference count, and if it's
one, just updates in place.
And our default read operation just increments the reference
count.
ALEXANDRE PASSOS: Yes.
AUDIENCE: The default copy on write implementation.
ALEXANDRE PASSOS: The copy on write semantics do that.
And I assume we're going to eventually switch
to more complicated policies.
For example, we could look at the graph,
and then decide what policy we're
going to use to write the variables on this graph.
Or we could let users configure this.
There are many options here, but ideally, we
should be able to implement all of them
without requiring that users change the graph
structure to get better performance
or to get correctness of their behavior.
And this is what's important about this, because this means
that we got to fundamentally and dramatically change
the back end, like use a compiler,
and not have to worry about preserving
bug compatibility, like what happens if you
alias the output of identity with another variable,
or something like that.
So far I've mostly focused on how
the runtime treats variables.
But the same fundamental patterns
of a handle tensor and operations that read and write
to it is used in all sorts of other bits
of runtime state in TensorFlow.
This includes the dataset iterators, FIFOQueues,
HashTables, and a few more things that I have forgotten.
AUDIENCE: Are mutexes resources?
ALEXANDRE PASSOS: Mutexes, they're a resource.
But they also have a variant that represents the mutex lock
object.
So it's a slightly funner situation.
But as far as the resource part of the mutex
is concerned, it's, again, a mutable resource
tensor that has a handle.
It has operations to modify it.
So this is nice.
And this is just essentially what the runtime looks like.
And if you have this picture in your head,
you should be able to mostly predict
the behavior of TensorFlow programs that manipulate state.
One other bit of TensorFlow is shape inference.
I'm sure if you've looked at TensorFlow op registrations,
you've seen annotations like this, where we set a shape function.
The result of shape inference is not persisted in the graph.
It's ephemeral.
It's produced every time we create a graph,
or while we're importing a graph.
But this is very, very useful to ensure not only
that we know how to interpret the graph correctly
and that the graph is valid.
It's also very helpful during the graph building process,
where user code can inspect the inferred shapes of nodes,
and make different decisions as to whether things can be
dynamic or static in the graph.
And if all the resource tensors are scalars,
that would make it hard to do shape inference
on stateful operations that manipulate resources.
So we did kind of a hack that should be improved
and added a side channel to the shape inference process,
this output_handle_shapes_and_types,
which can store an arbitrary list of shape and dtype objects.
And different resources and variants
are going to assign different semantics to this.
Operations like cast that do not affect the shape just pass
the shapes and dtypes through, and then
operations that are aware of what the resource handles
are doing are going to look at this
and assign meaning to them.
So variables just store a single shape and dtype
for the value of the variable.
Tensor lists store a shape and dtype in there for the shape
and dtype of the elements in the tensor list.
Iterators store the shapes and dtypes of all the tensors
that you're going to get when you call get
next, so that we can properly do shape
inference on those graphs.
So now that you mostly have a reasonable picture of what
resources look like in the runtime,
I'd like to pop the stack and talk a little bit
about the Python side.
So this is going to mostly focus on variables, because I think
there are a few interesting things in there that
will, again, generalize through other bits in the runtime.
The first one is that if you've used TensorFlow before,
you know the variables act like tensors.
You can pass them to operations.
You can use the operators on them.
And part of this reason is historical.
I think the first implementation of Variable in TensorFlow
was literally just the return value of the Variable op.
And that happened to be a tensor of reference D type.
Later we felt the need to replace that with a class.
So we worked somewhat hard to make that class behave exactly
like a tensor.
And sometimes library writers
downstream from TensorFlow want to have their own types
that behave like tensors, or behave like variables.
So how do you do this?
And I strongly believe this is all you need.
First thing to do is you need to make your type
convertible to a tensor.
So there is tf.register_tensor_conversion_function,
which takes the type and a function
to convert that type to a tensor.
In the case of a variable, it just
reads the value of the variable.
Easy.
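A minimal sketch of that hook, with a hypothetical VariableHolder
wrapper type:

    import tensorflow as tf

    class VariableHolder:
        # hypothetical wrapper type that should act like a tensor
        def __init__(self, value):
            self.var = tf.Variable(value)

    def _holder_to_tensor(value, dtype=None, name=None, as_ref=False):
        # just read the wrapped variable, like tf.Variable's converter
        return tf.convert_to_tensor(value.var.read_value(),
                                    dtype=dtype, name=name)

    tf.register_tensor_conversion_function(VariableHolder,
                                           _holder_to_tensor)

    h = VariableHolder(3.0)
    print(tf.add(h, 1.0))  # works: h is converted by reading the variable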
There are some special cases in there
to deal with reference types that are no longer needed,
thankfully.
Another thing that you need to do
is register your type as a dense-tensor-like type,
which means that implicit stacking--
just putting many instances of that type in a list--
will work by silently reading them and then calling stack.
Then you need to overload all the operators.
And if you look, there's this method,
_OverloadAllOperators, in the tf.Variable class, whose
implementation steals all the operator
overloads from Tensor.
And there is a rule in TensorFlow
that session.run is not allowed to add nodes to the graph.
This can catch all sorts of terrifying bugs.
So it's good that we have this rule.
But if you want to be able to fetch the value of a thing,
then you need to implement this _as_graph_element
method, which session.run pokes to see
if it is there, and which is supposed to return
a pre-existing tensor.
And so variables have to record a tensor that
is going to be the result of reading them.
By storing that, you can use session.run to fetch them.
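A minimal TF1-style sketch of that hook, with a hypothetical
Fetchable class:

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    class Fetchable:
        def __init__(self):
            self.var = tf.Variable(7.0, use_resource=True)
            # pre-create the read: session.run may not add nodes
            self._read = self.var.read_value()

        def _as_graph_element(self):
            # the private hook that session.run probes for
            return self._read

    f = Fetchable()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(f))  # fetches the pre-created read tensor: 7.0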
There is also one more tricky bit
about the Python implementation of Variables
that you might need to know, which
is that with ref variables, because they just convert
to a ref tensor, the following works: you can take the return
value of an assignment operation,
and call another assignment operation on it,
and do that as many times as you want,
because assignment operations chain.
And with resource variables, clearly,
the assignment operations don't have a natural return value.
Because if you were to return something like the handle,
the handle is useless.
It's the same as the input.
No point in returning that.
If we were to return the value of reading the variable,
now that's an operation that might potentially
be very expensive.
And you'd like to not read it unless you're
going to need to read it.
So we added this notion of an unread variable, which
is a class such that if you take a control dependency on it,
you just get a control dependency on the assignment operation.
But if you try to read its value,
it's guaranteed to read the value after that assignment
operation.
And because this acts like a variable,
we can use this to make the chained assignment
work and a few other things.
So if you see unread variables in your graph,
you should know that this is the kind
of thing you're dealing with.
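For example, assuming TF2 eager execution:

    import tensorflow as tf

    v = tf.Variable(1.0)
    # each assign returns a variable-like object, so assignments
    # chain, and reading the result sees the value after that assign
    v.assign(2.0).assign_add(3.0)
    print(v.numpy())  # 5.0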
But if you've been paying attention,
you've seen that the core set of operations for a variable
does not self initialize.
And this is by design.
A lot of the early use cases of TensorFlow
were optimized for shared parameter server training.
And in that case, when you have multiple parameter servers,
and multiple workers all talking to each other,
you might want to initialize variables from scratch.
You might want to load them from a checkpoint.
And depending on your training policies,
you might want to do different things.
So the graph is agnostic as to how you do those things.
The runtime is agnostic how you do those things.
And the execution, like the session.run
gets to setup policy.
This is very important, because we
had to change the policy many, many times until we finally
made it mostly bug free in Estimator.
But with tf2, since we're not necessarily
saying that the default way to use
TensorFlow is shared parameter server training,
we went for ergonomics over safety.
So in tf V2, mostly variables are initialized on creation.
In eager execution, this is very easy to do,
because as soon as you execute the op that creates a variable,
we initialize it for you.
In tf.function, it can be a little trickier,
because the initializer for a variable
might be defined inside a function.
And there are a few ways to handle this.
And I'm going to go into detail on this in the tf.function
talk.
Similarly, variables sharing is a complicated issue.
If you're doing shared parameter server training,
you would like all the workers that
connect to the same parameter server
to see the same variable so they can see each other's writes
to those variables.
And the way we did this was to say the variables
are shared by name.
So in tf V1, variable names are load bearing.
If you change, or edit, or modify the names of variables,
you dramatically change the behavior of the program.
This is a questionable decision in all cases,
because variable names look very
harmless when you read code.
So in tf2, we chose to make names non load bearing.
Internally we're still using the runtime that
assumes a load bearing name, but we always
use a UID to hide that fact.
And if you want to have shared names for our parameter server
training, you can, because you can control
that detail in the runtime.
But the Python API no longer makes that straightforward.
And now you might be asking, well,
how would I be able to change the details
of how variables are implemented?
Another thing that we're adding in tf V2
is this notion of a variable creator
that lets you control how variables are created.
And so Variable is a metaclass,
so that when you call tf.Variable
you might not actually get an instance of Variable.
You might get an instance of some subclass of variable
that defines some specific behaviors.
In tf V1, by default, you get ref variable.
In tf V2, by default, you get resource variable.
But in other contexts, you might get other instances.
The meta class code itself is not particularly interesting.
It's just you should probably know
this exists if you're dealing with variables in Python.
So, for instance, tf.function
uses its own subclass of Variable,
which behaves slightly differently from the V1 graph
resource variable when it comes to initialization,
so that it can capture initializers and things
like that.
And it's nice that we can keep that code encapsulated
within the tf.function package, and not push its complexity out
to the same variable class that is used everywhere.
Similarly, tf.distribute might need
to create replica variables or mirrored
variables with complicated read and write modes.
And that complexity can be mostly centralized
in the tf.distribute package instead of being
spread out all over TensorFlow.
So when you're inside a distribution strategy scope
when you create a variable, your distribution strategy
is probably setting up a variable creator
that's going to do the right thing for you.
And this is very important in TPUs, and in mirrored,
and stuff.
So it's good that we have this kind of flexibility.
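A minimal sketch of a variable creator, with a hypothetical
logging creator:

    import tensorflow as tf

    def logging_creator(next_creator, **kwargs):
        # a toy creator: inspect the kwargs, then delegate down
        # the creator stack
        print("creating variable:", kwargs.get("name"))
        return next_creator(**kwargs)

    with tf.variable_creator_scope(logging_creator):
        v = tf.Variable(1.0, name="my_var")  # routed through the creator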
But just like how creation is configurable,
deletion can be a little tricky.
So a nice side effect of having load bearing names
for variables in tf V1 is that it encourages
you to have very few of them, and to think very carefully
about what each of them was called.
So the set of variables throughout the lifetime
of a TensorFlow program was mostly fixed,
which meant that deleting variables is mostly
not a big deal.
And you could get away with very broad, wide big hammers,
delete variables, like session.reset.
But in tf V2, with eager execution and functions,
it is very easy to create a lot of variables.
And you can create temporary variables.
So we do need to clean up after ourselves or we're going
to have memory leaks.
And you'd think that since this is Python,
you should be able to just override __del__ to get variables
to clean up after themselves.
But it's not that simple.
It turns out that if you override __del__ on an object,
that object can become part of a reference cycle.
And if you've ever looked at the implementation of tf.Variable,
you'll see it has tens of members.
So any one of them could point to something
that could point to something, that could point back
to that variable.
And if anything with a __del__ is part of a reference cycle,
that entire cycle becomes uncollectable,
and we have leaked that memory forever.
However, there is an easy workaround,
which is that if you make an object that is guaranteed
to only have one or two data members that cannot possibly be
part of a reference cycle, you can override __del__ on that
object, and then take the complicated object that might
be part of a cycle, and store a pointer from that expensive
object to the small, cheap object that knows how to do
the cleanup.
This does not make the cycle uncollectable,
and still guarantees that the clean up
happens when the first object goes out of scope.
Now, the worst that can happen is
that our reference cycle means that your garbage collection is
not immediate.
It's just delayed until whenever the Python garbage
collector decides to run.
But that still guarantees correctness and a lack
of leaks, even though it might be a little surprising
that if you use sufficiently complicated objects,
your GPU memory might take a while to clean.
And you might need to use Python's gc module to force
it to clean up after itself.
And this pattern of making a deleter object
is used everywhere in the TensorFlow code base
where we have resources and we
need to override __del__, just to ensure
that we have orderly cleanup.
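A minimal sketch of the pattern, with hypothetical class names;
TensorFlow's real version of the small object is the
EagerResourceDeleter in resource_variable_ops.py:

    class _CleanupHandle:
        # tiny object: one data member, no chance of a reference
        # cycle, so it is safe to give it a __del__
        def __init__(self, name):
            self._name = name

        def __del__(self):
            # hypothetical cleanup; the real deleter frees the resource
            print("freeing resource for", self._name)

    class ComplicatedThing:
        # big object with many members that might form cycles
        def __init__(self, name):
            self.name = name
            self.maybe_cyclic = []
            self._cleanup = _CleanupHandle(name)  # cleanup rides along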
So that's essentially all you need
to know about resources to effectively use them
in TensorFlow.
And now I'd like to move on to talk about variants.
And I put those two things together,
because for the longest time there
was a conflation of uses between resources and variants,
because resources were the easiest way to hook
arbitrary C++ code into the TensorFlow runtime.
But it turned out that a lot of the things we were
using resources to do were better served not by arbitrary
C++ code, but by stateless operations on immutable values.
And why would you want that?
Mostly because stateless things on immutable values
are much easier to compile.
And they're also much easier to differentiate through.
And differentiation is something we really care about.
So [INAUDIBLE] had the idea of making a separate dtype,
variant, for immutable arbitrary C++ stuff.
Its implementation is very, very similar to something like
absl::any and other dynamic types in C++,
with a few bells and whistles to integrate better into the tf
ecosystem.
So a canonical example of variants
is the tensor list ops, which are used under the hood
to implement stacks and tensor arrays in TensorFlow V2.
But also they are one of the original motivating factors.
And they look like this.
You can have an op that makes an empty tensor list.
Then you can have another op that takes a list and a value,
and spits out a new list that represents the concatenation
of those things.
And then you have an op that takes a list,
and spits out a slightly shorter list
and the value that was removed from the list.
And you can inspect those values and manipulate them.
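A small sketch of that flow using the raw list ops; the
user-facing wrappers differ:

    import tensorflow as tf

    scalar_shape = tf.constant([], dtype=tf.int32)  # shape of each element
    # each op takes a list value and returns a new list value
    l = tf.raw_ops.EmptyTensorList(element_shape=scalar_shape,
                                   max_num_elements=-1,
                                   element_dtype=tf.float32)
    l = tf.raw_ops.TensorListPushBack(input_handle=l,
                                      tensor=tf.constant(1.0))
    l, top = tf.raw_ops.TensorListPopBack(input_handle=l,
                                          element_shape=scalar_shape,
                                          element_dtype=tf.float32)
    print(top.numpy())  # 1.0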
And the fun thing about these is that because these are all
immutable, you can easily define their gradients.
And if you think about it, the gradient of push is pop.
The gradient of pop is push.
The gradient of set item is get item.
It mirrors very nicely.
So you get code that's efficiently differentiable up
to higher orders.
And internally, the tensor list structure can be very simple.
It's just a std::vector of tensors
and some metadata about shapes and dtypes.
We need these methods, Encode and Decode,
so that we can serialize and deserialize lists in case
we need to send them across devices,
though specific variants can choose
to not implement those methods and throw errors instead.
And if you've been following this,
though, and you saw the previous slide where I had a std::vector,
and you saw the slide before that where the ops would
take one and return a new one, you
might have been terrified that this had automatically
made every single recurrent neural network O of N squared.
where a kernel is allowed to ask the runtime if anyone else is
ever going to use one of its input tensors again.
And if the answer to that question is no,
the kernel can go and mutate that tensor.
So this, incidentally, is how tensor lists work.
And in the normal use cases, like when
you're using them for stacks, after you've pushed something
into a stack, there are no more references
outstanding to the previous value of the unpushed stack.
So we can just reuse its memory and append,
and get exactly the same O of N performance
that you would expect to get from the stateful version.
However, we're doing this with stateless operations.
So we get to differentiate through this code.
And if you do end up holding an extra reference to something
that you want to mutate or apply a mutating op later,
the system will silently do a copy behind you
to ensure the correct behavior.
And this is also good, because we, again,
managed to decouple the behavior from the implementation.
So we can take operations that have exactly this meaning,
give them to a compiler.
And the compiler might be able to [INAUDIBLE] that
copy if it can prove that it doesn't need to happen.
Or use a different internal representation
for these tensors.
Yes.
AUDIENCE: And this copy is just the copy
of the vector of tensors, or the tensor buffers themselves?
ALEXANDRE PASSOS: The tensor buffers
themselves never need to be copied, because that's
a separate level.
But again, even if you just copy the vector of tensors,
you can still see that show up in some profiles.
So one more thing you need to do if you
want to define your own variant dtype
and have it work seamlessly with automatic differentiation
is you need to tell TensorFlow how to add two of these,
and how to make a zeros-like, because these
are operations that autodiff needs to do all the time.
And happy to talk about this some other time.
It has something to do with differentiating operations that
have multiple outputs, and doing that
in a single bit of a code that doesn't have to be aware
that some of those outputs might not have been used so they
do not have upstream gradients.
So essentially, this is it.
This should be all you need to know to understand how state
and how arbitrary C++ stuff is represented in TensorFlow.
There are many other variant dtypes
other than the tensor list.
That is just one.
That was one of the first ones.
And it's one that showcases all the little bits
in there, which is why I chose to talk about it.
Similarly, there are many other resource dtypes
other than the variable one.
But variable is by far the most complicated.
So if you understand how that works,
you should understand all the others.
Happy to take questions now.
But if you're watching this on YouTube
and you're not in this room, you can email your questions
to developers@tensorflow.org, where we have discussions
about TensorFlow internals.
AUDIENCE: I have one question.
So could you talk a little bit about the decision
to have this one catch-all type versus making
it easy to add new dtypes?
ALEXANDRE PASSOS: The one catch all type for resource?
For variant?
AUDIENCE: For variant.
Yeah.
ALEXANDRE PASSOS: Ah.
AUDIENCE: [INAUDIBLE]
ALEXANDRE PASSOS: Yeah.
That's a questionable decision.
I think it mostly comes from the fact
that originally TensorFlow did not make it
very easy to add new dtypes.
There are all sorts of enumerations
and specializations that have to happen on a per-type basis.
So having a hook that lets you easily
add a type without any changes to the runtime
was considered important.
I don't necessarily think that this is the end stage.
And maybe at some point in the future
we should stop representing lists as a variant,
and start representing them as a list dtype,
which would allow the runtime to specialize for them in a better
way.
AUDIENCE: So dtype would become a string instead of an int.
ALEXANDRE PASSOS: But in the full case,
where the dtype has become a string instead of an int,
we'd have to stop having switches on dtypes
everywhere in our code base.
But it might make sense to add lists as one of the ints.
Then again, each new dtype will dramatically
increase the TensorFlow binary size,
because we need to register all sorts of kernels
for all dtypes, even ones we don't need.
There are a few unintentional side effects.
It makes sense to specialize to a small set of things,
like fast, dense buffers of numbers,
because that's most of our expensive computation.
AUDIENCE: What have been some of the more common pitfalls
that you've seen that people had?
Like buggy or racy code initially.
And as they've gone to resource variables,
they need to either restructure their code,
or work with the bugs that [INAUDIBLE].
ALEXANDRE PASSOS: There are many, many, many, many, many,
many bugs.
AUDIENCE: Global step is one.
ALEXANDRE PASSOS: Yeah.
Sorry?
AUDIENCE: Global step is one.
ALEXANDRE PASSOS: Yeah.
The one that most people see is that if you go back
to this guy?
Yeah.
_as_graph_element.
One unintended consequence of this
is that because session.run is not
allowed to create a new operation in the graph,
reading the variable has to pre-create a tensor that
reads its value, which means that if you fetch
a variable on the same session.run step
as you have done some mutating operation,
and the variable's a resource variable,
you're guaranteed to see the value before the mutation.
The read will happen before the assign,
just because they don't have any control dependencies.
Now, not guaranteed.
You're almost guaranteed.
Because they don't have any control
dependencies either way.
So you get non deterministic behavior.
But the read is cheap and runs fast,
and it has no dependencies on it.
But usually you have to compute something to get to the assign.
While with ref variables, because
of that aliasing behavior, you're
fairly likely to see the value after the assignment
under the assumption that everything was
on the same device and stuff.
So you have all sorts of unit tests
that people write that get confusing.
This, I think, is the only regression we've had.
If you look at bugs from ref variables, there are many.
You'll see sanitizers, like thread sanitizer
and address sanitizer, fire on the TensorFlow runtime
often, due to those race conditions involving variables.
My favorite one is a combination of ControlFlow V1 and ref
variables, because ControlFlow V1 conditionals
don't create a single predicate.
They create many switch operations.
And if the input to those switch operations
is a ref variable, and one of the branches assigns
to the variable, then half of your switch operations
are going to execute one branch,
and the other half are going to execute the other branch
of the conditional.
And with that, TensorFlow [INAUDIBLE],
this can lead to bizarre, undefined behaviors.
This is a very fun one.
And another problem is with optimizations
that you might want to apply.
For example, Grappler likes to rewrite things
like tensor plus 0 to tensor.
Because sometimes that zero might have been added there
by some complicated graph that Grappler
just managed to constant fold and prove
that it's a zero.
And due to the implementation details
of ref variables, plus is guaranteed to copy a variable.
So if you wanted to copy a variable so that you could have
its value before a write to compare it with the value
after a write and find out by how much it
changed, and Grappler rewrites your plus zero to just
keep it the value of the variable,
now your program has been broken.
So you have all these very subtle interactions
between things that you would think are harmless.
So you see a few workaround patterns in TensorFlow a lot.
You see people putting lots of identity nodes
on different devices to force a send and receive,
to force a copy.
You also have the gradient code.
It has this fun fact, where if you're backpropping
for a deep neural network, once you've
computed the gradient with respect to a variable,
you can do two things.
You can update the value of the variable.
Or you can compute the gradient with respect
to the input layer.
Well, computing the gradient with respect to the input layer
is a matrix multiplication, or transpose convolution
between the value of the variable
and the upstream gradients.
So if you've already mutated the value of the variable,
you're now computing the wrong gradients.
And so this leaked into the gradient code,
which has a gate_gradients argument, due to ref variables,
so that it protects the backprop from being affected
by assignments to the variables.
This gating has side effects, which means things
like we lose performance.
Like the topmost layer of a neural network.
You only need to compute the gradient with respect
to the variables, not with respect to the inputs.
But because of the gating code, we
have to force the computation with respect to the inputs
so that you can guarantee that if there was a layer before it,
we would have updated that variable before we
had seen the gradient with respect to those inputs.
It also does not allow us to overlap variable updates very
well with their gradient computation.
I can keep going.
There is a lot of code in the send/receive
scheduling that tries to prevent complicated deadlocks that
can happen when you have variable assignments, and also
complicated cases where you tend to not
send the value of the variable that you thought
you were sending.
AUDIENCE: So I guess a follow on from this
would be that in a world with only resource variables,
does this present opportunities to remove some behavior,
or things that people were kind of relying on?
ALEXANDRE PASSOS: Yeah.
There's a lot of code that we will
be able to delete if we no longer have
to support ref variables.
And a lot of this code is complicated, and buggy,
and very hard to maintain.
AUDIENCE: Let me ask the reverse of that question.
Do you know of anybody who is actually relying
on ref variable behavior?
ALEXANDRE PASSOS: Yes.
AUDIENCE: So that issue that I told you about,
the plus 1, there's this thing called
the global step in Estimator that
is incremented on every training step,
and is read on every training step.
And every estimator user has a bunch
of hooks that rely on checking the value of the global step
after every training step.
So everybody who is doing estimator training
in a single machine case is effectively
relying on the fact that they can read the global step right
after it's incremented, by just separately fetching it.
AUDIENCE: And they don't care necessarily
if the value is out of sync between the different reads?
AUDIENCE: The ref behavior ends up
being that they get the value after the assignment.
ALEXANDRE PASSOS: Because it's an int variable,
it gets force-placed on the CPU.
It has all these silent properties
that conspire to allow people to rely on this.
So our own code relies on this quite a bit.
In practice, it's not a big deal,
because most Estimator users are doing distributed training.
And when you do distributed training, your variables end up
on other devices, and you no longer have this guarantee
that you will always read exactly the value
after the assignment.
So all the hooks have to be robust to not reading that.
But the unit tests for the hooks rely on the fact
that they all run on the same device.
That is a big one.
I have seen some cases where you might rely on the fact
that you can do both snapshots and sparse
writes to a variable efficiently in the ref variable case
with race conditions.
If you're implementing some neural computational memory
thingy, you might want that behavior.
And that's one of the cases where
I think we might need to just implement
a separate policy for the how to do
your variables to make it work.
AUDIENCE: So why not use variants
for literally everything?
Just get rid of all the other dtypes?
ALEXANDRE PASSOS: Because we can specialize the runtime
for the other dtypes to make it faster.
If you have variants, you have to do more runtime dynamism
to figure things out-- like if you want to add two floats, now
you have to check at runtime that they are floats.
And you need to rewrap them into the variant thing, which has
one extra pointer dereference.
AUDIENCE: And you also get less [INAUDIBLE].
ALEXANDRE PASSOS: Yeah.
AUDIENCE: Yeah.
While [INAUDIBLE], you mean?
AUDIENCE: Yeah.
Well, it might not have been two floats.
ALEXANDRE PASSOS: A float and an int.
You wouldn't know.
AUDIENCE: Or a float and a string.
ALEXANDRE PASSOS: Which is, incidentally,
one of the reasons why you want to move lists out
of variants, so we can get better type checking for them--
so that you don't accidentally add a list and a mutex
together or something like that.
AUDIENCE: But would that be just a list,
or a list with a pair of what the types of elements would be?
ALEXANDRE PASSOS: It would be very interesting
if we could extend the notion of type in a TensorFlow graph
to include a little more information than just an enum.
I think that's a separate project.
And I don't know if anyone is working on it.
AUDIENCE: But would that checking only
be the first step, though?
ALEXANDRE PASSOS: Yes.
AUDIENCE: That's not too bad.
AUDIENCE: So you mentioned having
more types bloats the binary.
Is that just int and float?
Or 32 and 64, all of these?
ALEXANDRE PASSOS: All of these.
And incidentally, we also don't have
very good coverage of our types.
So as of the recording of this talk, a lot of the uint types
only work inside XLA.
And they sometimes work outside of XLA
if you go for the ops that don't actually look at the types,
like identity.
But if you try to do operations on them, most of the kernels
are not registered.
And there's no good reason for that,
other than binary size and legacy.
It's just you end up with lots of holes
when you add more dtypes, and they take some time to patch.
OK.
I think this is about as much time as we have.
So thank you very much.
[APPLAUSE]