ALEX PASSOS: Hi, my name is Alex, and I'm here again,
this time to talk about the TensorFlow eager execution
runtime.
This is a very broad topic and there
are lots and lots of things we could cover,
so I'm going to lightly graze many, many different parts
of our code base.
I'll give you a lot of function names
and file names and things like that
that you can use to familiarize yourself with something,
but if you're in the room right now, by all means,
ask questions.
Like, this is-- there's some buffer time at the end
to account for variability here.
And I think the more we can maximize shared understanding
of this stuff, the better.
So the way I thought we could go about this
is to do a very, very deep dive on what
actually happens in TensorFlow, starting from TF2--
when you type a very simple line of code, in this case
a tf.nn.relu of some Python numbers.
And I think if you were to start doing this, probably
the first thing you'd do is grep the TensorFlow code
base to find where we define ReLU.
And if you do that, you will find
that we have some definitions of ReLU in Keras,
but you won't find a single definition of ReLU itself
in core TensorFlow, and that might be
a little surprising at first.
It might put a damper on this whole
let's-find-out-what-actually-happens-when-we-run-ReLU
business, but the reason there's no Python definition of ReLU
is that it's enough to implement it in C++, and we didn't need to put
a complicated Python API around it.
We can just generate the Python code to call ReLU.
So the way it's actually defined,
it's defined using the same mechanism
we use to register all the ops in TensorFlow.
So for every core operation in TensorFlow
that's visible to the runtime, we
have a registration that looks like this--
REGISTER_OP.
It takes a name and you say how many inputs, how many outputs,
what attributes it has.
If you want to know more about attributes-- what things
are allowed in there-- they are how
we can make our ops polymorphic and have the same operation
produce different types of outputs, different numbers of outputs,
and things like that.
There is a lot of documentation about this in TensorFlow.org,
if you search for how to define a new op.
And another interesting thing in there
is that we also register a shape_inference function.
ReLU, thankfully, is one of the simplest
ops we have-- it just has one input, one output,
they have the same shape.
So we can use a pre-built shape_inference function
that just says the shape does not change.
Other ops will have vastly more complicated
shape_inference functions.
And the nice thing is that we can
run these functions offline while we're building graphs,
without actually having the values of any tensors,
and still be able to prove things
about the shapes of intermediate tensors and outputs
of your computation.
This is maybe the best tool we have for catching bugs now.
So if you want to look at the shape_inference code,
that's where you'd hook into.
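To make that concrete, here is roughly what the ReLU registration looks like (a sketch reconstructed from memory of tensorflow/core/ops/nn_ops.cc, so treat the exact attribute list as approximate):

    REGISTER_OP("Relu")
        .Input("features: T")
        .Output("activations: T")
        .Attr("T: {realnumbertype}")
        // ReLU's shape function: the output shape equals the input shape.
        .SetShapeFn(shape_inference::UnchangedShape);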
Now that we have that registration,
we run some complicated code that generates Python code
to actually call ReLU.
And if you look in bazel-genfiles,
you will find a file named gen_nn_ops.py, and this file has
the actual def relu that we call.
And as you can see, it's not pretty
and there's a lot of stuff going in there.
The first line deals with dispatching
so that we can define ReLU not just for normal tensors,
but also optionally for sparse tensors
and ragged tensors and other composite types.
The second line has tf_export and what this does
is define the TensorFlow public API.
Every symbol that you get when you are using TensorFlow
via tf.something is defined somewhere
by a tf_export decorator like this one.
There will be a future video on how exactly this works
and why we do things this way instead
of relying on Python's normal, you know,
namespacing mechanism.
But you can probably guess that it's because TensorFlow
is very complicated.
But essentially, you'll see this,
and this generated code for ReLU has a bunch of cases in it--
there are roughly four.
You have an eager fast path, an eager slow path,
you have a graph mode path, and kind of a side hook
for the symbolic execution.
But here, let's focus on the eager paths.
In the first one, the first thing
that we're actually doing here is
we're checking to see if we're in eager mode or not.
And to do that, we look at this context thing.
This context thing is part of the core of the TensorFlow v2
runtime.
It's the moral equivalent to the session,
but it's longer lived than the session
and represents more things.
So what is it?
From Python, the context is this class
that's defined in a file called context.py.
And it's a collection of a lot of things
that your Python program needs to be aware of to connect
to the TensorFlow runtime.
It stores things like, am I in eager mode or in graph mode?
Or if someone used a with tf.device decorator, what
device am I supposed to be executing code in?
And it stores things like, what's
your name scope and many other things.
And all of this information that the context
stores, in general--
the things that can change during the execution
of a program--
they're all stored in ThreadLocal stacks.
Usually stacks, because we have these nested things
like with tf.device, with tf.device, with tf.device,
so you'd like to be able to pop the stack
to go back to where you were.
And thread-local because it's very important to us
that the TensorFlow runtime itself be thread agnostic,
so that if you write two threads and one is doing
a reinforcement learning learner and the other's doing
an agent that's talking to some game, when the agent wants
to use its GPU, it shouldn't necessarily make the learner
use the GPU, and vice versa.
Providing some kind of isolation between the threads
is what we felt was the right way,
so that at least each single thread
can feel like it's its own single-threaded Python program.
We use this a lot in distribution strategies,
like MirroredStrategy uses a lot of threads under the hood,
so it's really important that things are thread-safe.
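As an illustration, the device scope bookkeeping boils down to something like the following minimal sketch (not the actual TensorFlow code; the names here are made up):

    #include <string>
    #include <vector>

    // Each thread gets its own stack of device scopes, so one thread's
    // `with tf.device(...)` blocks never affect another thread.
    thread_local std::vector<std::string> device_stack;

    void PushDevice(const std::string& name) { device_stack.push_back(name); }
    void PopDevice() { device_stack.pop_back(); }

    std::string CurrentDevice() {
      // Empty string means "let the runtime pick a device".
      return device_stack.empty() ? "" : device_stack.back();
    }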
The Python context is essentially
a thin wrapper around the C++ context,
which is available through the TensorFlow C API.
And this is the core thing.
It has a bunch more methods than just these--
like, you can do a lot more things
than just listing devices.
One thing I'd like to call out is that right now,
there are some things that are done in Python, like storing
whether we're in eager mode or graph mode,
and some things that are done in C++.
And that split isn't fixed-- the set of things done in Python
and the set of things done in C++ are likely to change.
I think as TensorFlow evolves, more and more things should
migrate from the Python context to the C++ context,
which will make things more language agnostic,
but also faster.
And you know, if everything in the context was in C++,
then all the generated Python code could have just been C++
code, and we'd be able to get out of the overhead of executing
Python much sooner and remove performance problems
in our APIs.
So once you know you're in eager mode,
we try to do this fast path execution.
Here, the fast path is some complicated C code
that mostly does the same things that the fallback case is
trying to do.
So I don't think it's necessarily worth reading that.
I would rather look at the simpler
code and the fallback path.
And it's here.
So this is where we actually implement the ReLU function.
And what are the interesting things here?
First, we have this function called args_to_matching_eager,
and what it does is it takes a bunch of tensors,
a bunch of things that can be tensors,
converts them all to tensors of the same dtype.
And in the case of ReLU, there is only one.
But in the case of any other ops, like Add or MatMul,
they take multiple inputs, some of which might be tensors,
others might be numpy.arrays or Python lists or variables
or other objects that you can convert to a tensor,
but they're not necessarily tensors.
And this is the thing that's responsible for canonicalizing
everything.
Then once we have--
but here you might want to stop me a little and ask,
what is a tensor at the level of a TensorFlow runtime?
And the eager tensor class is a thing
that's half implemented in Python
and half implemented in the Python C API,
but it's a relatively thin wrapper around this thing
that we call TensorHandle that you can see in our TensorFlow C
API.
And this TensorHandle represents a potentially yet-to-be-computed
tensor which is going to live on some device.
And we know what device it's going
to live on, because we know what device we executed
the operation that generates the tensor on.
And there are roughly three things
you can do to a TensorHandle, really.
You can ask about its shape and dtype and stuff-- that's one.
You can give it to execute--
that's another one.
And you can copy it to a device.
Also, I guess there are four things,
and the fourth thing is you can ask, hey, what is its value?
And when you want to force the evaluation of its value,
the TensorFlow runtime might have
to pause until it can give you the result.
You might need to copy the tensor from some remote device
and do all sorts of other things, but once you're done,
you get a TF_Tensor, which is essentially
just a pointer to some data.
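In terms of the C API, those handle operations look roughly like this (a sketch with error handling omitted; ctx and handle are assumed to be a TFE_Context* and a TFE_TensorHandle* you already have):

    TF_Status* status = TF_NewStatus();

    // 1. Ask about metadata.
    TF_DataType dtype = TFE_TensorHandleDataType(handle);
    int ndims = TFE_TensorHandleNumDims(handle, status);

    // 2. Feed it as an input to a TFE_Op for execution (shown a bit later).

    // 3. Copy it to another device (device name as the context knows it,
    //    possibly fully qualified).
    TFE_TensorHandle* copied =
        TFE_TensorHandleCopyToDevice(handle, ctx, "GPU:0", status);

    // 4. Force its value; this may block until the tensor is computed,
    //    and hands back a TF_Tensor, i.e. a pointer to concrete data.
    TF_Tensor* value = TFE_TensorHandleResolve(handle, status);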
But every other operation that is not
looking at the value of the tensor
is something that the runtime is free to delay, reorder,
as long as it respects the intended
semantics of your program.
And this is very important, because when
you're working with things like GPUs and TPUs,
you'd like to be able to dispatch operations that
are running the hardware as quickly as you can
in a way that's asynchronous.
If the Python thread can race ahead of the GPU
as it's executing operations, we can fully
utilize the GPU hardware and get the maximum performance.
And even in the eager runtime, we
can do this if the GPU kernels are sufficiently heavy.
There's a few cases in which we need to copy tensors,
even if they're local on the CPU.
And some of those cases are going away, like currently,
we copy string tensors every time we try to look
at their values because TensorFlow's internal string
representation is a complicated C++ class and we'd like to have
an API stable C type.
We're actually changing our internal string
representation-- there's an RFC you can look at about it--
that will make the internal and external representations
both be an API-stable C type.
But internally, what is a TensorHandle?
And this is, in fact, very, very similar
to the first implementation of TensorHandle.
Now it's like hundreds of lines of code spread
across many files, but all that you really
need in a TensorHandle is to know what device
it's in and what data you have.
And the data can either be some concrete tensor, a value that
has already been computed, or a future that
might be computed later because it's remote or asynchronous
or on some other device.
But the core of it is this--
this is the future that we hand around
in the representation.
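In the spirit of that first implementation, a TensorHandle is conceptually something like this (very much a sketch, not the current code):

    #include <future>
    #include <string>

    struct Tensor;  // the concrete data, defined elsewhere in the runtime

    // A handle is basically a device name plus a future for the data:
    // the data may already be there, or it may be computed later
    // (asynchronously, remotely, or on another device).
    struct TensorHandle {
      std::string device_name;
      std::shared_future<Tensor*> tensor;
    };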
The current code is much more complicated,
and you might want to look at it to see why it's complicated,
but the core idea is there.
So popping the stack.
You now know what a tensor is, and you've probably figured out
that converting that list of Python integers to a tensor is
not terribly hard-- there's C code that does that--
and now it's time for us to execute ReLU.
Great.
Here is one thing to note about how the TensorFlow
eager runtime works, which is that this line that
is selected there is not a noncontroversial choice.
Overall, there are two ways we could have gone about this.
We could have had a closed domain API, in which TensorFlow
would export a symbol called ReLU,
another called MatMul, another called Conv, et cetera.
Or we could have this open domain API
that we have here, where we have a symbol
called execute that takes the name of an operation ReLU
and a bunch of metadata about it.
There are advantages and disadvantages to both cases.
In general, the closed domain case,
where you just have a single endpoint
in your API for every operation you want to run--
that is easier to make fast.
However, it's a little tricky in TensorFlow
because the current--
the pre-existing graph runtime has a few layers
of indirection between a node in the graph and the actual kernel
that it executes.
And indeed, between TensorFlow versions,
we can, without breaking graph def compatibility,
replace some kernels.
Some things that were handled by a single kernel
now have to be handled by multiple kernels.
So to preserve this layer of indirection,
we felt like it was more natural to have this API that
is an open domain API.
However, as you can see, just by the fact
that there's a string in there, executing
this can only be so fast, because we
need to somehow take this string and these attributes
and some properties about these inputs
and turn that into a kernel.
And that means that we can definitely not
execute kernels any faster than it takes
to at least hash that string.
So you know, there are trade-offs here,
but we felt that preserving the flexibility
that you have in graph mode was the most important thing here.
So how do you actually use execute?
When you call this line in Python, what actually happens?
And execute's something that is defined in the TensorFlow C
API, and to use it, you do something like this.
You first create a status so that you
can find out if things failed, and then you make a new op.
You add inputs, you set attributes,
you allocate some memory to put pointers to the return values,
and then you call execute, and finally, you delete it all.
So this is fairly straightforward,
and if you're familiar with the TensorFlow C API for building
graphs, you'll see that this is very similar to that C API.
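A rough reconstruction of that sequence for ReLU (error checking elided; ctx is an existing TFE_Context* and input an existing TFE_TensorHandle*):

    TF_Status* status = TF_NewStatus();

    // Build an op by name, in open-domain style.
    TFE_Op* op = TFE_NewOp(ctx, "Relu", status);
    TFE_OpAddInput(op, input, status);
    TFE_OpSetAttrType(op, "T", TF_FLOAT);

    // Room for the single output handle.
    TFE_TensorHandle* retvals[1];
    int num_retvals = 1;
    TFE_Execute(op, retvals, &num_retvals, status);

    // Clean up; retvals[0] now holds the result handle.
    TFE_DeleteOp(op);
    TF_DeleteStatus(status);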
There's-- this is about as good as you could possibly get
for Python code in an open domain setup,
but it's a little sad here that when you build a TFE_Op,
you need to add the inputs to that op.
This means that if you're executing
an op in a tight loop, you either
have to have the same inputs on every iteration of the loop,
or you have to allocate a whole new TFE_Op each time.
For Python, we really don't have a way of making this better,
but for other languages like Swift or languages
that have access to our compiler,
we should be able to cache the static bits of making a TFE_Op
that don't really change--
like all your MatMuls are the same--
and separate that from the dynamic bits, which
are the inputs that actually change.
And if you do this, we could in principle
make this open domain
approach as fast as a closed domain approach.
And this is maybe a minor refactoring
that we should do at some point.
So what does execute do?
If you go through a few layers of APIs,
you end up on this function called EagerExecute,
and it has a lot of things here.
The first interesting one is this MaybeUpdateOpDevice,
which is, you might call it, the placer,
where we get to decide where we're going
to execute each operation.
This will have some complicated heuristics.
In general, you can think of it as: if you're
operating on a resource tensor, we'll
run your operation on the device that
has that resource, because anything else would fail.
Otherwise, if you have a tf.device annotation somewhere,
we'll run it there.
Otherwise, if you don't have any of those things,
we'll see what devices are available to execute
this operation and run on whatever device
we think is going to be the fastest, which
is how TensorFlow gets away with using a GPU for you
even if you don't specify with tf.device GPU in there.
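That priority order is roughly the following (a hypothetical, simplified sketch -- these are not the real types or function names in the runtime):

    #include <string>
    #include <vector>

    enum DataType { DT_FLOAT, DT_RESOURCE };

    struct Input {
      DataType dtype;
      std::string device;  // where the resource lives, if it is one
    };

    struct Op {
      std::vector<Input> inputs;
      std::string requested_device;  // from `with tf.device(...)`, may be empty
    };

    std::string PickDevice(const Op& op, const std::string& fastest_device) {
      // 1. Ops touching a resource must run where that resource lives.
      for (const Input& in : op.inputs)
        if (in.dtype == DT_RESOURCE) return in.device;
      // 2. An explicit device annotation is respected.
      if (!op.requested_device.empty()) return op.requested_device;
      // 3. Otherwise run on whatever available device the runtime thinks
      //    will be fastest -- typically the GPU, if one is present.
      return fastest_device;
    }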
And then you have some forks in there about are we local,
are we remote?
And once you do know that you're in the local case, what we want
to do is very quickly do that string processing
that we needed to do to find what
kernel we should be executing.
And there is fast code that takes the attributes
and the name of an op and gives you
a cache key that is looked up in a cache in the context
where we store the kernels.
And here you might think something's a little funny,
because usually you think of operations as functions,
not as classes, but clearly there
is a KernelAndDevice class in there,
so we probably have an object.
And the reason is that for many types
of kernels that we want to execute,
especially things involving GPUs but also some stateful kernels,
you want to keep some state in that object.
Ideally, that state would be managed by the runtime
via resource managers,
but that doesn't always happen now.
And once you have a kernel, the kernel
tells us on what devices it wants its inputs on.
You would think that the kernel would want its inputs
all in a device that is executing,
but that turns out to be too narrow a view.
For some kernels, especially GPU kernels,
you might want some inputs on the host
CPU that's attached to the GPU.
And the reason is that imagine you're
a GPU kernel for generating a big matrix of random numbers.
You'd like to know how many numbers you're
going to generate so that you can run the memory
allocation before you enqueue your CUDA kernel.
So if the shape of the random number
vector you're going to generate is on the GPU,
you'd need to fetch it back to the CPU to do that allocation.
That would be terribly slow.
So instead TensorFlow says, I expect that input
to be on the local CPU, and this is the function
that validates this.
But in this case, if you're also trying
to run a large convolution and one of the inputs
is in the CPU, this maybe copy will move that input
to the GPU for you, which is faster than just running
your convolution on the CPU.
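For reference, a GPU kernel declares this expectation with a HostMemory constraint in its registration. Roughly (a hedged sketch -- the kernel class name here is a placeholder, not the real one):

    // The GPU kernel for the random-normal op declares that its "shape"
    // input must live in host memory, even though the kernel runs on GPU.
    REGISTER_KERNEL_BUILDER(Name("RandomStandardNormal")
                                .Device(DEVICE_GPU)
                                .HostMemory("shape")
                                .TypeConstraint<int32>("T")
                                .TypeConstraint<float>("dtype"),
                            RandomNormalGpuKernel /* placeholder class name */);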
And then finally, we get to decide whether we're
in sync or async mode, where we first create this node that
represents all the computation that has to happen
to execute this kernel.
If we're in async mode, we throw this in a queue and return
control immediately.
If we're in sync mode, we run it now.
This async/sync here is complicated
because it's another layer of asynchrony that's
separate from the fact that our GPU runtime is itself
asynchronous.
This is kind of a patch to make the TensorFlow CPU
runtime, which is currently synchronous,
act asynchronously to try to get a little more performance
in the eager mode.
It's a little sad, because you lose your error messages
once you get very asynchronous there,
and we currently do not run shape_inference
in this asynchronous mode.
I think as we rework the TensorFlow runtime, which
the team has a large effort to do now,
we have a chance to fix this and have a single code
path for synchronous and asynchronous.
But for now, we have this patch.
And then finally, now we have the kernel.
We've got to call it.
That's easy, right?
So to call a kernel, we have to make an OpKernelContext,
and to do that, you need to fill this giant struct, which
I put here--
which you can clearly read in this light,
because it definitely fits on the slide in a very large
and readable font.
So we don't do that.
This is sadly something that-- the original TensorFlow
API for kernels had only one caller, which was a TensorFlow
executor, so it was very easy to just add parameters and make
the calling convention harder and harder,
because there was only one place to fix.
We're now trying to trim this back and simplify it,
so it will likely get better.
But for the eager runtime, we have this class KernelAndDevice
that knows how to call a kernel, requiring a lot fewer things
about it.
Mostly all it needs is the inputs,
a place for you to populate with outputs,
and some information in case you want to profile--
things like how long it takes to execute each node
or do a step, or what graphs you're
using if you're executing a function, and things like that.
So now that we have this, we can run the kernel.
So what the kernels look like--
ReLU happens to have one of the simpler kernels
we have in TensorFlow.
It's a UnaryElementWiseOp.
We have a base class for this that
handles a lot of the logic around memory
allocation, buffer reuse, so that tf.relu by default
will reuse its input buffer if the TensorFlow runtime knows
that no other op yet to be executed is going to need this.
But once all the boilerplate is dealt with,
all that this kernel has to do is execute the functor.
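Paraphrasing what that kernel looks like (a sketch close to relu_op.h, though details may differ):

    template <typename Device, typename T>
    class ReluOp : public UnaryElementWiseOp<T, ReluOp<Device, T>> {
     public:
      using UnaryElementWiseOp<T, ReluOp<Device, T>>::UnaryElementWiseOp;

      // The base class has already dealt with output allocation (possibly
      // forwarding the input buffer); all that's left is the functor.
      void Operate(OpKernelContext* context, const Tensor& input,
                   Tensor* output) {
        functor::Relu<Device, T> functor;
        functor(context->eigen_device<Device>(), input.flat<T>(),
                output->flat<T>());
      }
    };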
And this is another place where you'd be surprised
that we use an object where you think we should use a function,
because in principle, relu is just a function.
It doesn't keep any state.
There should be no reason for us to make an object for it.
Except C++ does not let you declare a templated function
in the header file, but define it in the C++ file.
And this is something very useful for us
because, as you can see, the device is a template parameter in there,
and one of those devices is the GPU device.
And for GPU devices, we'd like to put the function in the file
that we're going to compile with the CUDA compiler.
And we would like to not compile our entire code
base with the CUDA compiler.
So being able to define this functor in a single place--
where we can declare a whole specialization of this class
without any access to the CUDA compiler,
and have a file on the side that's
just going to fill in that implementation
when it's run through the CUDA compiler--
is very useful.
And as you can also see here, TensorFlow
kernels-- they tend to be highly templated.
Most are templated like this one, based on the device
that it's going to execute on and on the dtypes.
So they will generate fast, specialized code
for every core numerical type supported by TensorFlow, which
is an incentive to keep the set of core numerical types
supported by TensorFlow relatively small,
as otherwise the binary size would grow.
But this has the nice side effect
that the code generated is very fast.
And one of the things that makes us generate
very fast code for this, which you will see if you look
into the implementation of the functors,
is that we can use Eigen to generate this code for us.
So the ReLU functor in particular
is very easy to write, because it's
just a coefficient-wise max between your input tensor and 0.
And Eigen turns out to be a very useful tool to write this.
It lets us write this code once.
It will generate specializations for every dtype
we are interested in and also for CPUs and GPUs.
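The functor itself is approximately this (a sketch based on relu_op_functor.h):

    // Written once with Eigen; specialized per device and per dtype.
    template <typename Device, typename T>
    struct Relu {
      void operator()(const Device& d,
                      typename TTypes<T>::ConstTensor features,
                      typename TTypes<T>::Tensor activations) {
        // Coefficient-wise max against zero, evaluated on device d.
        activations.device(d) = features.cwiseMax(static_cast<T>(0));
      }
    };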
For this particular operation, you could probably
write it in fast assembly language yourself,
or SSE intrinsics or something like that,
but for more complicated operations,
like softmax and others that might
have interesting intermediate values that need computing,
being able to just have this code be generated
for you instead of requiring that you write all the device
specific and type specific things, can save a lot of time.
Also, Eigen at its core has a very, very fast GEMM,
which is the core basic MatMul that is inside most
of our very expensive kernels.
It ends up being a very large asset
in making TensorFlow go fast.
So that was it, really.
It only took us what, 20 minutes to get through executing ReLU?
I think TensorFlow can do it a little bit faster than that.
[LAUGHTER]
But in general, this is kind of like what the stack of things
looks like as you're executing operations eagerly.
Of course, if you've been following
this for a little bit, you should
know that we can do better than executing operations eagerly
in TensorFlow.
We have tf.function and we have graphs and other things that
can get you a lot more performance by instead
of going down and up that stack for every single operation,
going down once and executing a lot of operations.
Also, this ties into optimizations and stuff.
So how do we run functions?
And so-- yes?
AUDIENCE: Previously, you mentioned briefly
about the async mode.
So is that something that is user configurable?
Because there's like context async, [INAUDIBLE]..
ALEX PASSOS: I don't remember right now if there's
a public API to make it user configurable or not,
but there is an internal API to make it user configurable.
AUDIENCE: I believe that there was-- in enable_eager_execution,
you could set it.
So I think you could set it in v1,
but it might not be exposed directly in v2.
ALEX PASSOS: Yes.
You-- Taylor is correct.
I think it-- I know how to expose it in v1
and do not know how to expose it in v2,
but there's probably a way.
Or maybe there's not a way yet because we're still
treating it as experimental.
I'm not sure.
Regardless, the way it is now, I don't think it's
something we should rely on in the long run.
So I'd rather be able to iterate on it
a little longer until we start recommending it
as a way for people to get performance.
AUDIENCE: And what is the difference between eager
slow and fast path?
ALEX PASSOS: It's mostly that the fast path
special-cases some types of tensor conversion
into fast C code.
So if you pass a list of Python floating point numbers,
we can convert that to an eager tensor
without hitting any Python code, and that
will save you quite some time.
OK, so most of what we saw just now
for the case of executing a single op
also applies to executing a function.
So a function itself is an op named PartitionedCall,
and you will execute that op--
like the tf.function internals will execute that op--
just like how we just saw how to execute ReLU.
And so the first half, until you get to the KernelAndDevice
run bit, is all the same.
It's just that that kernel implementation
is particularly interesting.
Function calls in TensorFlow look relatively simple.
We have inputs, we have outputs, we
have attributes that tell TensorFlow what types of inputs
we have, what types of outputs we have,
and we have a function.
There are some things in there that
seem a little tricky, like there's
all sorts of configuration we can pass.
I actually forgot what's the difference between config
and config_proto in there.
But essentially, this is the big entry point
to executing functions in TensorFlow.
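From memory, the registration of that op looks roughly like this (a sketch; the shape function and any newer attributes are omitted):

    REGISTER_OP("PartitionedCall")
        .Input("args: Tin")
        .Output("output: Tout")
        .Attr("Tin: list(type) >= 0")
        .Attr("Tout: list(type) >= 0")
        .Attr("f: func")                   // the function to call
        .Attr("config: string = ''")
        .Attr("config_proto: string = ''")
        .Attr("executor_type: string = ''");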
But if you go look at the kernel of this op, what
you'll find is that it mostly just forwards things
to the FunctionLibraryRuntime.
And the FunctionLibraryRuntime is this core bit of TensorFlow
that knows about functions.
It can do things like Instantiate and Run,
pretty much.
And also CreateKernel, which you usually
do between Instantiate and Run,
since the FunctionLibraryRuntime also knows how to execute
operations that are not functions.
So what does instantiate mean?
Instantiate mostly runs all the graph
optimizations that we might want to run on that function--
to take code that you enjoyed writing
and turn it into code that the executor will
execute very quickly.
Most of this processing happens in the
ProcessFunctionLibraryRuntime's multi-device Instantiate call,
where we run all sorts of graph transformations.
If you have the tf-xla bridge enabled,
it will run the transformations related to the tf-xla bridge.
It will run the TensorFlow placer.
What the TensorFlow placer does is it takes a graph--
in this case, a function graph--
that has devices assigned to some of the nodes,
and it spits out another graph that has devices
assigned to all of the nodes.
It does this by following a very similar algorithm to the one
that I described earlier for individual ops.
So if you have a resource, it will place that op
next to the resource.
Otherwise, if you have specified a device,
it will respect that device; even if you only partially
specified a device, it will respect that partial device
specification.
And finally, we'll group things by colocation group.
The TensorFlow graph language allows
you to specify colocations, even though these
have very non-intuitive consequences, because
by colocating a node A with a node B,
you can actually move where node B
is placed because the placer is not
aware of the direction of the colocation arrows.
It just groups all the colocated nodes into a bag
and finds a device that can execute all ops in there.
So this can have very fun consequences,
like a bug I helped fix a while back
where, if you try to speed up your distributed system by always
colocating the variable initialization
code with the remote device the variable is on,
you can accidentally say, please always colocate
the initializers of GPU variables
on the GPU, which can be trouble.
If some of your initializers have operations that cannot run
on the GPU, you now have silently moved your variables
to the CPU, which is probably quite a performance
degradation.
So it's very subtle and in TF2, we're
trying to move away from using colocation constraints
inside TensorFlow and we're definitely moving away
from encouraging people to use colocation constraints
outside TensorFlow.
I would rather you be more specific about what devices
you're using, or even better, use something
like a distribution strategy that
is aware of all these bugs and colocations
and can work around them for you instead of trying to replicate
this functionality yourself.
And once you have placed the graph,
we can run the partitioner, which takes a graph that
has nodes for many devices and returns many graphs,
all of which have nodes on a single device only.
And to do that, if there was any edge that went from one device
to another device, that gets replaced with a pair of sends
and receives.
This is also where we run Grappler and all the function
inlining and optimization passes that the last training session
with Eugene was covering.
But yeah.
So this does a lot of heavy lifting,
and indeed, the PartitionedCall op
puts a cache in front of Instantiate
to make sure that we don't call it twice
for a single function, because otherwise we'd be very slow.
And once we've instantiated the function,
we can go to the other main method
in the FunctionLibraryRuntime and run it.
So in general, as you can tell by the name
PartitionedCall for the op, our functions
can be on multiple devices.
And at this point, at least the code
has simplified enough in there--
this is actually snippeted from the core runtime,
even though the core runtime has a lot more error handling
going on--
that all we have to do to run a function on multiple devices
is to just run N functions, each on a single device,
and trust that they all know how to talk to each other
to make the sends and receives happen.
So there's something called a rendezvous--
and if you read the TensorFlow code base,
you will see lots of references to it-- that's
responsible for making the sends and receives find
each other.
We'll have rendezvous that know how to deal with single host
and with multi host.
And there are lots of tricky and interesting bits
around what the correct lifetime of a rendezvous is,
and how you shut down a rendezvous once you want
to shut down a TensorFlow computation because maybe
some kernels failed.
And you know, some kernels failed
and something is blocking on receiving a tensor.
They're never going to get that tensor.
So we probably need to shut that operation down gracefully.
And there's a lot of cancellation related
logic in that.
But mostly, at the level of the FunctionLibraryRuntime,
you can just run your N functions, one per device,
and forget about it.
And running a function on a single device
mostly consists of bringing up the TensorFlow executor
and calling Run on it.
And you'll see that you have things named RunAsync and done
callbacks.
In general, we treat all of these things as asynchronous
so that we can release the calling thread as quickly as we
can so that you can keep on running more computation on it.
Especially if you have nested function calls--
treating these things asynchronously
is quite the performance improvement.
And here, I could dig into the TensorFlow executor,
but that code is fairly complex-- there is a simple core
algorithm, but it's really hard to pull it out of there.
And I think the main reason why it's
hard to pull it out of there is that the executor grew together
with the implementation of control flow in TensorFlow,
and specifically, the details of how we had to implement
while_loop kind of obscured
a lot of the core functionality of the executor.
It's now aware of frames and a lot of other complicated
things, and you have multidimensional pending
counts.
So I'm not going to snippet that code,
but I'll say that if you want to read it, go for it.
It's very interesting-- like, highly asynchronous,
highly parallel interpreter.
But I'll just give you some of the highlights
of what's happening in there.
And its input is a large bag of nodes and there is no output.
Anything that you want to get out of it,
you get out of it through a send or a receive.
I mean, there's-- technically there are outputs,
but by far most outputs in the common case of TensorFlow are
handled through sends and receives.
And the end state for the executor
is that all nodes must have executed or an error has happened.
And inside TensorFlow, the core runtime,
we have no error recovery.
So any error will shut everything down.
This is sometimes unfortunate, because some parts
of the TensorFlow's higher level APIs
rely on errors for the common path.
For example, tf.data raises an error
once it reaches the end of a stream, which
means that you can't really easily have a single graph that
exhausts a single iterator, does some things, and then runs
another iterator.
Because by the time you've exhausted the first iterator,
an error is raised and TensorFlow will shut everything
down.
There are ways of interacting with tf.data that do not
involve using the iterator GetNext op, which can fail,
and we use those inside AutoGraph to make it easier
for you to write code that can recover from these failures
and--
well, not recover from these failures--
it will see no failures when iterating
over multiple iterators.
It's quite nice.
tf.data has all these cool little combinators
like [INAUDIBLE] and reduce and you can thread together
like three of those to simulate a while_loop with breaks.
But anyway, popping the stack here,
the core algorithm of the executor
is while there are some nodes that haven't been executed
and no errors have happened, execute a viable node,
and once that node finishes executing,
you mark all of its output tensors as ready.
And there's some bookkeeping in there
that once you mark a tensor as ready,
you look at what ops are going to be made executable
by marking the tensor as ready, which
marks other nodes as viable.
And this just, like, recursively applies.
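A toy, single-threaded version of that loop looks like this (the Node type and everything else here is made up for illustration; the real executor is highly parallel and interleaved with control-flow bookkeeping):

    #include <deque>
    #include <functional>
    #include <vector>

    struct Node {
      std::function<bool()> run;     // runs the kernel; returns false on error
      int pending_inputs = 0;        // inputs not yet ready
      std::vector<Node*> consumers;  // nodes waiting on this node's outputs
    };

    void RunGraph(const std::vector<Node*>& nodes) {
      std::deque<Node*> ready;
      for (Node* n : nodes)
        if (n->pending_inputs == 0) ready.push_back(n);

      while (!ready.empty()) {
        Node* n = ready.front();
        ready.pop_front();
        if (!n->run()) return;  // any error shuts everything down
        // Marking this node's outputs as ready may make consumers viable.
        for (Node* c : n->consumers)
          if (--c->pending_inputs == 0) ready.push_back(c);
      }
    }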
The executor itself is not a single thing.
It runs on every thread that is executing an op as soon as
that op finishes executing, and it dispatches all the execution
to another threadpool.
It's kind of surprising that this is happening,
because this means that TensorFlow cannot really be run
on a single thread.
But there are some interesting, noteworthy things
about the executor that you might have
guessed from my comments so far, but they require some thinking,
I think.
One is that the executor is greedy, not lazy.
And if you're familiar with TensorFlow, it looks very lazy,
but it mostly looks lazy because we do very aggressive graph
pruning.
And once you start putting control flow and function
execution and a few other things in the executor,
it actually pays off to have a mental model that
says first pruning happens and then
greedy execution happens.
Otherwise, you can trick yourself
into thinking that some things are not
going to be executed when they are, in fact, going
to be executed.
Like, my favorite is if you have a conditional
and one of the branches of the conditional
depends on a value that is not in the conditional,
that value is unconditionally executed even if that branch is
never executed.
Which, if the executor were lazy,
that would not be the case.
But the executor being greedy also
makes it easier for you to be able to reason
about stateful operations, which is very nice, given
that those exist.
Another thing is that this executor
only looks at very local information.
Like, the only bit it has for each node is whether it's ready
or not.
So often, there is nothing preventing it
from choosing very suboptimal ordering of things to do.
Like if you need to fetch a lot of tensors
from a parameter server, the executor
is just as likely to fetch the first layer's tensor
as it is likely to fetch the last layer's tensor,
because none of these things have any dependencies on them.
And it can be quite tricky to teach TensorFlow
to actually choose the optimal ordering of things to execute.
And as I was saying earlier, with this executor,
there is no single executor thread.
It just runs on every thread as soon
as it finishes executing an op.
So it's kind of this highly parallel little monster.
So this is it for most of the core TF runtime.
I just had a few topics that I couldn't really
fit very well that I wanted to cover in this presentation,
just to generate some documentation.
One is host versus device memory.
As I hinted at earlier, when you partition a TensorFlow graph,
it takes a graph and it spits out N graphs,
one graph per device.
Each graph per device gets its own executor.
But how do you deal with the fact
that some GPU ops take CPU tensors?
So we make a distinction between--
when you specify an input to a kernel,
you can say that that kernel expects that input
to be in host memory or expects that input
to be in device memory.
And so in fact, the executor for a GPU device can be,
and most of the time is, running a lot of CPU operations
on CPU tensors, only it calls those tensors GPU tensors
in host memory.
And so if you look at the TensorFlow code,
you might sometimes see things like a distinction
between the device a tensor is on,
the device its memory is in, and the device its operation runs
on, and this bookkeeping is necessary to avoid
mixing these things up.
And incidentally, all resource handles are in host memory.
But this has a very sad, unintuitive consequence
that we need to fix, which I, and I think other people,
call just the "TF int32 problem"--
which is that most of the things
that GPU ops take in host memory are shape related--
things like fill and zeros and RandomNormal,
they all take a shape and they fill that shape with whatever
values you want.
But the shapes are not static.
They're often computed based on other shapes.
The simplest case is when you just use zeros or something
like that, where you take a tensor, take its shape,
and use that shape to fill in another tensor.
But sometimes you're going to reduce
some dimensions in the shape, broadcast,
do some other things, and TF has this rule where by default, it
will place every GPU-capable op on a GPU device.
And if you want finer grained control,
you just take a large block of code in TF
and you wrap it with a tf.device, which
also means that every op that happens inside a block of code
gets placed on that device.
So if you allowed TensorFlow to have
GPU kernels for int32 tensors, we
would keep bouncing these shapes between the GPU and the CPU.
So you take a shape and you want to slice it to, say,
remove one of the dimensions.
We would copy it to the GPU, remove that dimension,
then copy it back to the CPU and use
it to fill the RandomNormal, and that's just sad.
So what we did instead in TensorFlow
to kind of paper over this-- and this has been [INAUDIBLE]
because this would create
a lot of host-device transfers, and every time you have one,
you have to sync the GPU stream and you slow everything down--
so to avoid this, we say that for almost every op that
has a kernel registered on the GPU and has int32 inputs
and outputs, those int32 tensors are force-placed in host memory.
Including things like plus, gather, reductions,
and other things that you'd like to use as part of a model.
Currently, the work-around is to use int64 for values
that you actually want to get executed on the GPU
and use int32 only for your shapes.
We have a fix forthcoming.
We have code already that exists in both Grappler for graph mode
and the eager placer for eager mode
that uses some heuristics and estimates of the cost
of transfer versus cost of computation
to try to keep small integer computations on the CPU
where they belong instead of bouncing them
back and forth to the GPU.
But there's still some performance regressions
that prevent these things from being turned on by default.
We expect it to happen very soon.
So this is it.
If you have any questions?
AUDIENCE: Could you maybe comment
on the difference between the kernel cache and the caching
that we do for functions in the Python layer?
ALEX PASSOS: So the Python layer does caching for functions
from the--
like, here is a Python function and all this metadata
that's completely invisible to the runtime--
to a concrete function definition.
When you go execute that concrete function definition,
we emit our PartitionedCall op, our StatefulPartitionedCall op,
that then hits TFE_Execute.
And the first time the op has to execute,
there's a lot of initialization that has
to happen inside the runtime.
That initialization is mostly covered by--
well, a little bit of it is covered by the kernel cache, which mostly
tries to figure out what device it's going to be on
and the details of the attributes
and things like that.
And then the first time you actually execute it,
we hit the cache inside the FunctionLibraryRuntime, which
is the thing that guards Grappler
and all the other graph processing transformations
we have to do.
So it's a few layers of caches to make things fast.
I don't really know how we could possibly
merge these caches across these layers of the stack,
but maybe if we unify more things,
this is going to be possible.
AUDIENCE: And maybe it might be worth
also looking if you can comment a little bit about--
you said the kernel cache is something
that is very unique to the eager execution versus something
that we'd have in graph mode?
ALEX PASSOS: No, we already had the kernel cache in graph mode.
I mean, it has a slightly different kernel
cache in eager execution, but we already needed a kernel cache
in graph mode, too, because creating a kernel
might require allocating memory, which might be expensive.
It definitely requires allocating a vtable pointer.
But in some kernels, you have to allocate a lot more than that.
AUDIENCE: We just had some cases where
there was memory bloat because of the kernel cache.
Is there anything that users need to be aware of for that,
or is something that the runtime needs to--
ALEX PASSOS: Hopefully our users should not be aware of it.
There are some cases now where you have to be aware of it,
like if you're using the V1 random number generator ops,
the only way to reset the random seed
requires resetting the kernel cache, because the state
is kept inside the kernels.
As we move to the V2 random number ops,
we don't make this type of mistake,
and so the space overall taken by the kernel cache
should become much smaller.
I also think for the particular case of functions,
we should be able to garbage collect
that cache a little better.
Yeah.
OK.
Oh, yes.
AUDIENCE: So when XLA is enabled,
we don't have this shape int32 problems anymore, right?
ALEX PASSOS: XLA has a different notion of kernel
and computation, so by giving up the possibility
of having any dynamic shapes at all, it can effectively--
XLA only works when you can ConstantFold all the shape
tensors away, and then it doesn't
matter if you ConstantFolded them on the GPU
or ConstantFolded them on the CPU but they're not there.
This is only a problem when you have runtime dynamic shapes.
AUDIENCE: Can you comment on the garbage collection
in the runtime?
ALEX PASSOS: What part--
AUDIENCE: In terms of what is garbage collected
and what is that garbage--
ALEX PASSOS: Most of our runtime is ref counted, not
garbage collected.
AUDIENCE: Oh, ref counted.
ALEX PASSOS: Yeah.
If you-- there's a class in TensorFlow,
in core/refcount.h.
It's a base class for a ref counted pointer,
and we use that instead of SharedPtr
because it's a smaller memory footprint
and it has better cache locality behavior.
So you should be able to just read that and find
the places of the runtime that inherit from it,
and you can see the things which are ref counted.
AUDIENCE: But we currently have no garbage collection
for the kernel cache.
ALEX PASSOS: The kernel cache is not garbage collected, correct.
But almost everything that can be ref counted already
is ref counted.
A few things, like the kernel cache, are not, because it--
ref counting caches feels weird.
But in some cases, like when you're
caching the kernel for a function,
it actually makes sense to ref count it.
AUDIENCE: Can we put a limit on that kernel cache?
ALEX PASSOS: In principle we can do, yes.
It's, you know, memory versus performance tradeoff.
Assuming we are not dealing with the v1 random number ops,
because with those, if a kernel is evicted from the cache,
you now change the sequence of random numbers you would get,
and that's pretty bad.
OK, thank you.
[APPLAUSE]
[MUSIC PLAYING]