ALEX PASSOS: Hi, my name is Alex, and I'm here again, this time to talk about the TensorFlow eager execution runtime. This is a very broad topic and there are lots and lots of things we could cover, so I'm going to lightly graze many, many different parts of our code base. I'll give you a lot of function names and file names and things like that that you can use to familiarize yourself with things, but if you're in the room right now, by all means, ask questions. There's some buffer time at the end to account for variability here, and I think the more we can maximize shared understanding of this stuff, the better.

So the way I thought we could go about this is to do a very, very deep dive on what actually happens in TensorFlow, starting from TF2, when you type a very simple line of code, in this case a tf.nn.relu of some Python numbers. And I think if you were to start doing this, probably the first thing you'd do is grep the TensorFlow code base to find where we define ReLU. And if you do that, you will find that we have some definitions of ReLU in Keras, but you won't find a single definition of ReLU itself in core TensorFlow, and that might be a little surprising at first. It might put a damper on this whole let's-find-out-what-actually-happens-when-we-run-ReLU business, but the reason is that ReLU is simple enough that we implemented it in C++ and we didn't need to put a complicated Python API around it. We can just generate the Python code to call ReLU.

So the way it's actually defined, it's defined using the same mechanism we use to register all the ops in TensorFlow. For every core operation in TensorFlow that's visible to the runtime, we have a registration that looks like this: REGISTER_OP. It takes a name, and you say how many inputs, how many outputs, and what attributes it has. If you want to know more about attributes and what things are allowed in there: they are how we make our ops polymorphic, so the same operation can have different types of outputs, different numbers of outputs, and things like that. There is a lot of documentation about this on tensorflow.org, if you search for how to define a new op. Another interesting thing in there is that we also register a shape_inference function. ReLU, thankfully, is one of the simplest ops we have: it has one input and one output, and they have the same shape. So we can use a pre-built shape_inference function that just says the shape does not change. Other ops will have vastly more complicated shape_inference functions. And the nice thing is that we can run these functions offline when we're building graphs, without actually having the values of any tensors, and still be able to prove things about the shapes of intermediate tensors and outputs of your computation. This is maybe the best tool we have for catching bugs now. So if you want to look at the shape_inference code, that's where you'd hook into.

Now that we have that registration, we run some complicated code that generates Python code to actually call ReLU. If you look in bazel-genfiles, you will find a file named gen_nn_ops.py, and this file has the actual def relu that we call. As you can see, it's not pretty and there's a lot of stuff going on in there. The first line deals with dispatching, so that we can define ReLU not just for normal tensors, but also optionally for sparse tensors and ragged tensors and other composite types.
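(Before moving on to the second line of that generated code, here is a hedged sketch of roughly what the REGISTER_OP registration described above looks like. The exact attribute constraints in TensorFlow's nn_ops.cc may differ; treat this as illustrative rather than the verbatim registration.)

```cpp
#include "tensorflow/core/framework/common_shape_fns.h"
#include "tensorflow/core/framework/op.h"

namespace tensorflow {

// Sketch of an op registration: one input, one output, a dtype attribute,
// and a pre-built shape function that says the output shape equals the
// input shape.
REGISTER_OP("Relu")
    .Input("features: T")
    .Output("activations: T")
    .Attr("T: {half, float, double, int32}")  // illustrative type list
    .SetShapeFn(shape_inference::UnchangedShape);

}  // namespace tensorflow
```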
The second line has tf_export, and what this does is define the TensorFlow public API. Every symbol that you get when you are using TensorFlow as tf.something is defined somewhere by a tf_export decorator like this one. There will be a future video on how exactly this works and why we do things this way instead of relying on Python's normal, you know, namespacing mechanism. But you can probably guess that it's because TensorFlow is very complicated. But essentially, you'll see this, and this generated code for ReLU has a bunch of cases in it. There are roughly four: you have an eager fast path, an eager slow path, a graph mode path, and kind of a side hook for symbolic execution. But here, let's focus on the eager paths.

In the first one, the first thing that we're actually doing here is checking to see if we're in eager mode or not. And to do that, we look at this context thing. This context thing is part of the core of the TensorFlow v2 runtime. It's the moral equivalent of the session, but it's longer lived than the session and represents more things. So what is it? From Python, the context is this class that's defined in a file called context.py, and it's a collection of a lot of things that your Python program needs to be aware of to connect to the TensorFlow runtime. It stores things like, am I in eager mode or in graph mode? Or, if someone used a with tf.device decorator, what device am I supposed to be executing code on? And it stores things like what your name scope is, and many other things.

And all of this information that the context stores, in general-- the things that can change during the execution of a program-- is stored in thread-local stacks. Usually stacks, because we have these nested things like with tf.device, with tf.device, with tf.device, so you'd like to be able to pop the stack to go back to where you were. And thread-local because it's very important to us that the TensorFlow runtime itself be thread agnostic, so that if you write two threads, and one is doing a reinforcement learning learner and the other is doing an agent that's talking to some game, when the agent wants to use its GPU, it shouldn't necessarily make the learner use the GPU, and vice versa. Providing some kind of isolation between the threads is what we felt was the right approach, so that at least each single thread can feel like it's its own single-threaded Python program. We use this a lot in distribution strategies; MirroredStrategy uses a lot of threads under the hood, so it's really important that things are thread-safe.

The Python context, essentially, is mostly a wrapper around the C++ context, which is available through the TensorFlow C API. And this is the core thing. It has a bunch more methods than just these-- you can do a lot more things than just listing devices. One thing I'd like to call out is that right now, there are some things that are done in Python, like storing whether we're in eager mode or graph mode, and some things that are done in C++. And it's not set in stone-- the set of things that are done in Python and the set of things that are done in C++ are likely to change. I think as TensorFlow evolves, more and more things should migrate from the Python context to the C++ context, which will make things more language agnostic, but also faster.
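(Purely as an illustration of the thread-local stack pattern just described-- this is a toy, not TensorFlow's actual context classes, and all names here are made up for the sketch.)

```cpp
#include <string>
#include <vector>

// Toy illustration of per-thread nested device scopes. Each thread gets its
// own stack, so one thread's device choice never leaks into another's.
thread_local std::vector<std::string> device_stack;

class DeviceScope {  // hypothetical RAII helper, analogous to `with tf.device(...)`
 public:
  explicit DeviceScope(std::string device) {
    device_stack.push_back(std::move(device));
  }
  ~DeviceScope() { device_stack.pop_back(); }  // popping restores the outer scope
};

// The current device is whatever the innermost scope on this thread pushed.
std::string CurrentDevice() {
  return device_stack.empty() ? "" : device_stack.back();
}
```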
And you know, if everything in the context were in C++, then all the generated Python code could have just been C++ code, and we'd be able to get out of the overhead of executing Python much sooner and remove performance problems in our APIs.

So once you know you're in eager mode, we try to do this fast path execution. The fast path is some complicated C code that mostly does the same things the fallback case is trying to do, so I don't think it's necessarily worth reading that. I would rather look at the simpler code in the fallback path. And it's here. So this is where we actually implement the ReLU function. And what are the interesting things here? First, we have this function called args_to_matching_eager, and what it does is take a bunch of things that can be tensors and convert them all to tensors of the same dtype. In the case of ReLU, there is only one. But other ops, like Add or MatMul, take multiple inputs, some of which might be tensors, while others might be numpy arrays or Python lists or variables or other objects that you can convert to a tensor but that are not necessarily tensors. And this is the thing that's responsible for canonicalizing everything.

Then once we have-- but here you might want to stop me a little and ask, what is a tensor at the level of the TensorFlow runtime? The eager tensor class is a thing that's half implemented in Python and half implemented in C using the Python C API, but it's a relatively thin wrapper around this thing that we call a TensorHandle, which you can see in our TensorFlow C API. And this TensorHandle represents a potentially yet-to-be-computed tensor which is going to live on some device. And we know what device it's going to live on, because we know what device we executed the operation that generates the tensor on. And there are roughly three things you can do to a TensorHandle, really. You can ask about its shape and dtype and stuff-- that's one. You can give it to execute-- that's another one. And you can copy it to a device. Also, I guess there are four things, and the fourth thing is you can ask, hey, what is its value? And when you want to force the evaluation of its value, the TensorFlow runtime might have to pause until it can give you the result. You might need to copy the tensor from some remote device and do all sorts of other things, but once you're done, you get a TF_Tensor, which is essentially just a pointer to some data.

But every other operation that is not looking at the value of the tensor is something that the runtime is free to delay or reorder, as long as it respects the intended semantics of your program. And this is very important, because when you're working with things like GPUs and TPUs, you'd like to be able to dispatch operations that run on the hardware as quickly as you can, in a way that's asynchronous. If the Python thread can race ahead of the GPU as it's executing operations, we can fully utilize the GPU hardware and get the maximum performance. And even in the eager runtime, we can do this if the GPU kernels are sufficiently heavy. There are a few cases in which we need to copy tensors, even if they're local on the CPU. And some of those cases are going away. For example, currently we copy string tensors every time we try to look at their values, because TensorFlow's internal string representation is a complicated C++ class and we'd like to hand you an API-stable C type.
We're actually changing our internal string representation-- there's an RFC you can look at-- that will make the internal and external representations both be an API-stable C type.

But internally, what is a TensorHandle? And this is, in fact, very, very similar to the first implementation of TensorHandle. Now it's hundreds of lines of code spread across many files, but all that you really need in a TensorHandle is to know what device it's on and what data you have. And the data can either be some concrete tensor, a value that has already been computed, or a future that might be computed later because it's remote or asynchronous or on some other device. But the core of it is this: a future that we hand around in the representation. The current code is much more complicated, and you might want to look at it to see why it's complicated, but the core idea is there.

So, popping the stack. You now know what a tensor is, and you've probably figured out that converting that list of Python integers to a tensor is not terribly hard-- there's C code that does that-- and now it's time for us to execute ReLU. Great. Here is one thing to note about how the TensorFlow eager runtime works, which is that this line that is selected there is not an uncontroversial choice. Overall, there are two ways we could have gone about this. We could have had a closed domain API, in which TensorFlow would export a symbol called ReLU, another called MatMul, another called conv, et cetera. Or we could have this open domain API that we have here, where we have a symbol called execute that takes the name of an operation, ReLU, and a bunch of metadata about it. There are advantages and disadvantages to both. In general, the closed domain case, where you have a single endpoint in your API for every operation you want to run, is easier to make fast. However, it's a little tricky in TensorFlow because the pre-existing graph runtime has a few layers of indirection between a node in the graph and the actual kernel that it executes. And indeed, between TensorFlow versions, we can, without breaking GraphDef compatibility, replace some kernels. Some things that were handled by a single kernel now have to be handled by multiple kernels. So to preserve this layer of indirection, we felt it was more natural to have an open domain API. However, as you can see, just by the fact that there's a string in there, executing this can only be so fast, because we need to somehow take this string and these attributes and some properties of these inputs and turn that into a kernel. And that means that we can definitely not execute kernels any faster than it takes to at least hash that string. So you know, there are trade-offs here, but we felt that preserving the flexibility that you have in graph mode was the most important thing.

So how do you actually use execute? When you call this line in Python, what actually happens? Execute is something that is defined in the TensorFlow C API, and to use it, you do something like this. You first create a status so that you can find out if things failed, and then you make a new op. You add inputs, you set attributes, you allocate some memory to put pointers to the return values, and then you call execute, and finally, you delete it all. So this is fairly straightforward, and if you're familiar with the TensorFlow C API for building graphs, you'll see that this is very similar to that C API.
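(The slide isn't reproduced here, so as a rough reconstruction of that sequence, here is a hedged sketch using the eager C API from tensorflow/c/eager/c_api.h. It runs Relu on a small float tensor; error handling is abbreviated, and the exact boilerplate on the original slide may have differed. In real code you would check TF_GetCode(status) after each call.)

```cpp
#include <cstdint>
#include <cstring>

#include "tensorflow/c/c_api.h"
#include "tensorflow/c/eager/c_api.h"

void RunRelu() {
  TF_Status* status = TF_NewStatus();                // 1. status for error reporting
  TFE_ContextOptions* opts = TFE_NewContextOptions();
  TFE_Context* ctx = TFE_NewContext(opts, status);   // the eager context

  // Build an input tensor handle from a small float buffer.
  float values[2] = {-1.0f, 2.0f};
  int64_t dims[1] = {2};
  TF_Tensor* t = TF_AllocateTensor(TF_FLOAT, dims, 1, sizeof(values));
  std::memcpy(TF_TensorData(t), values, sizeof(values));
  TFE_TensorHandle* input = TFE_NewTensorHandle(t, status);

  TFE_Op* op = TFE_NewOp(ctx, "Relu", status);       // 2. make the op by name
  TFE_OpAddInput(op, input, status);                 // 3. add inputs
  TFE_OpSetAttrType(op, "T", TF_FLOAT);              // 4. set attributes

  TFE_TensorHandle* retvals[1];                      // 5. space for output handles
  int num_retvals = 1;
  TFE_Execute(op, retvals, &num_retvals, status);    // 6. execute

  // Force the value (this is the point where the runtime may have to wait).
  TF_Tensor* result = TFE_TensorHandleResolve(retvals[0], status);

  // 7. delete it all.
  TF_DeleteTensor(result);
  TFE_DeleteTensorHandle(retvals[0]);
  TFE_DeleteOp(op);
  TFE_DeleteTensorHandle(input);
  TF_DeleteTensor(t);
  TFE_DeleteContext(ctx);
  TFE_DeleteContextOptions(opts);
  TF_DeleteStatus(status);
}
```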
This is about as good as you could possibly get for Python code in an open domain setup, but it's a little sad here that when you build a TFE_Op, you need to add the inputs to that op. This means that if you're executing an op in a tight loop, you either have to have the same inputs on every iteration of the loop, or you have to allocate a whole other TFE_Op. For Python, we really don't have a way of making this better, but for other languages like Swift, or languages that have access to our compiler, we should be able to cache the static bits that don't really change-- like, all your MatMuls are the same-- and separate them from the dynamic bits, which are the inputs that actually change. And if you do this, we could in principle make this open domain approach as fast as a closed domain approach. And this is maybe a minor refactoring that we should do at some point.

So what does execute do? If you go through a few layers of APIs, you end up in this function called EagerExecute, and it has a lot of things in it. The first interesting one is this MaybeUpdateOpDevice, which is, you might call it, the placer, where we get to decide where we're going to execute each operation. This has some complicated heuristics. In general, you can think of it as: if you're operating on a resource tensor, we'll run your operation on the device that has that resource, because anything else will fail. Otherwise, if you have a tf.device annotation somewhere, we'll run it there. Otherwise, if you don't have any of those things, we'll see what devices are available to execute this operation and run on whatever device we think is going to be the fastest, which is how TensorFlow gets away with using a GPU for you even if you don't specify with tf.device GPU in there.

And then you have some forks in there about, are we local, are we remote? And once you do know that you're in the local case, what we want to do is very quickly do the string processing that we need to do to find what kernel we should be executing. There is fast code that takes the attributes and the name of an op and gives you a cache key, which is looked up in a kernel cache in the context, where we store the kernel. And here you might think something's a little funny, because usually you think of operations as functions, not as classes, but clearly there is a KernelAndDevice class in there, so we probably have an object. And the reason is that for many types of kernels that we want to execute, especially things involving GPUs but also some stateful kernels, you want to keep some state in that object. Ideally, that state would be managed by the runtime with resource managers, but that doesn't always happen now.

And once you have a kernel, the kernel tells us what devices it wants its inputs on. You would think that the kernel would want its inputs all on the device that is executing, but that turns out to be too narrow a view. For some kernels, especially GPU kernels, you might want some inputs on the host CPU that's attached to the GPU. And the reason is, imagine you're a GPU kernel for generating a big matrix of random numbers. You'd like to know how many numbers you're going to generate so that you can run the memory allocation before you enqueue your CUDA kernel.
So if you were to do that-- if the shape of the random number vector you're going to generate were on the GPU, you'd need to fetch it back to the CPU to do that allocation. That would be terribly slow. So instead TensorFlow says, I expect that input to be on the local CPU, and there's a function that validates this. But in this case, if you're also trying to run a large convolution and one of the inputs is on the CPU, this maybe-copy will move that input to the GPU for you, which is faster than just running your convolution on the CPU.

And then finally, we get to decide whether we're in sync or async mode, where we first create this node that represents all the computation that has to happen to execute this kernel. If we're in async mode, we throw this in a queue and return control immediately. If we're in sync mode, we run it now. This async/sync here is complicated because it's another layer of asynchrony that's separate from the fact that our GPU runtime is itself asynchronous. This is kind of a patch to make the TensorFlow CPU runtime, which is currently synchronous, act asynchronously to try to get a little more performance in eager mode. It's a little sad, because you lose your error messages once you get very asynchronous there, and we currently do not run shape_inference in this asynchronous mode. I think as we rework the TensorFlow runtime, which the team has a large effort to do now, we have a chance to fix this and have a single code path for synchronous and asynchronous. But for now, we have this patch.

And then finally, now we have the kernel. We've got to call it. That's easy, right? So to call a kernel, we have to make an OpKernelContext, and to do that, you need to fill this big struct, which I put here-- which you can clearly read on this slide, because it definitely fits in a very large and readable font. So we don't do that. This is sadly a case where the original TensorFlow API for kernels had only one caller, which was the TensorFlow executor, so it was very easy to just add parameters and make the calling convention harder and harder, because there was only one place to fix. We're now trying to trim this back and simplify it, so it will likely get better. But for the eager runtime, we have this class KernelAndDevice that knows how to call a kernel while requiring a lot fewer things about it. Mostly all it needs is the inputs, a place for you to populate with outputs, and some information in case you want to profile things, like how long it takes to execute each node, or what graphs you're using if you're executing a function, and things like that.

So now that we have this, we can run the kernel. So what do the kernels look like? ReLU happens to have one of the simpler kernels we have in TensorFlow. It's a UnaryElementWiseOp. We have a base class for this that handles a lot of the logic around memory allocation and buffer reuse, so that tf.relu by default will reuse its input buffer if the TensorFlow runtime knows that no other op yet to be executed is going to need it. But once all the boilerplate is dealt with, all that this kernel has to do is execute the functor. And this is another place where you'd be surprised that we use an object where you'd think we should use a function, because in principle, relu is just a function. It doesn't keep any state. There should be no reason for us to make an object for it. Except C++ does not let you declare a templated function in a header file but define it in the C++ file.
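(A hedged sketch of what such a functor looks like, in the spirit of TensorFlow's relu_op_functor.h but simplified and not copied from it: a struct templated on device and dtype whose operator() expresses the elementwise max once via Eigen.)

```cpp
#include <unsupported/Eigen/CXX11/Tensor>

// Simplified ReLU functor: templated on the device (CPU or GPU) and the dtype.
// The same expression compiles to specialized CPU loops or to GPU code,
// depending on which device type it is instantiated with.
template <typename Device, typename T>
struct ReluFunctor {
  void operator()(const Device& d,
                  const Eigen::Tensor<T, 1>& features,
                  Eigen::Tensor<T, 1>& activations) {
    // Coefficient-wise max with zero; .device(d) tells Eigen where to run it.
    activations.device(d) = features.cwiseMax(features.constant(T(0)));
  }
};
```

Instantiated with Eigen::DefaultDevice or Eigen::ThreadPoolDevice you get the CPU version; the GPU specialization lives in a separate file compiled with the CUDA compiler, which is the point made next.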
And this is something very useful for us, because as you can see, device is a parameter in there, and one of those devices is the GPU device. And for GPU devices, we'd like to put the function in the file that we're going to compile with the CUDA compiler, and we would like to not compile our entire code base with the CUDA compiler. So being able to define this functor in a single place, where we can declare a specialization of this class without access to the CUDA compiler, and have a file on the side that's just going to fill in this implementation after running the CUDA compiler, is very useful.

And as you can also see here, TensorFlow kernels tend to be highly templated. Most are templated like this one, on the device they're going to execute on and on the dtypes. So they will generate fast, specialized code for every core numerical type supported by TensorFlow, which is an incentive to keep the set of core numerical types supported by TensorFlow relatively small, as otherwise the binary size would grow. But this has the nice side effect that the generated code is very fast. And one of the things that lets us generate very fast code for this, which you will see if you look into the implementation of the functors, is that we can use Eigen to generate this code for us. The ReLU functor in particular is very easy to write, because it's just a coefficient-wise max between your input tensor and 0. And Eigen turns out to be a very useful tool for writing this. It lets us write this code once, and it will generate specializations for every dtype we are interested in, and also for CPUs and GPUs. For this particular operation, you could probably write it in fast assembly language yourself, or SSE intrinsics or something like that, but for more complicated operations, like softmax and others that might have interesting intermediate values that need computing, being able to just have this code generated for you, instead of requiring that you write all the device-specific and type-specific things, can save a lot of time. Also, Eigen at its core has a very, very fast GEMM, which is the basic MatMul that is inside most of our very expensive kernels. It ends up being a very large asset in making TensorFlow go fast.

So that was it, really. It only took us what, 20 minutes to get through executing ReLU? I think TensorFlow can do it a little bit faster than that.

[LAUGHTER]

But in general, this is kind of what the stack of things looks like as you're executing operations eagerly. Of course, if you've been following this for a little bit, you should know that we can do better than executing operations eagerly in TensorFlow. We have tf.function and we have graphs and other things that can get you a lot more performance by, instead of going down and up that stack for every single operation, going down once and executing a lot of operations. This also ties into optimizations and stuff. So how do we run functions? And so-- yes?

AUDIENCE: Previously, you mentioned briefly about the async mode. So is that something that is user configurable? Because there's like context async, [INAUDIBLE].

ALEX PASSOS: I don't remember right now if there's a public API to make it user configurable or not, but there is an internal API to make it user configurable.

AUDIENCE: I believe that there was-- in enable_eager_execution, you could set it. So I think you could set it in v1, but it may not be exposed directly in v2.

ALEX PASSOS: Yes. You-- Taylor is correct.
I think it-- I know how to expose it in v1 and do not know how to expose it in v2, but there's probably a way. Or maybe there's not a way yet, because we're still treating it as experimental. I'm not sure. Regardless, the way it is now, I don't think it's something we should rely on in the long run. So I'd rather be able to iterate on it a little longer before we start recommending it as a way for people to get performance.

AUDIENCE: And what is the difference between the eager slow and fast paths?

ALEX PASSOS: It's mostly that the fast path has special-cased some types of tensor conversion into fast C code. So if you pass a list of Python floating point numbers, we can convert that to an eager tensor without hitting any Python code, and that will save you quite some time.

OK, so most of what we just saw for the case of executing a single op also applies to executing a function. A function itself is an op named PartitionedCall, and you will execute that op-- like, the tf.function internals will execute that op-- just like how we just saw how to execute ReLU. And so the first half, until you get to the KernelAndDevice run bit, is all the same. It's just that that kernel implementation is particularly interesting. Function calls in TensorFlow look relatively simple. We have inputs, we have outputs, we have attributes that tell TensorFlow what types of inputs and what types of outputs we have, and we have a function. There are some things in there that seem a little tricky, like there's all sorts of configuration we can pass. I actually forget what the difference between config and config_proto in there is. But essentially, this is the big entry point to executing functions in TensorFlow.

But if you go look at the kernel of this op, what you'll find is that it mostly just forwards things to the FunctionLibraryRuntime. The FunctionLibraryRuntime is this core bit of TensorFlow that knows about functions. It can do things like instantiate and run, pretty much. And also create a kernel, which you usually do between instantiate and run-- the FunctionLibraryRuntime also knows how to execute operations that are not functions.

So what does instantiate mean? Instantiate mostly runs all the graph optimizations that we might want to run on that function, to take code that you enjoyed writing and turn it into code that the executor will execute very quickly. Most of this processing happens in the ProcessFunctionLibraryRuntime's multi-device Instantiate call, where we run all sorts of graph transformations. If you have the tf-xla bridge enabled, it will run the transformations related to the tf-xla bridge. It will run the TensorFlow placer. What the TensorFlow placer does is take a graph-- in this case, a function graph-- that has devices assigned to some of the nodes, and spit out another graph that has devices assigned to all of the nodes. It does this by following a very similar algorithm to the one I described earlier for individual ops. So if you have a resource, it will place the op next to that resource. Otherwise, if you have specified a device, it will respect that device; even if you only partially specified a device, it will respect that partial device specification. And finally, we'll group things by colocation group.
The TensorFlow graph language allows you to specify colocations, even though these have very non-intuitive consequences, because by colocating a node A with a node B, you can actually move where node B is placed, because the placer is not aware of the direction of the colocation arrows. It just groups all the colocated nodes into a bag and finds a device that can execute all the ops in there. So this can have very fun consequences, like a bug I helped fix a while back: you try to speed up your distributed system by always colocating the variable initialization code with the remote device the variable is on, and you can accidentally say, please always colocate the initializers of GPU variables on the GPU, which can be trouble. If some of your initializers have operations that cannot run on the GPU, you have now silently moved your variables to the CPU, which probably is quite a performance degradation. So it's very subtle, and in TF2 we're trying to move away from using colocation constraints inside TensorFlow, and we're definitely moving away from encouraging people to use colocation constraints outside TensorFlow. I would rather you be more specific about what devices you're using, or even better, use something like a distribution strategy that is aware of all these bugs and colocations and can work around them for you, instead of trying to replicate this functionality yourself.

And once you have placed the graph, we can run the partitioner, which takes a graph that has nodes for many devices and returns many graphs, all of which have nodes on a single device only. And to do that, if there was any edge that went from one device to another device, that edge gets replaced with a pair of sends and receives. This is also where we run Grappler and all the function inlining and optimization passes that the last training section, with Eugene, was covering. So this does a lot of heavy lifting, and indeed the PartitionedCall op puts a cache in front of Instantiate to make sure that we don't call it twice for a single function, because otherwise we'd be very slow.

And once we've instantiated the function, we can go to the other main method in the FunctionLibraryRuntime and run it. In general, as you can tell from the name PartitionedCall, our functions can be on multiple devices. And at this point, at least, the code has simplified enough-- this is actually snippeted from the core runtime, even though the core runtime has a lot more error handling going on-- that all we have to do to run a function on multiple devices is to run N functions, each on a single device, and trust that they all know how to talk to each other to make the sends and receives happen. So there's a thing called a rendezvous-- if you read the TensorFlow code base, you will see lots of references to it-- that's responsible for making the sends and receives find each other. We have rendezvous implementations that know how to deal with a single host and with multiple hosts. And there are lots of tricky and interesting bits in how they relate to what the correct lifetime of a rendezvous is, and how you shut down a rendezvous when you want to shut down a TensorFlow computation because maybe some kernel failed. If some kernel failed and something is blocking on receiving a tensor, it's never going to get that tensor, so we probably need to shut that operation down gracefully. And there's a lot of cancellation-related logic in that.
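(To illustrate the rendezvous concept only-- this is a toy, not TensorFlow's Rendezvous class-- the core idea is a keyed meeting point where a send provides a value and a receive waits for it, in either order:)

```cpp
#include <future>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Toy rendezvous: Send and Recv are matched purely by a string key, so the
// producer and consumer never need to know about each other directly.
class ToyRendezvous {
 public:
  void Send(const std::string& key, float value) {
    GetPromise(key).set_value(value);  // fulfill whoever is (or will be) waiting
  }
  float Recv(const std::string& key) {
    return GetFuture(key).get();       // blocks until the matching Send happens
  }

 private:
  std::promise<float>& GetPromise(const std::string& key) {
    std::lock_guard<std::mutex> lock(mu_);
    return entries_[key].first;
  }
  std::shared_future<float> GetFuture(const std::string& key) {
    std::lock_guard<std::mutex> lock(mu_);
    auto& entry = entries_[key];
    if (!entry.second.valid()) entry.second = entry.first.get_future().share();
    return entry.second;
  }

  std::mutex mu_;
  std::map<std::string, std::pair<std::promise<float>, std::shared_future<float>>> entries_;
};
```

In the real runtime the key encodes the devices and tensor name, the payload is a Tensor, and there is a lot of machinery for remote hosts and for cancellation, which is exactly the lifetime and shutdown complexity mentioned above.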
But mostly, at the level of the FunctionLibraryRuntime, you can just run your N functions, one per device, and forget about it. And running a function on a single device mostly consists of bringing up the TensorFlow executor and calling run on it. And you'll see that you have things named RunAsync and done callbacks. In general, we treat all of these things as asynchronous so that we can release the calling thread as quickly as we can, so that you can keep running more computation on it. Especially if you have nested function calls, treating these things asynchronously is quite the performance improvement.

And here, I could dig into the TensorFlow executor, but that code is fairly complex. It has a simple core algorithm, but it's really hard to pull it out of there. And I think the main reason it's hard to pull out is that the executor grew together with the implementation of control flow in TensorFlow, and specifically, the details we had to implement for while_loop kind of obscured a lot of the core functionality of the executor. It is now aware of frames and a lot of other complicated things, and you have multidimensional pending counts. So I'm not going to snippet that code, but I'll say that if you want to read it, go for it. It's very interesting-- a highly asynchronous, highly parallel interpreter. But I'll just give you some of the highlights of what's happening in there.

Its input is a large bag of nodes and there is no output. Anything that you want to get out of it, you get out of it through a send or a receive. I mean, technically there are outputs, but by far most outputs in the common case of TensorFlow are handled through sends and receives. And the end state for the executor is that all nodes have executed or an error has happened. And inside the core TensorFlow runtime, we have no error recovery, so any error will shut everything down. This is sometimes unfortunate, because some parts of TensorFlow's higher level APIs rely on errors for the common path. For example, tf.data raises an error once it reaches the end of a stream, which means that you can't really easily have a single graph that exhausts a single iterator, does some things, and then runs another iterator. Because by the time you've exhausted the first iterator, an error is raised and TensorFlow will shut everything down. There are ways of interacting with tf.data that do not involve using the iterator GetNext op, which can fail, and we use those inside AutoGraph to make it easier for you to write code that can recover from these failures-- well, not recover from these failures; it will see no failures when iterating over multiple iterators. It's quite nice. tf.data has all these cool little combinators like [INAUDIBLE] and reduce, and you can thread together like three of those to simulate a while_loop with breaks.

But anyway, popping the stack here, the core algorithm of the executor is: while there are nodes that haven't been executed and no errors have happened, execute a viable node, and once that node finishes executing, mark all of its output tensors as ready. And there's some bookkeeping in there: once you mark a tensor as ready, you look at what ops are going to be made executable by marking that tensor as ready, which marks other nodes as viable. And this just recursively applies.
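(A toy sketch of that greedy ready-list algorithm, with none of the real executor's frames, threading, or error handling; the node structure and field names here are made up for the illustration:)

```cpp
#include <functional>
#include <queue>
#include <vector>

// Toy node: how many inputs are still not ready, who consumes my output,
// and the actual work to run.
struct ToyNode {
  int pending_inputs = 0;               // like the executor's pending counts
  std::vector<int> consumers;           // indices of nodes fed by this node
  std::function<void()> run;            // the kernel invocation
};

// Greedy execution: start from the nodes with no pending inputs and keep
// running whatever becomes viable, until everything has executed.
void RunToyExecutor(std::vector<ToyNode>& nodes) {
  std::queue<int> ready;
  for (int i = 0; i < static_cast<int>(nodes.size()); ++i) {
    if (nodes[i].pending_inputs == 0) ready.push(i);
  }
  while (!ready.empty()) {
    int id = ready.front();
    ready.pop();
    nodes[id].run();                    // execute a viable node
    for (int consumer : nodes[id].consumers) {
      // Marking this node's outputs as ready may make consumers viable.
      if (--nodes[consumer].pending_inputs == 0) ready.push(consumer);
    }
  }
}
```

The real executor does this bookkeeping on whatever thread just finished running a kernel and pushes the actual work onto a threadpool, which is the point made next.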
The executor itself is not a single thing. It runs on every thread that is executing an op, as soon as that op finishes executing, and it dispatches all the execution to another threadpool. It's kind of surprising that this is happening, because this means that TensorFlow cannot really be run on a single thread.

But there are some interesting, noteworthy things about the executor that you might have guessed from my comments so far, but that require some thinking, I think. One is that the executor is greedy, not lazy. If you're familiar with TensorFlow, it looks very lazy, but it mostly looks lazy because we do very aggressive graph pruning. And once you start putting control flow and function execution and a few other things in the executor, it actually pays off to have a mental model that says first pruning happens and then greedy execution happens. Otherwise, you can trick yourself into thinking that some things are not going to be executed when they are, in fact, going to be executed. My favorite is: if you have a conditional and one of the branches of the conditional depends on a value that is not in the conditional, that value is unconditionally executed even if that branch is never executed. If the executor were lazy, that would not be the case. But the executor being greedy also makes it easier for you to reason about stateful operations, which is very nice, given that those exist.

Another thing is that this executor only looks at very local information. The only bit it has for each node is whether it's ready or not. So often there is nothing preventing it from choosing a very suboptimal ordering of things to do. If you need to fetch a lot of tensors from a parameter server, the executor is just as likely to fetch the first layer's tensor as it is to fetch the last layer's tensor, because none of these things have any dependencies on them. And it can be quite tricky to teach TensorFlow to actually choose the optimal ordering of things to execute. And as I was saying earlier, there is no single executor thread. It just runs on every thread as soon as that thread finishes executing an op. So it's kind of this highly parallel little monster.

So this is it for most of the core TF runtime. I just had a few topics that I couldn't really fit very well that I wanted to cover in this presentation, just to generate some documentation. One is host versus device memory. As I hinted at earlier, when you partition a TensorFlow graph, it takes a graph and it spits out N graphs, one graph per device. Each per-device graph gets its own executor. But how do you deal with the fact that some GPU ops take CPU tensors? So we make a distinction: when you specify an input to a kernel, you can say that the kernel expects that input to be in host memory or expects that input to be in device memory. And so in fact, the executors for the GPU device can be, and most of the time are, running a lot of CPU operations on CPU tensors, only they call those tensors GPU tensors in host memory. So if you look at the TensorFlow code, you might sometimes see a distinction between the device a tensor is on, the device its memory is on, and the device its operation is on, and this bookkeeping is necessary to avoid mixing these things up. And incidentally, all resource handles are in host memory.
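(This host-versus-device distinction is declared when the kernel is registered. A hedged sketch of what such a registration can look like, using a hypothetical fill-style op and kernel class so as not to misquote the real ones; the HostMemory annotation is the mechanism that says "this input lives in host memory even on the GPU device.")

```cpp
#include "tensorflow/core/framework/op_kernel.h"

namespace tensorflow {

// Hypothetical GPU kernel whose "dims" input (a shape) must stay in host
// memory so the kernel can size its output before launching GPU work.
class MyFillOp : public OpKernel {
 public:
  explicit MyFillOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
  void Compute(OpKernelContext* ctx) override {
    const Tensor& dims = ctx->input(0);  // host-memory tensor: readable immediately
    // ... allocate the output from `dims`, then enqueue the GPU work ...
    (void)dims;
  }
};

// Even though the kernel is registered for DEVICE_GPU, its "dims" input is
// pinned to host memory by the HostMemory annotation.
REGISTER_KERNEL_BUILDER(Name("MyFill")
                            .Device(DEVICE_GPU)
                            .TypeConstraint<float>("T")
                            .HostMemory("dims"),
                        MyFillOp);

}  // namespace tensorflow
```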
But this has a very sad, unintuitive consequence that we need to fix, which I, and I think other people, call the "TF int32 problem," which is that most of the things that GPU ops take in host memory are shape related. Things like fill and zeros and RandomNormal all take a shape and fill that shape with whatever values you want. But the shapes are not static. They're often computed based on other shapes. The simplest case is when you just use zeros or something like that, where you take a tensor, take its shape, and use that shape to fill in another tensor. But sometimes you're going to reduce some dimensions in the shape, broadcast, or do some other things. And TF has this rule where, by default, it will place every GPU-capable op on a GPU device. And if you want finer grained control, you just take a large block of code in TF and wrap it with a tf.device, which also means that every op that happens inside that block of code gets placed on that device. So if we allowed TensorFlow to have GPU kernels for int32 tensors, we would keep bouncing these shapes between the GPU and the CPU. You take a shape and you want to slice it to, say, remove the batch dimension. We would copy it to the GPU, remove the batch dimension, then copy it back to the CPU and use it to fill the RandomNormal, and that's just sad.

So what we did instead in TensorFlow to kind of paper over this-- and this has been [INAUDIBLE] because this would create a lot of host-device transfers, and every time you have one, you have to sync the GPU stream and you slow everything down-- so to avoid this, we say that for almost every op that has a kernel registered on the GPU with int32 inputs and outputs, those are force-placed in host memory. Including things like plus, gather, reductions, and other things that you'd like to use as part of a model. Currently, the workaround is: use int64 for the values that you actually want executed on the GPU, and use int32 only for your shapes. We have a fix forthcoming. We already have code, in both Grappler for graph mode and the eager placer for eager mode, that uses some heuristics and estimates of the cost of transfer versus the cost of computation to try to keep small integer computations on the CPU, where they belong, instead of bouncing them back and forth to the GPU. But there are still some performance regressions that prevent these things from being turned on by default. We expect it to happen very soon.

So this is it. Do you have any questions?

AUDIENCE: Could you maybe comment on the difference between the kernel cache and the caching that we do for functions in the Python layer?

ALEX PASSOS: So the Python layer does a caching for functions, from the-- like, here is a Python function and all this metadata that's completely invisible to the runtime-- to a concrete function definition. When you go execute that concrete function definition, we emit our PartitionedCall op, or StatefulPartitionedCall op, and that then hits the TFE execute. And the first time the op has to execute, there's a lot of initialization that has to happen inside the runtime. That initialization is-- a little bit of it is covered by the kernel cache, which mostly tries to figure out what device it's going to be on and the details of the attributes and things like that.
And then the first time you actually execute it, we hit the cache inside the FunctionLibraryRuntime, which is the thing that guards Grappler and all the other graph processing transformations we have to do. So it's a few layers of caches to make things fast. I don't really know how we could possibly merge these caches across these layers of the stack, but maybe if we unify more things, this is going to be possible.

AUDIENCE: And maybe it might be worth also looking-- if you can comment a little bit about-- you said the kernel cache is something that is very unique to eager execution, versus something that we'd have in graph mode?

ALEX PASSOS: No, we already had a kernel cache in graph mode. I mean, eager execution has a slightly different kernel cache, but we already needed a kernel cache in graph mode, too, because creating a kernel might require allocating memory, which might be expensive. It definitely requires allocating a vtable pointer, but for some kernels, you have to allocate a lot more than that.

AUDIENCE: We just had some cases where there was memory bloat because of the kernel cache. Is there anything that users need to be aware of for that, or is it something that the runtime needs to--

ALEX PASSOS: Hopefully our users should not have to be aware of it. There are some cases now where you have to be aware of it, like if you're using the V1 random number generator ops, the only way to reset the random seed requires resetting the kernel cache, because the state is kept inside the kernels. As we move to the V2 random number ops, we don't make this type of mistake, and so the space overall taken by the kernel cache should become much smaller. I also think that for the particular case of functions, we should be able to garbage collect that cache a little better. Yeah. OK. Oh, yes.

AUDIENCE: So when XLA is enabled, we don't have this int32 shape problem anymore, right?

ALEX PASSOS: XLA has a different notion of kernel and computation, so by giving up the possibility of having any dynamic shapes at all, it can effectively-- XLA only works when you can ConstantFold all the shape tensors away, and then it doesn't matter whether you ConstantFolded them on the GPU or on the CPU, because they're not there. This is only a problem when you have runtime dynamic shapes.

AUDIENCE: Can you comment on the garbage collection in the runtime?

ALEX PASSOS: What part--

AUDIENCE: In terms of what is garbage collected and what is that garbage--

ALEX PASSOS: Most of our runtime is ref counted, not garbage collected.

AUDIENCE: Oh, ref counted.

ALEX PASSOS: Yeah. There's a class in TensorFlow, in core/lib/core/refcount.h. It's a base class for a ref counted pointer, and we use that instead of shared_ptr because it has a smaller memory footprint and better cache locality behavior. So you should be able to just read that and find the places in the runtime that inherit from it, and you can see which things are ref counted.

AUDIENCE: But we currently have no garbage collection for the kernel cache.

ALEX PASSOS: The kernel cache is not garbage collected, correct. But almost everything that can be ref counted already is ref counted. A few things, like the kernel cache, are not, because ref counting caches feels weird. But in some cases, like when you're caching the kernel for a function, it actually makes sense to ref count it.

AUDIENCE: Can we put a limit on that kernel cache?

ALEX PASSOS: In principle we could, yes. It's, you know, a memory versus performance tradeoff.
Assuming we are not dealing with the v1 random number ops, because with those, if a kernel is evicted from the cache, you change the sequence of random numbers you would get, and that's pretty bad. OK, thank you.

[APPLAUSE]

[MUSIC PLAYING]