FRANCOIS CHOLLET: Hello, everyone. I'm Francois, and I work on the Keras team. I'm going to be talking about TensorFlow Keras. So this talk will mix information about how to use the Keras API in TensorFlow and how the Keras API is implemented under the hood. We'll cover an overview of the Keras architecture. We'll do a deep dive into the Layer class and the Model class. We'll have an overview of the functional API and a number of features that are specific to functional models. We'll look at how training and inference work. And finally, we'll look at custom losses and metrics.

So this is the overview of the Keras architecture and all the different submodules and the different classes you should know about. The core of the Keras implementation is the engine module, which contains the base Layer class from which all layers inherit; the Network class, which basically models a directed acyclic graph of layers; the Model class, which takes the Network class and adds training, evaluation, and saving on top of it; and the Sequential class, which is, again, another type of model, one that just wraps a list of layers. Then we have the layers module, where all the actual, usable layer implementations go. We have losses and metrics, with a base class for each and a number of concrete instances that you can use in your models. We have callbacks, optimizers, regularizers, and constraints, which are similar modules. In this presentation, we'll go mostly over what's going on in the engine module, and also losses and metrics -- not so much callbacks, optimizers, regularizers, and constraints. In general, for any of these topics, you could easily do a one-hour talk, so I'm just going to focus on the most important information.

So let's start with the Layer class. The layer is the core abstraction in the Keras API. I think if you want to have a simple API, then you should have one abstraction that everything is centered on, and in the case of Keras, it's the layer. Pretty much everything in Keras is a layer or something that interacts closely with layers, like models. So a layer has a lot of responsibilities, lots of built-in features. At its core, a layer is a container for some computation. It's in charge of transforming a batch of inputs into a batch of outputs. Very importantly, this is batchwise computation, meaning that you expect N samples as inputs and you're going to be returning N output samples, and the computation should typically not involve any interaction between samples. It's meant to work with eager execution as well as graph execution. All the built-in layers in Keras support both, but user-written layers could potentially be eager-only. We support having layers that only work in one of the two modes--

AUDIENCE: So this would mean that user-written layers can support either graph or eager?

FRANCOIS CHOLLET: Yes.

AUDIENCE: Yeah, OK.

FRANCOIS CHOLLET: That's right. And typically, most layers are going to be supporting both. If a layer only supports eager, it typically means that it's doing things that are impossible to express as graphs, such as recursive layers -- tree-LSTMs, for instance. This is actually something that we'll cover in this presentation. So, yeah, layers also support two modes -- a training mode and an inference mode -- in which they do different things. Examples of layers that behave differently in each mode are the dropout layer and the batch normalization layer.
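For instance, a minimal illustration of the two modes with the built-in Dropout layer (the rate and shapes here are arbitrary, just for illustration):

    import tensorflow as tf

    layer = tf.keras.layers.Dropout(0.5)
    x = tf.ones((2, 4))
    y_train = layer(x, training=True)   # training mode: roughly half the entries are zeroed out
    y_infer = layer(x, training=False)  # inference mode: inputs pass through unchanged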
There's support for built-in masking, which is about marking certain timesteps in your inputs as something you want to ignore. This is very useful, in particular, if you're doing sequence processing with sequences where you have padded time steps or where you have missing time steps.

A layer is also a container for state, meaning variables. In particular, there's trainable state -- the trainable weights of the layer, which are what parametrizes the computation of the layer and what you update during backpropagation -- and the nontrainable weights, which could be anything else that is manually managed by the layer implementer. It's also potentially a container that you can use to track losses and metrics that you define on the fly during computation. This is something we'll cover in detail.

Layers can also do a form of static type checking. There is infrastructure built in to check the assumptions that the layer is making about its inputs, so that we can raise nice and helpful error messages in case of user error. We support state freezing for layers, which is useful for things like fine-tuning, and transfer learning, and GANs. You have infrastructure for serializing and deserializing layers and saving and loading their state. We have an API that you can use to build directed acyclic graphs of layers. It's called the functional API. We'll cover it in detail. And in the near future, layers will also have built-in support for mixed precision.

So layers do lots of things. But they don't do everything. They have some assumptions, some restrictions. In particular, gradients are not something that you specify on the layer. You cannot specify a custom backwards pass on a layer, but this is something we're actually considering adding, potentially -- something like a gradient method on the layer. So it's not currently a feature. Layers do not handle most low-level considerations, such as device placement, for instance. They do not generally take distribution into account, so they do not include distribution-specific logic. At least, that should be true. In practice, it's almost true. So they're as distribution-agnostic as possible. And very importantly, they only support batchwise computation, meaning that anything a layer does should start with a tensor -- or a nested structure of tensors -- containing N samples and should also output N samples. That means, for instance, you're not going to do non-batch computation, such as bucketing samples of the same length when you're doing time-series processing. You're not going to process [INAUDIBLE] data sets with layers. You're not going to have layers that don't have an input or don't have an output, outside of one very specific case, which is the input layer, which we will cover.

So this is the most basic layer you could possibly write: a constructor in which you create two tf.Variable instances, say these variables are trainable, and assign them as attributes on the layer. And then it has a call method, which essentially maps the batch of inputs to a batch of outputs via this batchwise computation -- in this case, just w x plus b. So what happens when you instantiate this layer is that it's going to create these two variables and set them as attributes. And they are automatically tracked in this list, trainable_weights. And when you call the layer using the __call__ operator, it's just going to defer to this call method. In practice, most layers you're going to write are going to be a little bit more refined.
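Before moving on to the refined version, here is a minimal sketch of the kind of basic layer just described (the names and sizes are illustrative):

    import tensorflow as tf

    class Linear(tf.keras.layers.Layer):
        def __init__(self, units=32, input_dim=32):
            super().__init__()
            # Create the two variables eagerly in the constructor and assign them as
            # attributes; they get tracked automatically in trainable_weights.
            self.w = tf.Variable(tf.random.normal((input_dim, units)), trainable=True)
            self.b = tf.Variable(tf.zeros((units,)), trainable=True)

        def call(self, inputs):
            # Batchwise computation: N input samples in, N output samples out.
            return tf.matmul(inputs, self.w) + self.b

    layer = Linear(units=4, input_dim=2)
    y = layer(tf.ones((3, 2)))               # __call__ defers to call()
    assert len(layer.trainable_weights) == 2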
They're going to look like this. So this is a lazy layer. In the constructor, you do not create weights. And the reason you do not create weights is because you want to be able to instantiate your layer without knowing what the input shape is going to be. Whereas in the previous case, here -- so this is the previous slide -- you had to pass the input dimension as a constructor argument. In this case, you don't have to do this, because you're going to create the state in a build method, which takes an input_shape argument. When you instantiate the layer, it does not have any weights. And when you call it for the first time, the __call__ operator is going to chain the build method that you have here and the call method. In the build method, you see, we use this add_weight shortcut. It's basically just a slightly shorter version of creating a variable and assigning it on the layer.

OK. Layers can also have nontrainable state. Trainable state is just variables that are tracked in trainable_weights; nontrainable state is tracked in non_trainable_weights. It's very simple. So in this layer, in the constructor, you create this self.total scalar variable that starts at 0. You specify that it's nontrainable. And in the computation method, you just update this variable. Basically, you just keep track of the total sum of the inputs seen by this layer. So it's a kind of useless layer. And as you see, every time you call this layer, the value of this variable is updated.

Layers can be nested. You can set layer instances as attributes of a layer, even if they're unbuilt, like here. And when you do that, the outer container -- in this case, the MLPBlock instance -- is going to keep track of the trainable weights and nontrainable weights of the underlying layers. And all these layers, which are unbuilt in the constructor, are going to be built -- so have their variables created -- the first time you call the outer instance.

And this is the most basic way in which you can be using a layer. You would just instantiate it, grab some loss function, which could be anything, grab some optimizer, and iterate over some data -- so we have input data and targets. Open a GradientTape. Call the layer inside the GradientTape, so that the operations done by the call method are recorded on the tape. Call your loss function to get some loss value. And then you use the tape and the loss value to retrieve the gradients of the trainable state of the layer. Then you apply these gradients. So this is a full, end-to-end training loop.

By that point, you know about layers, which are containers for state and computation; you know about trainable state and nontrainable state; you know about nesting layers; and you know about training them with this kind of loop. Typically, you would put part of the loop -- everything starting with opening the GradientTape and ending with applying the gradients -- in a tf.function to get graph execution and faster performance. So when you know all these things, you can use the Keras API to implement literally anything. However, it's going to be a fairly low-level way of implementing things. So by that point, you know everything -- except you actually know nothing yet. So let's go further. One neat feature is that you can use layers to keep track of losses and also metrics that you define on the fly, during computation.
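Before getting to that, here is a rough sketch of the training loop just described, assuming the Linear layer from before and some iterable `dataset` yielding (inputs, targets) batches:

    loss_fn = tf.keras.losses.MeanSquaredError()
    optimizer = tf.keras.optimizers.SGD(learning_rate=1e-2)

    @tf.function  # optional: compile the step into a graph for faster execution
    def train_step(x, y):
        with tf.GradientTape() as tape:
            outputs = layer(x)              # forward pass, recorded on the tape
            loss = loss_fn(y, outputs)
        grads = tape.gradient(loss, layer.trainable_weights)
        optimizer.apply_gradients(zip(grads, layer.trainable_weights))
        return loss

    for x_batch, y_batch in dataset:
        loss = train_step(x_batch, y_batch)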
Let's say, for instance, that with our linear layer, right after we compute w x, we want to add an activity regularization loss on that value, which is just going to be the sum of the output -- so it's not actually a great loss, because it should probably be the sum of squares instead of just the sum, but whatever -- times some factor. When you have a layer like this, every time you call it, the scalar tensor that you added here in this add_loss call is going to be tracked in this layer.losses list. And every time you call the layer, this list gets reset. When you have nested layers, the outer container is going to keep track of the losses of the inner layers. And you can call the inner layers multiple times; it's not going to reset the losses until you actually call the outer container.

So the way you would use this feature is something like this. After you open your GradientTape, call your layer, and compute the main loss value, you add to this loss value the sum of the losses collected during the forward pass. You can use this feature to do things like weight regularization, activity regularization, computing things like the KL divergence -- all kinds of losses that are easier to compute when you have access to intermediate results in the--

AUDIENCE: Just a question. So if you call add_loss in an inner layer -- a layer that is contained in another layer -- does it call add_loss on the outer layer too?

FRANCOIS CHOLLET: Yes. So for instance, if the linear layer has multiple layers inside it, when you retrieve this -- this should say the outer layer's losses, not [INAUDIBLE] losses, whatever -- it's going to recursively retrieve all the losses, including the top-level losses.

AUDIENCE: So, I guess my question is, does a layer know that it's being called from inside the--

FRANCOIS CHOLLET: That's correct, meaning that when it's called from inside, you can accumulate multiple terms. It's not going to reset the losses. The losses are only reset when you call the top-level container. So there is a call-context thing.

AUDIENCE: That's-- I would expect it to be reset every time you call it, but the parents' losses [INAUDIBLE].

FRANCOIS CHOLLET: If you did that, you could not share a layer that creates losses inside a bigger model.

AUDIENCE: I mean, I guess I was thinking that the inner layer would reset, but the outer layer would not reset. So it would keep-- as long as all the inner layer losses [INAUDIBLE].

FRANCOIS CHOLLET: So they're gathered on the fly. So that's not exactly accurate. But yeah, anyway, so yeah.

AUDIENCE: How does the resetting happen? Can you explain?

FRANCOIS CHOLLET: Yeah, sure. Basically, it's called at the end of __call__ for the outer container. And it's called recursively, so it's going to clear the losses of all the inner layers. If you want to do it manually, all layers and models have a reset-losses method -- I believe that's what it's called -- that you can use to force-clear all the losses, which could be useful, for instance, if you have multiple calls of the same model. [INAUDIBLE] potentially the [INAUDIBLE] use case could be-- anyway.

AUDIENCE: Sorry, so I didn't understand how reset losses is not called. How does a layer know that it's being called from an outer layer?

AUDIENCE: In __call__, there's basically a context manager that sort of says you're in __call__.
And so that's why, as you go down the line, if you're calling a layer that's already being called inside another layer, it can use that context manager to know whether it's the top-level call.

AUDIENCE: OK.

FRANCOIS CHOLLET: So, yeah. Layers also support serialization. If you want to make a layer serializable, you just implement a get_config method, which typically just packs the constructor arguments into a dictionary. And when you've implemented this get_config method, you can serialize your layer as this [INAUDIBLE] config dict, which is JSON-serializable. And you can use it to re-instantiate the same layer. This does not keep track of the state of the layer, meaning the values of the weights. That is done separately.

And layers also support two modes -- training mode and inference mode. If you want to use this feature, you would have a training argument in call. So this is a very simple example of a BatchNormalization layer where, when you're in training mode, you're going to be computing the mean and variance of the current batch, you're going to use these statistics to normalize your inputs, and you're going to be updating the moving mean and variance on the layer, which are nontrainable weights here. And if you're in inference mode, you're just going to use the moving statistics to normalize your inputs.

So now let's move on to the Model class. You saw in one of the previous examples that layers can be nested. If you just switch, in this example, the MLP class to inherit from the Model class instead of the Layer class, then essentially nothing changes, except that now you have access to a training, evaluation, inference, and saving API. So once you've inherited from Model, you can do things like mlp.compile with an optimizer and a loss instance. Then you can call fit, which is going to automatically iterate over this data set and minimize this BinaryCrossentropy-from-logits loss using the Adam optimizer. It's going to iterate 10 times over the data set. And you can also save the state of this model with this mlp.save method.

So what's the difference between the layer and the model? In short, it's that a model handles top-level functionality. A model is a layer, so it does everything a layer does in terms of network construction. It also has these compile, fit, evaluate, and predict methods. It also handles saving: when you call save, that includes not only the configuration of the model, like the get_config thing we saw previously, but also the state of the model -- the values of the weights -- as well as the optimizer that the model was compiled with and the state of that optimizer. It also supports some basic forms of model summarization and visualization. You can call model.summary(), which is going to print a description of all the layers inside the model, the number of parameters that the model uses, and so on.

In short, the Layer class corresponds to what is usually referred to as a layer, as when you talk about a convolution layer, a recurrent layer, and so on. It can also be used for what is sometimes called a block, like a ResNet block or an Inception block. So a layer is basically either a literal layer or a block in a bigger model. And the model is really the top-level thing -- the outer thing -- what people refer to as a model or a network.
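A minimal sketch of that top-level workflow on a subclassed model (the MLP class and dataset are assumed from the earlier example):

    mlp = MLP()                          # subclasses tf.keras.Model
    mlp.compile(optimizer=tf.keras.optimizers.Adam(),
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
    mlp.fit(dataset, epochs=10)          # iterates 10 times over the dataset
    mlp.summary()                        # per-layer description and parameter counts
    mlp.save("mlp_model")                # configuration + weights + optimizer state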
So typically, what you will do is use the Layer class to define inner computation blocks and use the Model class to define the one outer model -- the thing that you're actually going to be training and saving and exporting for production. For instance, if you are working on the ResNet50 model, you'd probably have several ResNet blocks subclassing the Layer class, and then you would combine them into one big subclassed Model, on which you would be able to call compile and fit and save and so on.

One thing that's specific to TensorFlow 2, and that not many people know about, is that by default, when you call compile and fit, you're going to be using graph execution, because it's faster. However, if you want to execute your model eagerly, you can pass this run_eagerly argument in compile. You can also just set it directly as an attribute on the model instance. When you do that, all your call methods -- in the top-level MLP model and in all the inner layers -- are going to be executed eagerly. If you don't do this, by default, you're going to be generating a graph function that does the computation, which is faster. So--

AUDIENCE: Before you go on, I had a question. Could you explain the difference between compile and fit? Like what goes in compile and what goes in fit? I don't feel like that's a--

FRANCOIS CHOLLET: Right. So compile is about configuring the training process. You specify which optimizer you want to use, which loss you want to minimize, which metrics you want to track -- anything that's going to modify the way the computation is done. It's going to modify the execution function, which is something we're going to go over in more detail. And in fit, you're passing the data itself and information about how you want this data to be processed: how many times you want to iterate over the data, potentially the size of the batches you want to use to iterate over the data, which callbacks you want to be called at different stages of training, and so on. So it's configuration in compile, and basically passing the data and related metadata in fit. Typically, you compile your model once, but you might be calling fit multiple times.

AUDIENCE: So the way to think about it is, in TF1 style, the stuff that goes in compile is the stuff that you'd use when you're building your graph, and the stuff that goes in fit is basically session.run?

FRANCOIS CHOLLET: Yes. That's correct.

So let's move on to functional models. In the previous example, you saw a subclassed model -- essentially something that you wrote by subclassing the Model class. In practice, very often in the literature, you see deep learning models that look like this -- they look like directed acyclic graphs. So on top, you have [INAUDIBLE]. At the bottom, you have various bits of a Transformer. These are directed acyclic graphs of layers that are connected with these arrows. And there's an API in Keras for configuring the connectivity of a directed acyclic graph of layers. It's called the functional API. It looks roughly like this. You start with input nodes, which are these Input objects. And then you're going to be calling layer instances on such an object. You can think of this Input object as a spec describing a tensor. It doesn't actually contain any data. It's just a spec specifying the shape of the input you're going to be expecting and the data type of the input you're going to be expecting, maybe annotated with a name.
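For example, a sketch of such an input node (the shape, dtype, and name here are arbitrary):

    inputs = tf.keras.Input(shape=(784,), dtype="float32", name="digits")
    print(inputs.shape)   # (None, 784): a symbolic spec, with no actual data behind it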
And every time you call a layer, that's roughly the action of drawing an arrow from one layer instance to the next. So here, you're going to have one input node and two layers. Here, you are drawing an arrow from the input node to the first layer. Here, you're drawing an arrow from the previous layer to this new layer. And finally, you do the same for the output layer. Then you can instantiate a model by just specifying its inputs and its outputs. And what it's going to do when you do this is basically build this directed acyclic graph, right here. You can actually plot this graph: you can call utils.plot_model on your model instance, and it's going to generate this image.

So a functional model is basically just like any other model, but it's a model that you do not write yourself. It has a bunch of methods that are autogenerated. In particular, the call method is autogenerated, the build method is autogenerated, and the get_config serialization method is autogenerated. Yes?

AUDIENCE: You said the input does not have data. But could it have data? Like if you wanted to be able to check your work as you went along.

FRANCOIS CHOLLET: No, it could not. When you define your model, you're not actually executing anything. You're just configuring a graph. It's going to be running a bunch of checks, but they're only compatibility checks -- essentially checking that the layers that you are connecting together are compatible, that the data can actually be transmitted. But there's no execution, so there's no notion of data. It's really just an API for building a DAG.

So, yeah, for instance, the call method is autogenerated. It's just going to be something you could call a graph-of-layers executor. When you call your model, it's going to be basically running through this graph of layers, calling each layer in succession with the proper inputs. And likewise, assuming that each layer in your graph of layers implements its get_config method, then you can call get_config on your DAG and get something that you can use to re-instantiate the same model.

AUDIENCE: Excuse me. So can we go back to [INAUDIBLE]. On the line, y equals model of x -- is that x the output of the input layer above?

FRANCOIS CHOLLET: x is the output of the layer. So you start by instantiating this Input object using tf.keras.Input, which is basically a spec for a tensor. It has a--

AUDIENCE: This-- the x here is different than the x above.

FRANCOIS CHOLLET: Oh, right. Yeah, sorry. This is confusing. Yeah. This is supposed to be a tensor -- an eager tensor, for instance. Right. Sorry. It's a bit confusing. So you can call your model like you would call any other model instance, or any function, on a bit of data, and it's going to return a bit of data.

So what does the functional API do, roughly? Well, it's really an API for configuring DAGs. It's targeted more at users than developers. People who are very comfortable with writing classes and functions in Python will have no issues using model subclassing to create new models. But when you want more of a building-blocks API and more handholding, then the functional API is actually a lot easier to use. And it maps very closely to how you think about your models, even if you're not a Python developer, because you typically think about them exactly like that -- in terms of DAGs of layers. It's declarative, meaning that you write no logic, you write no functions, you subclass nothing. It's really just configuration.
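Putting that together, a sketch of a small functional model along the lines described (layer sizes are arbitrary):

    inputs = tf.keras.Input(shape=(784,), name="digits")
    x = tf.keras.layers.Dense(64, activation="relu")(inputs)   # arrow: input node -> first layer
    x = tf.keras.layers.Dense(64, activation="relu")(x)        # arrow: previous layer -> this layer
    outputs = tf.keras.layers.Dense(10)(x)                     # arrow to the output layer
    model = tf.keras.Model(inputs=inputs, outputs=outputs)

    tf.keras.utils.plot_model(model, show_shapes=True)         # renders the DAG as an image
    config = model.get_config()                                # autogenerated serialization
    same_model = tf.keras.Model.from_config(config)

    y = model(tf.ones((1, 784)))                                # calling it on real data runs the DAG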
So all the logic that's going to be executed when you actually call your model on some data is contained inside the layers. Typically, you don't write it. That means that all the debugging that happens when you build these models is done statically, at construction time, in the form of compatibility checks between layers. It means that any functional model that you can instantiate without the framework complaining is a model that's going to run. You don't write any actual Python code. You're just connecting nodes in a DAG, meaning you don't actually write bugs. The only kind of bugs you can write are misconfigurations of your DAG topology, which calls for what we call topology debugging. And that can be done visually, by literally plotting your graphs of layers and looking at how they're connected.

And on the [INAUDIBLE] side, these models can be expressed as static data structures that can generate code -- that can generate a call method -- meaning that they're inspectable. For instance, you can retrieve intermediate activations in your DAG and use them to build a new model. That's actually very useful when you do transfer learning, because it makes it very easy to do feature extraction. It means your models are also plottable: you can actually generate these little graphs, because you have a literal data structure behind the model. And you can serialize them. When you have a subclassed model, your model topology is actually a bit of Python bytecode, which is harder to serialize. You have to [INAUDIBLE] it.

There's one very important restriction when you're writing layers that should be compatible with the functional API, which is that all the inputs of your layer should be contained in the first argument -- meaning that if you have multiple inputs, you should use a [INAUDIBLE] argument, or maybe a dictionary argument if you have many inputs and they have names.

AUDIENCE: Tensor inputs?

FRANCOIS CHOLLET: Yes, tensor inputs. So essentially, anything that comes from another layer -- anything that you want to transfer along these arrows. Each arrow corresponds to the first argument in the call of the layer. This is a restriction that also exists in Torch 7, in case you know it. A lot of the people who have been criticizing this restriction are people who say Torch 7 is the pinnacle of deep learning APIs, so I think this is funny.

OK. So what actually happens when you build this functional model? Let's go into detail. When you instantiate this Input object -- this spec object for a shape and dtype -- it also creates an input layer, which is basically a node in your graph of layers. And this Input object is going to have a _keras_history metadata field on it that tracks who created it, where it comes from. And every time you call a layer instance on one of these spec objects, what it's going to do is return a new spec object -- or a nested structure of spec objects -- with the inferred shape and dtype corresponding to the computation that would normally be done by the layer. In practice, no actual computation is happening when you do this. And this output has an updated _keras_history metadata field that tracks the node that you just created in your graph of layers. And finally, when you instantiate your model from these inputs and outputs, you're going to recursively retrieve every node that has been created and check that they actually do form a valid DAG of layer calls.
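Going back to the first-argument restriction mentioned above, here is a sketch of a layer whose multiple tensor inputs are packed into its first call argument (here a list):

    class PairwiseConcat(tf.keras.layers.Layer):
        def call(self, inputs):
            # All tensor inputs arrive in the single first argument.
            a, b = inputs
            return tf.concat([a, b], axis=-1)

    in_a = tf.keras.Input(shape=(16,))
    in_b = tf.keras.Input(shape=(8,))
    merged = PairwiseConcat()([in_a, in_b])       # one "arrow" carrying both inputs
    model = tf.keras.Model([in_a, in_b], merged)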
AUDIENCE: So this _keras_history field, does that have, transitively, a reference to all the layers that were--

FRANCOIS CHOLLET: So it's lightweight content. _keras_history is basically the coordinates of your tensor in a discrete 3D space. It's basically a named tuple with three entries. The first one is the reference to the layer that created this tensor -- this spec. The second one is the node_index, because layers can be called multiple times. So it's not the case that one node corresponds to one layer; a node is instantiated by a layer call. If you call a layer instance multiple times, there are multiple nodes. So this node_index is basically the index of the node created by the layer, as referenced in layer._output_nodes. And finally, there's a tensor_index. The tensor_index is there to handle the case of multi-output layers. If you have a layer with a bunch of tensor outputs, what we're going to do is deterministically flatten these outputs into a list and then index into this list. And this tensor_index tells you the index of this specific tensor among the outputs returned by this layer call.

AUDIENCE: Can you [INAUDIBLE] -- if I just call, like, tf.relu on it, will it still populate the _keras_history and--

FRANCOIS CHOLLET: Not immediately. So tf.relu -- tf.nn.relu -- is not going to create this _keras_history object. But when this object is seen again by a layer, it's going to walk the graph and check whether it was originally coming from a layer. And if it was, then it's going to wrap the various ops that did not come from layers into new layers, so that you can actually step out of the Keras graph of layers and insert any TensorFlow op. And each TensorFlow op is going to be treated as its own node in the graph of layers.

AUDIENCE: So two questions. One is, let's say that happened twice. Would there be two Relu layers created? Or would it--

FRANCOIS CHOLLET: Yes. So it's one layer per op.

AUDIENCE: --same node in the graph?

FRANCOIS CHOLLET: Yeah. So one node corresponds to one call of the layer. So every time you call a layer -- even if it's the same layer -- that's a new node, because it has to be a DAG.

AUDIENCE: OK. So I guess when you say Relu, it creates a tensor, right?

FRANCOIS CHOLLET: Yeah.

AUDIENCE: And that tensor is then passed to another layer?

FRANCOIS CHOLLET: Yeah.

AUDIENCE: At that point, does it create a layer for the Relu?

FRANCOIS CHOLLET: That's correct. Yes.

AUDIENCE: So right then. And then if I were to pass that output of the Relu to another layer -- not the first layer -- would it create another layer for the-- [INTERPOSING VOICES] --reuse the--

FRANCOIS CHOLLET: I believe it does not recreate the layer, so it reuses the previous layer. But I will have to check.

AUDIENCE: This is all making me wonder about the lifetimes of all these layers. Are references to all these layers just kept forever? Or is that--

FRANCOIS CHOLLET: Yes. They are kept forever.

AUDIENCE: Layers are never garbage collected.

FRANCOIS CHOLLET: Yes. There's actually a utility in the backend to force-destroy all of this. It's called clear_session.

So, yeah, this illustrates the importance of having these three coordinates. This is some random variation of an autoencoder example I took from the internet. It's interesting because it shows layers that have multiple inputs and multiple outputs.
And, like, for instance, the outputs of this layer are going to be going into-- one of the outputs is going to be going into this other layer, and one of the other outputs is going to be going further downstream. So with these three coordinates, you can handle completely arbitrary graph topologies.

So there are a lot of Keras features that are specific to these functional models: in particular, the ability to do static compatibility checks on the inputs of a layer; the ability to do whole-model saving, meaning saving a file that enables you to re-instantiate the exact same Python objects with the same state; the ability to do model plotting, which is something we just saw; automatic masking, which is something we'll cover in detail; and dynamic layers, which is something that's not really relevant if you're not using the functional API.

So let's talk about static type checking. When you create a custom layer, you can set an input_spec field on it, which has to be an instance of this InputSpec object -- or maybe a nested structure of InputSpec objects -- and which describes the expectations that the layer has with regard to the inputs you're calling it on. And when you call your layer for the first time -- you instantiate it here, and you call it here -- first, it's going to check that this input is compatible with the current input spec, which was set here in the constructor, and which just says the tensor should have at least rank 2. Then it's going to actually build the layer. And here, in this build method, the input shape that's passed gives us more refined information about the expectations of the layer, so we update the input spec: not only should the input be at least rank 2, but the last axis -- axis minus 1, the last axis -- should have exactly this value. So after build, it's going to recheck that the updated input spec is compatible with this input -- that it has the right last dimension. And finally, it's going to call the [INAUDIBLE]. So every time you call a layer in the functional API to build these graphs, if the layer has set this input_spec object, it's going to be checking compatibility and raising very detailed -- and therefore helpful -- error messages in case of incompatibility. It's going to tell you what you passed, what was expected, and what you can do to fix it.

You can also do whole-model saving, meaning that you have this get_config method that is autogenerated. You can use it to recreate the same model -- in this case, without the state. You can also just call save. When you call load_model, then this object is pretty much the exact same Python object, with the same topology and the same state. And you can load it across platforms. For instance, you can also load it in TensorFlow.js. Yes?

AUDIENCE: If I created my own custom layer, do I need to do something special--

FRANCOIS CHOLLET: Absolutely. If you want to write custom layers and reload your model in a completely different environment, that environment should have access to some implementation of your custom layer. If it's a Python environment, then you basically just need to make sure the code is available, and you would wrap your load_model call inside a scope where you specify the custom objects you want to be taken into account during the deserialization process. If you want to load your model in TensorFlow.js, for instance, you first have to write a JavaScript/TypeScript implementation of your layer.
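A rough sketch of the input-spec mechanism and of the custom-objects scope described above (the layer, its sizes, and the file path are made up):

    from tensorflow.keras.layers import InputSpec

    class MyDense(tf.keras.layers.Layer):
        def __init__(self, units, **kwargs):
            super().__init__(**kwargs)
            self.units = units
            # Coarse expectation known at construction time: at least rank 2.
            self.input_spec = InputSpec(min_ndim=2)

        def build(self, input_shape):
            last_dim = int(input_shape[-1])
            self.kernel = self.add_weight(shape=(last_dim, self.units))
            # Refine the expectation once the last axis is known.
            self.input_spec = InputSpec(min_ndim=2, axes={-1: last_dim})

        def call(self, inputs):
            return tf.matmul(inputs, self.kernel)

    # Deserializing a saved model that uses the custom layer:
    with tf.keras.utils.custom_object_scope({"MyDense": MyDense}):
        restored = tf.keras.models.load_model("my_model")   # hypothetical saved path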
And, yeah, model plotting, which is something we already saw. It's pretty useful when you want to check the correctness of the DAGs that you're building -- unless they're too large, in which case it's not so great. This is just a very simple three-input, two-output model, where you have some title field, some body field, some tags field. They go through some processing, and you end up with a priority prediction and a department prediction. This is just something from some random tutorial.

And then one neat feature is automatic masking. So let's go over that in detail. Here's just a simple end-to-end example. You start from an Input object that's going to be a variable-length sequence of ints. It's called word sequence -- it's just going to be a sequence of word indices. You embed it with this embedding layer, and in the embedding layer, you specify mask_zero equals True. So this layer is going to be generating a mask from the zero entries in any data you're going to be passing along this graph connection. And every time you call a layer that's compatible with masking, it's going to pass through this mask. And some layers, like the LSTM layer, are going to be mask consumers: they're going to be looking at the mask that's passed and use it to ignore the padded entries. And some layers -- for instance, an LSTM layer that does not return sequences and just returns a single vector per sample, encoding the entire sequence -- are going to be destroying the mask, so the next layer is not going to be seeing the mask. So when you do something like this, you're automatically telling the LSTM layer, which is significantly downstream from your embedding layer, to do the right thing with your sequences. So if you're not super-familiar with the specifics of masking, this is very simple and magical. You just say mask_zero at the start of your model, and suddenly, all the layers that should be aware of masking just do the right thing.

AUDIENCE: A little more detail about what-- what is masking? Is it like-- is there a vector of Booleans or something?

FRANCOIS CHOLLET: Yes, absolutely. So here's the detail. A mask is, indeed, a tensor of Booleans. Each sample has its own mask, and each mask is basically just a plain vector of ones and zeros. It does not make assumptions about things like padding, for instance -- you could mask completely arbitrary time steps.

So you have three types of layers that interact with masks. You have layers that will consume a mask. In order to consume a mask, you just specify this mask argument in the call signature, and this will be your batch of Boolean vectors. You can also pass through a mask. There's almost nothing you need to do, but it's opt-in -- you need to explicitly say the layer supports masking. The reason why this is opt-in is that many layers are going to be changing the shape of their inputs -- they're going to be returning outputs that are not the same shape as the inputs -- and this interacts with masking. So it's typically not safe to assume -- for instance, here -- that the mask that needs to be consumed by this LSTM layer is the same one that was generated by the embedding layer. In this case it happens to be, but if this dense layer had been doing anything to change the shape of the inputs, it would have been different. And then you have mask-generating layers. So for instance, the embedding layer does something like this.
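Roughly like the following -- a sketch of the mask-generating hook, not the real Embedding implementation:

    class ZeroMasking(tf.keras.layers.Layer):
        """Sketch of a mask-producing layer, loosely in the spirit of Embedding(mask_zero=True)."""

        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            self.supports_masking = True     # allow the mask to propagate downstream

        def call(self, inputs):
            return inputs                    # pass the data through unchanged

        def compute_mask(self, inputs, mask=None):
            # Boolean entries: True for non-zero (real) timesteps, False for padded zeros.
            return tf.not_equal(inputs, 0)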
It looks at the inputs and gives you a mask of Boolean entries that is one for all non-zero entries in your input. And in case you have a layer that modifies the shape of your inputs, it should also -- if it wants to be compatible with masking -- implement this compute_mask method. For instance, if you have a concatenate layer that takes multiple inputs -- some of which may be masked, some of which may not -- the output mask should be the concatenation of the different masks.

AUDIENCE: So that compute_mask method didn't use the mask argument. But normally you would?

FRANCOIS CHOLLET: Sorry, what? Oh, yeah, yeah. So for instance, the embedding layer is going to ignore the mask argument and just generate the mask based on the inputs. But if instead you have a concatenate layer, it's going to ignore the inputs and just generate the mask based on the prior mask argument.

So let's look in a lot of detail at how all of this is implemented. What happens when you call a layer instance on a bunch of symbolic inputs? The first thing we do is static type checking -- determining whether all the inputs are compatible with this layer. If the layer is unbuilt, we're going to build it, and potentially check input compatibility again, in case the input spec was updated. We're going to check whether this layer is a mask consumer -- does it have a mask argument in its call method? If yes, we're going to retrieve the mask associated with the inputs of the layer, which we do via metadata. We're going to open a graph scope and check if our layer is graphable -- this is a concept we're going to look at in more detail afterwards. If the layer can be turned into a graph, we're going to autograph the call method automatically -- this is in order to convert if statements, for instance, or for loops, into symbolic conditionals. We're going to call the call method that was autographed, using the proper mask and training arguments that we retrieved from the current context. If the layer happens to be dynamic, meaning that you cannot convert it to a graph, you're just going to return brand new symbolic tensors. And in order to know what shape and dtype these symbolic tensors should have, you're going to use the static shape inference method of your layer -- meaning that if you have a layer that's dynamic, that's nongraphable, and you want to use it in the functional API, it should implement this static shape inference method, compute_output_shape. For no other use case are you going to need compute_output_shape. Finally, you create the nodes in the graph of layers from this call. You set the metadata on the outputs -- which are either brand new symbolic tensors created using static shape inference, or the outputs of the actual graph-mode call -- so, the metadata about the node. And finally, you set the mask metadata, which is what the next layer is going to retrieve in case that layer is a mask consumer.

AUDIENCE: So what's happening in step 5? What is the graph scope that you're talking about?

FRANCOIS CHOLLET: So Keras maintains its own graph, which is a FuncGraph object. And when it creates a symbolic tensor, it's always in that graph. So before you do the graph-mode call, or before you instantiate new symbolic tensors, which are basically placeholders, first you need to open that graph scope.

AUDIENCE: Slight correction. It will enter that graph unless a different one has been specified.

FRANCOIS CHOLLET: Yes, which is only valid in V1.
In V2, typically, the only graph you're going to be manipulating is the Keras graph. Everything else is going to be either eager or a tf.function.

So we mentioned the notion of dynamic layers. What's a dynamic layer? Well, maybe you remember this BatchNormalization example. There's actually something very subtle going on with this BatchNormalization example, which means that it cannot be turned into a static graph. And that is because it uses this if/else statement, and inside one of the branches, it does variable updates, and in the other branch, it does not. And this actually does not play well with autograph. This is actually something that's fairly easy to fix, by simply having symmetrical conditional branches, where you have the same assign statements, assigning nothing new, in the other branch. However, for the sake of argument, let's say we cannot graph this layer. What are we going to do? Well, we are going to pass this dynamic equals True argument in the constructor. And that tells the framework that this layer is not compatible with graph execution -- it should never be executed in a graph. When you build a functional API model using this, it's going to do what we were mentioning in step 6: it's just going to use static shape inference to compute the outputs. And when you call fit, it's going to be using pure eager execution, without forcing you to specify [INAUDIBLE] equals True in compile. It's just automatically set to the right default.

AUDIENCE: So I think this one actually works, because it should retrace for different values of training.

FRANCOIS CHOLLET: This?

AUDIENCE: This particular example should work in a graph with--

FRANCOIS CHOLLET: Last time I checked, it would not work with autograph unless--

AUDIENCE: It's fine if training is a Python Boolean. But if it's a tensor, it's not.

FRANCOIS CHOLLET: Implicitly, it's actually a tensor, because-- so the reason why training is usually a tensor is that we use the same graph when you do fit and evaluate, and we change the value of training by feeding the value. Training is always symbolic. But yeah, you're right -- if it's a Python Boolean, this works fine. In fact, autograph is not even relevant, because the if statements are just going to be ignored by autograph.

AUDIENCE: What's the actual problem with [INAUDIBLE]? I thought it was fine to have the [INAUDIBLE] on one side of the [INAUDIBLE].

AUDIENCE: Is it that there isn't a corresponding output, and so autograph can't match the function signatures?

AUDIENCE: Aren't those not outputs of each branch, though? So it would just be the [INAUDIBLE] in-- It's like an API issue with [INAUDIBLE] or something.

AUDIENCE: Yes. It's a built-in issue. There's a long thread on this.

FRANCOIS CHOLLET: Potentially fixable. It's potentially fixable. Anyway, this was just for the sake of example. There are also layers that are more fundamentally nongraphable, like a tree-LSTM, for instance.

AUDIENCE: [INAUDIBLE] to that, basically, the second question is, why is it important that you use the same graph for both fit and evaluate? Given that, for instance, in graph mode, the training-versus-inference flag, like in the olden days, I think that was a Python Boolean, right?

FRANCOIS CHOLLET: Yes, that's correct. The reason why, historically, you had one graph for training and one graph for inference is that you would do inference on a separate machine that would load a checkpoint and run asynchronously compared to the main training.
That's what we're used to at Google. In Keras, however, you're going to be running evaluation at the end of each epoch -- so potentially, you're going to be running evaluation very often, and you're going to be doing that on the same machine. So it's actually quite a bit more efficient to use a single graph to do this instead of keeping two graphs around.

AUDIENCE: If you have something and you call batch norm on it, and they're in the same graph -- for instance, you don't have to declare at that time whether you're doing training or inference -- you can just have your tensor and do whatever you want downstream with it. Whereas, if you have separate graphs -- if you, for instance, output a result in training mode and a result in inference mode -- then the user has to track that. And it's just-- it's not as pleasant.

AUDIENCE: Certainly, it could be maybe different graphs that share all of the variables, or even different functions.

AUDIENCE: Different--

AUDIENCE: Yeah.

AUDIENCE: Is this something that's maybe an implementation detail that--

FRANCOIS CHOLLET: It is very much an implementation detail. We could choose to generate a new graph when you do evaluation. Again, the reason this decision was made initially is really efficiency, because two graphs is actually-- [INAUDIBLE] a graph this big. And even though the two graphs are almost entirely redundant, they only differ by a few ops for the dropout and batch norm layers. So we saw that having a symbolic training argument is actually much more efficient.