FRANCOIS CHOLLET: So last time, we talked about a bunch of things. We talked about the functional API for building graphs of layers. We talked about features that are specific to the functional API-- things like static input compatibility checks across layers every time you call a layer, whole-model saving, model plotting, and visualization. We talked about how masking works in the functional API and about how masks are propagated. So for instance in this example, the Embedding layer is going to be generating a mask here because you passed this argument mask_zero=True. And this mask is going to be passed to every subsequent layer, and in particular to layers that consume the mask, like this LSTM layer here. However, this LSTM layer, because it's a sequence reduction layer, is not going to be returning the whole sequence, but only the last output. This is going to destroy the mask, and so the next layer is not going to see a mask anymore. So it's really a way to handle masking that works basically magically. Masking is pretty advanced, so most people who need it are not actually going to understand very well how it works. And the idea is to enable them to benefit from masking by just saying, hey, in this Embedding layer, I want the zeros to mean "this is masked," and then everything in their network is going to magically know about this, as long as they are using built-in layers. If you're an implementer of layers, you actually need to understand how it works. First of all, you need to understand what you should be doing if you're writing a layer that consumes a mask, like an LSTM layer, for instance. It's very simple. You just make sure you have this mask argument in the call signature, and it expects a structure of tensors that's going to match the structure of your inputs. So if your input is a single tensor, the mask is going to be a single tensor. And this single tensor is going to be a Boolean tensor, where you have one mask entry per timestep per sample. So it's typically a 2D Boolean tensor. If you have a layer that can safely pass through a mask-- for instance, a Dense layer, or in general any layer that does not affect the time dimension of its inputs-- you can just enable your layer to pass through its mask by setting supports_masking. It is opt-in because a lot of the time, layers might be affecting the time dimension of the inputs, in which case the meaning of the mask would be changed. And if you do have a layer that changes the time dimension, or otherwise a layer that creates a mask from input values, it's going to need to implement this compute_mask method, which is going to receive the mask and receive the inputs. If, for instance, you have an Embedding layer, it's going to be doing this-- not_equal(inputs, 0). So it's going to be using the input values to generate a Boolean mask. If you have a Concatenate layer, for instance, it's not going to be looking at the input values, but it needs to look at the masks-- the two masks being passed-- and concatenate them. And if one of the masks is None, for instance, it's going to have to generate a mask of 1's. OK, so that's the very detailed view. Yeah? AUDIENCE: So just maybe a little bit more detail about masking-- so if you say supports_masking is true, like in the lower-left-hand corner, is that just using some default version of-- FRANCOIS CHOLLET: Yes, and the default is pass-through. If you set this, it enables your layer to use a default [INAUDIBLE] compute_mask, which just says return mask.
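To make those three roles concrete, here is a minimal sketch of a mask-creating, a mask-passing, and a mask-consuming layer (illustrative class names, not actual Keras classes):

```python
import tensorflow as tf

class MaskMaker(tf.keras.layers.Layer):
    """Sketch of a mask-creating layer, like Embedding(mask_zero=True)."""

    def call(self, inputs):
        # Stand-in for an embedding lookup.
        return tf.one_hot(inputs, depth=8)

    def compute_mask(self, inputs, mask=None):
        # Generate a Boolean mask from the input values.
        return tf.not_equal(inputs, 0)

class MaskPassThrough(tf.keras.layers.Layer):
    """Sketch of a layer that does not affect the time dimension."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Opt in to mask propagation. The inherited default
        # compute_mask(inputs, mask) just returns the mask.
        self.supports_masking = True

    def call(self, inputs):
        return inputs * 2.0

class MaskConsumer(tf.keras.layers.Layer):
    """Sketch of a mask-consuming layer, like an LSTM."""

    def call(self, inputs, mask=None):
        # mask is a Boolean tensor with one entry per timestep per
        # sample, typically of shape (batch_size, timesteps).
        if mask is not None:
            inputs *= tf.cast(mask, inputs.dtype)[..., tf.newaxis]
        return tf.reduce_sum(inputs, axis=1)
```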
So it gets the inputs and a mask, and just returns the mask unchanged. AUDIENCE: So that assumes that the mask is, like, the first or second dimension gets masked? FRANCOIS CHOLLET: The first dimension gets masked, if zero is the batch dimension. AUDIENCE: And then where does this mask argument come from? Like, if I look at the previous slide, it's not clear to me at all how this mask is being [INAUDIBLE]. FRANCOIS CHOLLET: So it is generated by the Embedding layer from the values of the integer inputs. AUDIENCE: Right, so the Embedding layer has a compute_mask function? FRANCOIS CHOLLET: Yeah, which is exactly this one, actually. AUDIENCE: And it returns a mask. FRANCOIS CHOLLET: Yes. AUDIENCE: So somehow the infrastructure knows to call-- because you enabled masking, it knows to call the compute_mask [INAUDIBLE]. FRANCOIS CHOLLET: Yes. AUDIENCE: The mask gets generated, but I don't know where it gets put. FRANCOIS CHOLLET: Where it gets put-- so that's actually something we're going to see in the next slide, which is a deep dive into what happens. When you're in the functional API, you have some inputs you've created with the Keras Input call. And now you're calling a layer on that. Well, the first thing we do is check whether all the inputs are actually symbolic inputs, like coming from this Input call. Because there's two ways you could use a layer. You could call it on actual value tensors, like EagerTensors, in which case you're just going to run the layer like a function and return the outputs. Or you could call it symbolically, which is what happens in the functional API. Then you run pretty extensive checks on the shape and the type of your inputs, to raise helpful error messages in case of a mistake made by the user. Then you check if the layer is built. So the layer being built means its weights are already created. If the layer was not built, you're going to use the shape of the inputs to build the layer, so you call the build method. And after you've done that, you'll actually do a second round of input compatibility checks, because the input spec of the layer is quite likely to have changed during the build process. For instance, if you have a Dense layer, when you instantiate it, before it knows its input shape, its input spec is just the fact that its inputs should have rank at least two. But after you've built the layer, you have an additional restriction, which is that now the last dimension of the inputs should have a specific value. Then the next step is you're going to check if this layer expects a mask argument. And if it does, you're going to be fetching the masks generated by the parent layers, which have this compute_mask method. AUDIENCE: So-- FRANCOIS CHOLLET: And-- AUDIENCE: And so where does that come from? I mean, like, somehow there is some secret communication. FRANCOIS CHOLLET: Yes, which is a bit of metadata set on the tensor itself. There's a _keras_mask property, which is the mask information. So it is the least error-prone to co-locate the mask information with the tensor that it refers to. AUDIENCE: So if you were to do something like-- I don't know, like a [INAUDIBLE] or something that's not a layer-- FRANCOIS CHOLLET: Yes? AUDIENCE: --you get a tensor that doesn't have a _keras_mask attribute. But then you say-- but I guess you could also wrap it into a Lambda layer.
FRANCOIS CHOLLET: So what happens when you call ops that are not layers is that they get retroactively cast into layers, essentially. Like, we construct objects that are layers, and they are going to be internally calling these ops. But these automatically generated layers, in the general case, do not support masking. So if you do this, you are destroying the mask, and you're going to be passing your mask to a layer that does not support it, which is an error. So it's not going to-- AUDIENCE: It's not a silent-- FRANCOIS CHOLLET: It's not a silent failure. It's an error. If you pass a mask to a layer which is one of these automatic layers, in this case, that does not support masking, it's going to yell at you. AUDIENCE: So, but wait a minute. I feel like lots of layers are going to be, like, trivial pass-throughs. Like, if there's a mask, we want to pass it through, but if there's not a mask, that's fine. FRANCOIS CHOLLET: Yeah. So it has to be opt-in, again, because any change to the time dimension of the inputs would need a smarter mask computation. And we cannot just always implicitly pass through the mask, because you don't actually know what the layer is doing. AUDIENCE: Couldn't you implicitly pass through the mask if the shape of the outputs is the shape of the inputs? AUDIENCE: But what about something like a [INAUDIBLE]? AUDIENCE: That has the same shape. That should actually respect the mask. FRANCOIS CHOLLET: That-- I think it's a reasonable default behavior. It's not going to work all the time, actually. You can think of adversarial counterexamples. AUDIENCE: [INAUDIBLE] FRANCOIS CHOLLET: But they're not, like, common counterexamples. But yeah. So currently, masking with the functional API is something that's pretty much only useful with built-in layers. So it's not really an issue we've run into before. I like the fact that currently it's opt-in, because this actually saves us from precisely people generating a mask in some layer and then passing it to a custom layer or an automatically generated layer that does not support it. And that could potentially do things you don't expect, right? So it's better to-- AUDIENCE: And is the mask supported with eager execution? FRANCOIS CHOLLET: Yes, it is. So in eager execution, essentially what happens is that the call method, which is programmatically generated for one of these functional API models, is basically going to call both [INAUDIBLE] the call method of each layer and its compute_mask method, and is going to call the next layer with these arguments. So essentially, very much what you would be doing if you were to use masking with subclassing, which is basically: you call your sublayer, you get the outputs. I don't have a good example in here. So you call a layer, get its outputs. Then you generate the mask using compute_mask, explicitly. And for the next layer, you're going to be explicitly passing these [INAUDIBLE]. AUDIENCE: And is it just the keyword argument mask equals? FRANCOIS CHOLLET: Yes, that's right. AUDIENCE: So all these layers [INAUDIBLE]. So if I feel like a mask is going to [INAUDIBLE], they can always fix it by passing an explicit one? FRANCOIS CHOLLET: Yes, that's correct. You can pass arbitrary Boolean tensors-- well, 2D Boolean tensors-- as the mask argument. So, in general, if you are doing subclassing, nothing is implicit, and you have freedom to do whatever you want, basically. AUDIENCE: Question about masking in the loss-- if you have a [INAUDIBLE] [? static ?]
case, like a 3D tensor, or trying to [INAUDIBLE] data, is the mask propagated to the-- FRANCOIS CHOLLET: Yes, that's right. So in this example, for instance, our sequence reduction, the LSTM layer, would have been destroying the mask. But if you were to remove these last two layers and just say this [INAUDIBLE] layer that returns sequences is your last layer, and then you apply a loss function that is sequence-aware, then yes. The model, during fit, is going to generate a sample_weight argument for the loss, which will incorporate the mask, meaning that any timestep that is masked is going to receive a per-timestep sample weight of 0. So yes. So if you have something like, for instance, per-timestep classification, and you have timesteps that are padded and masked, they will be completely ignored by the loss. So, resuming: what happens when you call a layer on symbolic inputs? So after you've retrieved the mask, you've run the input compatibility checks, you've optionally built the layer, you're going to build a graph for the operations done by this layer. And here there are two possibilities. Either the layer can be converted to a graph, which is the case 99% of the time. In that case, you're going to use autograph to generate an autographed version of the call method. This is useful in case the layer implementer is using [INAUDIBLE] control flow, like if statements and for loops, and so on. And then you're going to be calling this autographed version of call on your symbolic inputs. And in this call, you're going to be incorporating the mask argument, if it's present, and the training argument, if applicable. If the user has not passed an explicit training Boolean argument in this layer call, then it will default to the Keras learning phase tensor, which is a global symbolic tensor whose value we set when you call fit or evaluate. Then there are cases when the layer is declared to be dynamic, meaning that it cannot be converted to a graph. This is the case, for instance, for a Tree-LSTM layer and so on. So it's very niche cases. In that case, you would expect the layer implementer to have implemented static shape inference-- the compute_output_shape method. It takes input shapes and returns output shapes for this layer. And you're going to use this static shape inference method to generate new symbolic tensors with the correct shape and dtype. And you're going to be returning that. So once you've created the subgraph corresponding to this layer, you're going to create a node object, which is, again, a node in the graph of layer objects that the functional API builds. And you're going to set metadata on the outputs corresponding to the node-- so essentially the layer call that created these outputs-- and also mask metadata. So that's it. It's actually a lot of features in one. So we talked about dynamic layers-- layers that can be graphed and layers that cannot be graphed. So last week we saw an example of a very simple batch normalization layer implemented with subclassing. And in the call method, you had this if training conditional. And in training, you're just going to compute the mean and variance of the current batch. You're going to normalize the current batch with its own statistics. And you're going to be updating the moving variance and moving mean of your layer. In inference mode, however, you're going to be normalizing the current batch with the moving mean and moving variance.
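The slide itself isn't reproduced here, but a simplified version of the kind of layer being described might look like this (a sketch; the real BatchNormalization layer is considerably more involved):

```python
import tensorflow as tf

class SimpleBatchNorm(tf.keras.layers.Layer):
    """Sketch of the simplified batch normalization described above."""

    def __init__(self, momentum=0.99, epsilon=1e-3, **kwargs):
        super().__init__(**kwargs)
        self.momentum = momentum
        self.epsilon = epsilon

    def build(self, input_shape):
        dim = input_shape[-1]
        self.moving_mean = self.add_weight(
            name="moving_mean", shape=(dim,),
            initializer="zeros", trainable=False)
        self.moving_variance = self.add_weight(
            name="moving_variance", shape=(dim,),
            initializer="ones", trainable=False)

    def call(self, inputs, training=False):
        if training:
            # Normalize the batch with its own statistics and
            # update the moving averages (assignments in a branch
            # of the conditional -- the subtlety discussed below).
            mean, variance = tf.nn.moments(inputs, axes=[0])
            self.moving_mean.assign(
                self.momentum * self.moving_mean
                + (1 - self.momentum) * mean)
            self.moving_variance.assign(
                self.momentum * self.moving_variance
                + (1 - self.momentum) * variance)
        else:
            # Normalize with the accumulated moving statistics.
            mean, variance = self.moving_mean, self.moving_variance
        return (inputs - mean) / tf.sqrt(variance + self.epsilon)
```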
So there's actually a small subtlety with this layer, which makes it, unfortunately, ungraphable, which is that it's doing assignments of variables-- so updates-- in a branch of the conditional. This would be relatively easy to avoid-- AUDIENCE: Why is that a problem? FRANCOIS CHOLLET: So it's a problem because each branch in TensorFlow control flow v2 is going to be run in a different FuncGraph. And so-- AUDIENCE: [INAUDIBLE] control flow v2 is perfectly fine as is. AUDIENCE: It's because you only have an assign in one branch. If you had a corresponding [INAUDIBLE]-- AUDIENCE: No, no. Look at how [? the assign-- ?] AUDIENCE: There might be a bug, but it's a bug, if this-- AUDIENCE: Yeah. FRANCOIS CHOLLET: So, anyway, yeah, it-- AUDIENCE: It sounds-- FRANCOIS CHOLLET: --it's a bug. So-- AUDIENCE: If you replace the assign [INAUDIBLE] with add_update, then you have a problem, because you could be looking at the assign from a control flow branch. But assign is safe. It's [INAUDIBLE]-- FRANCOIS CHOLLET: So last time I tried, this wasn't working. If you moved the assign outside of the if statement, it would be working. So, but anyway, let's say you have a Tree-LSTM layer. You cannot graph it. What do you do? In the constructor, when you call super, you pass dynamic=True [INAUDIBLE]. This tells the framework that this layer cannot be graphed, and when using the functional API, it's never going to be used to build a graph. And when you call fit or evaluate, we are always going to run everything eagerly, even if you don't explicitly set run_eagerly [INAUDIBLE]. One thing to keep in mind if you have dynamic layers is that if you want to use them in the functional API, you will have to implement a compute_output_shape method to tell the framework how to do static shape inference with this layer. If you cannot do static shape inference, you cannot use your layer in the functional API, unfortunately. So if you try to use your layer in a symbolic way, it's going to [INAUDIBLE]. So that's it. And yeah. And so when you call fit, it's automatically going to be run eagerly. So of course, if you're not using the functional API and you're not using fit, then this dynamic argument is irrelevant to you. So let's talk a little bit about training and inference. So as we said last week, compile is basically about configuring the training procedure. So it's all about building, essentially, an execution function, an execution graph. Meanwhile, fit is about running this execution function with new data over a dataset. So what happens when you call compile? Of course, checking the values of the arguments passed by the user, because it's always important to do sanity checks and raise good error messages if the user has made a mistake. Then we are going to look at the loss argument. We're going to map it to the different outputs. And not all outputs might have a loss. You're going to do the same for metrics as well. You're going to compute the total loss, which is the sum of the per-output losses and any loss that's been added during the forward pass. You're going to [INAUDIBLE] the trainable weights of the model, and use the total loss and trainable weights and the optimizer that you passed to compute gradients. Then you're going to prepare the inputs, outputs, and updates for the different execution functions, which are basically FuncGraphs. We have three different execution functions.
You have the train function, which takes input data and input targets, and is going to run the backprop updates and the forward pass updates, and is going to return the loss and metric values. Then you have the eval function, which does the same, except without running any updates. So it's not doing backprop [INAUDIBLE]. It just takes input data and targets, and it returns loss and metrics. No updates. And finally, you have the predict function, which is like eval, except with different outputs. So it's not going to run any updates. It's taking the input data, and it's going to be returning the outputs of the model. It's not going to be returning the loss or the metrics. And the way these execution functions are implemented in TensorFlow 2 is as FuncGraphs. When you are calling a layer symbolically, so in the functional API, that happens in a global Keras graph. And when you're creating an execution function, you're going to be creating a new scratch FuncGraph, and you're going to be doing a copy of the relevant subgraph from the global Keras graph to your new graph. And then it's this copy that you're going to be running. So if you're creating different models, the execution functions are all separate, living in different FuncGraphs. AUDIENCE: So here we have-- so it seems like there's some weird things where-- like, for example, an optimizer is not used for eval and predict? What happens if you-- can you just not set an optimizer? FRANCOIS CHOLLET: So, yes, you can just not do compile, and then do predict. That works, because you don't need the compile information in order to run predict. However, you do need to compile your model if you want to run eval, because eval needs to compute loss and metrics, which are passed in compile. AUDIENCE: Well, I guess-- but what if you don't-- like, you're only doing eval. Do you need an optimizer? FRANCOIS CHOLLET: Technically, yes. I think it's possible you could pass None as the optimizer, and that would work. I don't recall. AUDIENCE: I think it's required. FRANCOIS CHOLLET: It's required? Yes, so in that case you can just pass a default optimizer, like the string 'sgd'. It's going to work fine. AUDIENCE: OK, but I guess then it's going to create a bunch of graphs that I'm never going to use, but-- FRANCOIS CHOLLET: So the execution functions are actually created lazily. So they're not actually created in compile. It's a good mental model to think of them as created in compile, if you want, but actually-- for instance, it's when you call fit that the train function is going to be created. It's when you call evaluate that the eval function is going to be created, and so on. So if you just instantiate your model, compile it, and call evaluate, you're not going to be creating any graph [INAUDIBLE] the optimizer, because you're only creating the eval function, which does not [INAUDIBLE]. AUDIENCE: I think it'd be [INAUDIBLE] optimizer equal to None, specifically. FRANCOIS CHOLLET: But it's kind of a niche use case. So anyway, you can definitely instantiate a model and call predict without having called compile, and that totally works. AUDIENCE: Because-- are any of the arguments from compile used for-- FRANCOIS CHOLLET: Are useful [INAUDIBLE]. That's right. AUDIENCE: OK. FRANCOIS CHOLLET: Because predict is just the input-to-output mapping. AUDIENCE: But aren't outputs also prepared in compile? So just prepared [INAUDIBLE]? [INAUDIBLE] predict?
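As a concrete illustration of this compile/predict discussion, a small usage sketch (the model and data here are stand-ins):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
x = np.random.random((32, 4)).astype("float32")
y = np.random.random((32, 1)).astype("float32")

# predict works without compile: it's just the input-to-output mapping.
preds = model.predict(x)

# fit and evaluate need compile, since loss/metrics/optimizer live there.
model.compile(optimizer="sgd", loss="mse", metrics=["mae"])
model.fit(x, y, epochs=1)        # lazily builds the train function
print(model.evaluate(x, y))      # lazily builds the eval function
```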
AUDIENCE: [INAUDIBLE] when should people-- let's say they're using model [INAUDIBLE] predict versus just calling the model, and then evaluate. FRANCOIS CHOLLET: So model.predict is going to iterate over the data you passed in mini-batches, and is going to return numpy arrays. Meanwhile, calling the model on an EagerTensor is the same as calling a layer, so it returns, directly, this value. If you have a single batch, there is no difference. AUDIENCE: In a lot of cases, if you call the model on something, it goes through the eager path, whereas if you call predict, it goes through this function path. And so if you're sensitive, essentially, to graph execution versus eager time, predict can be much faster. FRANCOIS CHOLLET: Yeah. But in terms of the end output, if you have a single batch, there's no difference. But the big difference is that to predict, you could pass, for instance, a dataset, right? Which you cannot pass in call. Yeah. So what happens now when you call fit? There's pretty extensive checking of the user-provided data, as usual-- checking that the correct data is being passed, that the shapes and ranks are correct, and so on. Then, optionally, we can set aside the validation split. That's only possible if the input data is numpy arrays or EagerTensors. That's not possible if you pass a dataset or a Python generator. Then we prepare the callbacks. So importantly, everything that happens dynamically during training, apart from executing the graph functions, is structured as a callback. In particular, the logging that we do internally and the display that we do of the progress bar-- these are all callbacks. And if the user also wants to do some action dynamically during training, they have to implement a callback. AUDIENCE: But how often are callbacks called? FRANCOIS CHOLLET: So they're called at different points during training. So a callback implements the methods on_train_begin and on_train_end, which are called at the very beginning before training and after training is over. Then for each epoch, you have the methods on_epoch_begin and on_epoch_end, called before the start of an epoch and after the epoch is finished. And finally, there are batch-level methods: on_batch_begin and on_batch_end. AUDIENCE: But is the infrastructure able to know whether a particular callback implements on_batch_begin or on_batch_end, to avoid maybe a per-batch overhead? FRANCOIS CHOLLET: So if the method is not implemented, there is virtually no overhead to calling it. And we need to call it anyway, for things like the [INAUDIBLE] aggregator and so on. AUDIENCE: For what? FRANCOIS CHOLLET: The logging. The logging callbacks, basically. AUDIENCE: I guess in cases where we're trying to put the whole training on a device, for example, or we're trying to do, say, remote execution-- FRANCOIS CHOLLET: Yeah. AUDIENCE: --where per-batch execution might be-- FRANCOIS CHOLLET: Yeah. So one thing we've considered-- this is a topic that has come up before with TPUs and so on. One thing we've considered is having a global-- well, an argument in fit, or something, that specifies how often batch-level callbacks should be called-- like, for instance, every 30 batches, and so on. AUDIENCE: So if a batch callback is only called every 30 batches, or something like that, is that going to-- I mean, how does it work? Are callbacks expecting to see something every batch? Are they going to still work if they're called every 30 batches? FRANCOIS CHOLLET: Typically, that's still going to work.
The way it's going to work is that, from the perspective of your callback, it's going to be the same as if your batches were 30 times bigger. AUDIENCE: So the batch callback could be called [INAUDIBLE] batches? AUDIENCE: Is it-- FRANCOIS CHOLLET: Yes. AUDIENCE: Does it get the output? AUDIENCE: When? FRANCOIS CHOLLET: This is speculative API, right? We don't have this implemented, but it is one way we've considered handling the fact that you're probably going to be wanting, at some point, to do processing of multiple batches at once on device, with no contact with the host. AUDIENCE: So I guess, could you just tell us what the on_batch_begin arguments are? Like, what sort of information is passed at this point? FRANCOIS CHOLLET: Right, so it receives a logs dictionary. It receives the index of the batch as well. The logs dictionary contains the current value of the total loss and the metrics. AUDIENCE: So that's the loss from the last batch, or the loss [INAUDIBLE]? FRANCOIS CHOLLET: So that's the loss from the last batch, and for the other metrics, the moving value of the metrics. AUDIENCE: But even if we change the callbacks, I think-- as long as the loop itself is in Python, then it doesn't really help, right? Even if you're not calling a callback, as long as the loop is still in Python, it doesn't really-- AUDIENCE: [INAUDIBLE] try to turn the loop into a tf.function? AUDIENCE: Yeah, I think that would be required as well. AUDIENCE: [INAUDIBLE] the expectation is that the callbacks are operating on numpy [INAUDIBLE], not tensors. AUDIENCE: And I think we'd need to change the APIs so that the callbacks are operating [INAUDIBLE] so that would work. AUDIENCE: I mean, I think in a perfect world, we would use a mixture of Python and py_function to essentially only run in Python the parts that we want, while still keeping the outer part. But I mean, we're not there yet. AUDIENCE: And since callbacks are passed down through this, can I rely on the sequence in which they're passed? Is that the sequence in which they're going to be called? FRANCOIS CHOLLET: Yes, so the sequence in which they are called matches the order in which they're passed in the callbacks list that you pass to fit, meaning that it is possible for one of your callbacks to add some values to the logs dictionary. It's the same logs dictionary that's going to be seen by the callbacks after that. So it's possible to do cascading processing. AUDIENCE: And the default callbacks, the progress bar and stuff, are called at the end? FRANCOIS CHOLLET: The progress bar and logging stuff is called at the very end, meaning that if you add stuff to the logs dictionary, it's going to be displayed. But the very first callback that's passed is actually the callback that starts populating the logs. AUDIENCE: There's also a slight bit of nuance in that the model itself is set as an attribute on callbacks, and so any attempt to, essentially, optimize this or put this in a tf.function scope would have to be aware of the fact that it's a legitimate use case for a callback to start accessing arbitrary parts of the model. AUDIENCE: Right. I guess my question was more about running callbacks less frequently, not about trying to run them inside a function. FRANCOIS CHOLLET: OK. So our last topic is losses and metrics. Losses are very simple. They're just subclasses of this Loss base class. They just have to implement one method, which is call, just like layers. The signature is different, though.
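A minimal sketch of such a Loss subclass, assuming the tf.keras.losses.Loss base class:

```python
import tensorflow as tf

class MeanSquaredLoss(tf.keras.losses.Loss):
    """Sketch of a custom loss: implement call, just like a layer."""

    def call(self, y_true, y_pred):
        # Convention: return one loss value per sample. Reduce over
        # the last axis, but do not return a scalar.
        return tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)
```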
Yeah, the signature is [INAUDIBLE] y_true, y_pred. Optionally, you can also have sample weights. Yeah. And you just return a loss. So importantly, as a convention, the loss returned should be one loss value per sample. So here, for instance, we are reducing over the last axis, but we are not returning a scalar. Metrics are a bit more involved. There are three different methods you should be implementing in your metrics. You have this update_state method, which is very similar to the call method for losses-- so y_true, y_pred, sample_weight in the signature. And the difference is that you're not returning a value. You're just updating the internal state of your metric. So in this case, your internal state is just this one scalar weight called true_positives. So here, you are just adding to this value. Then the second method you have to implement is result, which, as the name indicates, just returns the current value of this metric. And finally, you need a method to reinitialize the state of your metric. And what happens, for instance, when you call fit is that at every batch, we're going to be updating the state of the metrics, by calling this update_state method. When you want to report a value-- which, typically, is also after every batch-- we call this result method. And at the end of an epoch, we want to reset the metrics, so we call reset_states. And the way you specify metrics in the compile API is basically you just have this metrics argument in compile, which takes a list of these metric instances. So what if you have metrics with signatures or requirements that do not match this API-- the assumption that you can update the state based on y_true, y_pred, and sample weights, for instance? One thing you can do is write a layer that will, inside its call method, call self.add_metric on some tensor. And that enables you-- there are two arguments to pass. A name, because that's what's going to be reported with the progress bar, in your logs, and so on. And then an aggregation argument, which tells the framework how the-- so you can assume that this is basically called for every batch. So how do these different values for every batch get aggregated into a single scalar, right? And then, in the functional API, you can use this layer like this. You insert it at some point, and it just returns its input tensors unchanged, which you can keep using. And when you do that, you get a model with a forward pass that's going to be calling this add_metric at every batch. AUDIENCE: Are there any standard, predefined versions of metric log layers? FRANCOIS CHOLLET: No, but it looks just like this. So there's nothing special you have to do, apart from implementing call with add_metric in it. AUDIENCE: How does that work in eager execution? FRANCOIS CHOLLET: In eager execution, this is literally called at every batch. AUDIENCE: So does it add a new metric at every batch? Or if two people created metrics at different-- with different metrics, the same name? Can you catch that? AUDIENCE: We keep track of the metric by name, and just call the same metric again. So-- AUDIENCE: Actually, so if you have two layers and they call add_metric, and they both use the same name, then you're going to have a collision? AUDIENCE: We raise an error in that case. AUDIENCE: Oh, so like-- AUDIENCE: We can detect that. AUDIENCE: You've just detected the layer? But if the layer has two metrics of the same name-- AUDIENCE: Same-- AUDIENCE: --you assume that's intentional?
FRANCOIS CHOLLET: So essentially, in one forward pass, you cannot have the same name twice. But across different forward passes-- so across different calls of your metric-- AUDIENCE: [INAUDIBLE] FRANCOIS CHOLLET: --of your model, you're going to be aggregating the metrics based on the name. AUDIENCE: So there is some state that gets reset? FRANCOIS CHOLLET: Yes, that's right. AUDIENCE: So is this similar to the thing where layers know if they're being called by another layer? FRANCOIS CHOLLET: Yes, it's very similar. It's basically a call context. And at the end of the call context, you reset the state. So you reset the losses that were created during the forward pass, and you reset the state of this metric aggregation thing. Right. But what if this is not actually enough for you, because you want a metric that not only sees the inputs of your model, but also the targets? Which is actually relatively common. One thing you could use-- so, of course, if you're just doing model subclassing or writing your own custom training loops, you have no restrictions whatsoever, so this is not relevant to you. But what if you really want to use fit with these very arbitrary metrics? Well, one thing you can do is the endpoint layer pattern. So how does it work? It's basically a layer that, in the functional API, you would put at the very end of your model, so it computes predictions, right? And it's going to take as input whatever you want. In our case, it's going to take the targets of the model and the logits generated by this Dense layer here. And the targets are an input, right, created here. And then what is it going to do with inputs and targets? It's going to compute a loss value-- it returns a scalar in this case, because we pass it to this [INAUDIBLE] call, which is different from the plain call method. So it's automatically reduced in this case. You can add whatever metrics you want. Note that if you use the training argument here, you could be logging different metrics in training and inference. And finally, you return what you would want the predict method to return-- so a softmax. AUDIENCE: Well, it seems to me like you can't use this with inference at all. FRANCOIS CHOLLET: So in inference, you would be using the same layer, for instance. But what you can do-- you have to rewire your model, because you're not going to have these targets inputs. But you can use the same layer, and the only thing you need to be mindful of is that you should do a conditional check on the inputs that were passed, to see if there is a targets key in there or not. Or you could rely on the training argument. That also works. AUDIENCE: So you're saying that if you create this model like you show it in there, and then I call model.predict on it-- FRANCOIS CHOLLET: So if you just reuse this model-- this is a model that's already functioning. If you reuse it for predict, you're going to have to pass some value-- some dummy value-- for the targets. So if you want to not have to pass a dummy value for targets, you need to define a new model that is going to be using the same layers. The only difference is that it's not going to instantiate this targets Input object, and it's not going to be passing the targets to the LogisticEndpoint. And the LogisticEndpoint is going to ignore-- it's not going to attempt to access the targets [? key. ?] Yeah. So when you instantiate this model like that, you state that it starts from inputs and targets, and it returns predictions.
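A sketch of the endpoint layer pattern as described (the LogisticEndpoint name and the "inputs"/"targets" names follow the talk; the specific loss and metric are illustrative):

```python
import tensorflow as tf

class LogisticEndpoint(tf.keras.layers.Layer):
    """Sketch: computes the loss and metrics inside the model."""

    def __init__(self, name=None):
        super().__init__(name=name)
        self.loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)

    def call(self, logits, targets=None):
        probs = tf.nn.sigmoid(logits)
        if targets is not None:
            # add_loss registers a scalar loss with the model.
            self.add_loss(self.loss_fn(targets, logits))
            # add_metric takes a name and an aggregation argument.
            self.add_metric(
                tf.keras.metrics.binary_accuracy(targets, probs),
                name="accuracy", aggregation="mean")
        # Return what predict should return.
        return probs

inputs = tf.keras.Input(shape=(3,), name="inputs")
targets = tf.keras.Input(shape=(1,), name="targets")
logits = tf.keras.layers.Dense(1)(inputs)
predictions = LogisticEndpoint()(logits, targets)
model = tf.keras.Model([inputs, targets], predictions)

x = tf.random.normal((8, 3))
y = tf.cast(tf.random.uniform((8, 1)) > 0.5, "float32")
# The loss comes from add_loss, so compile only needs the optimizer.
model.compile(optimizer="sgd")
model.fit({"inputs": x, "targets": y}, epochs=1)
```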
And when you fit it, you fit it with a dictionary, or a dataset that returns a dictionary, and you include the targets data in that dictionary. So like this, right? And when you compile it, you are not going to specify any loss in compile, because the loss is added entirely inside that endpoint layer. So you just specify the optimizer. AUDIENCE: So is the name in, like-- when you define the Input for targets, is the name in there supposed to match the key in the dictionary? FRANCOIS CHOLLET: Yes. So when you call this layer, you're passing this dict. This dict is-- AUDIENCE: No, not the LogisticEndpoint. I mean the Input layer. How does it [INAUDIBLE] know that inputs is the inputs and targets is the targets, instead of the other way around? FRANCOIS CHOLLET: So, OK. So it's a bit confusing because we have a naming collision here, but in your data dictionary, the keys are supposed to match the names that you give to your inputs. AUDIENCE: OK. FRANCOIS CHOLLET: So here, you're creating inputs that are named "inputs", and you create targets that are named "targets". And this is what these keys here are referring to. So yeah, to avoid confusion, we could have made another choice of names for the inputs to the LogisticEndpoint layer. AUDIENCE: Question. FRANCOIS CHOLLET: Yeah. AUDIENCE: Is it possible-- like, in here, you are declaring the loss and adding the loss at the layer level. Is it possible to just declare the loss, and then use the loss as add_metric, and then don't do softmax, and then just return y_true and y_pred, just pass these through? Like, later on, you still can declare a loss at the model level? AUDIENCE: So that-- I don't think there's a way to tell Keras that one of these outputs of the model is the labels. FRANCOIS CHOLLET: Yeah. So if you want to pass your loss in compile, it means that the targets are going to be passed separately in fit. And it means that your loss should match the signature y_true, y_pred, sample_weight. So if you have anything that's outside of this template, you actually do need to use add_loss and to use this pattern, which is very general. Like, there's literally nothing you cannot implement with this pattern. And again, the purpose of this pattern is to make it possible to use fit-- the plain call to fit-- with complicated loss and metric setups. Yeah. So what if you have a use case that's even more complicated, and you still want to use fit? Well, the answer is: don't, because fit is really this built-in thing that's not meant for every use case. It's meant for common use cases. If you need more flexibility than what's provided by this endpoint pattern, for instance, you should really write your own loops, which is not difficult at all. So that's it for the overview of Keras internals. And I see we actually spent, like, 45 minutes. [LAUGHTER] We did. So it's nice that we had the two full sessions. Yeah, so thank you very much for listening and for the interesting questions. AUDIENCE: Thank you. This was very informative. AUDIENCE: Yeah, thank you. [APPLAUSE] AUDIENCE: So could you highlight-- you talked a lot about how everything works in functional models. FRANCOIS CHOLLET: Yeah. AUDIENCE: Could you highlight just any differences with subclass models or sequential models or anything like that? Because I know for sequential models, for example-- FRANCOIS CHOLLET: So-- AUDIENCE: --build happens at a different point than-- FRANCOIS CHOLLET: Yes.
So I covered writing things from scratch towards the beginning, but basically, there is very, very little to know about it, because everything is explicit, and you are in charge of doing everything. So because of that, there's not much happening under the hood. There is virtually no hood. But yeah, basically the first 10 slides in this presentation covered pretty much everything you need to know. AUDIENCE: I guess it felt like it covered the difference in [INAUDIBLE] perspective, but is there anything-- like, things that we might expect to be true about models, like that they're already built by the time they get to fit, or something like that, that may not always be true for a sequential model [? or something? ?] FRANCOIS CHOLLET: So if you're using a subclass model and you're using fit, one thing to keep in mind, actually, is that when you call fit, the model is not built. So the framework is going to be looking at the input data you pass to fit, and it's going to assume that you made no mistake-- that the model expects exactly that structure of input data-- and it's going to use that to build the model. AUDIENCE: OK. So it calls build? AUDIENCE: So build happens in fit in, for example, subclass-- AUDIENCE: And-- AUDIENCE: --models? FRANCOIS CHOLLET: So that's only if you're using a subclass model plus fit. If you're using a subclass model plus a custom training loop like, yeah, this one, for instance, there's really nothing special in it, you know? AUDIENCE: OK. AUDIENCE: Is the role of train_on_batch, test_on_batch, and predict_on_batch related to the last slide, where you said that if you do something complicated, write your own loops? FRANCOIS CHOLLET: Yeah. AUDIENCE: Is that why they exist? FRANCOIS CHOLLET: Yes. So these are ways to run the train, eval, and predict execution functions for a single batch. And they are useful if you want to customize your training loop, but you don't actually need the level of flexibility that's provided by the GradientTape, for instance. Essentially, you're going to be using that if you're doing GANs, for instance, or if you're doing reinforcement learning. Manually computing the gradients with the GradientTape and so on is mostly useful if either you want to understand every step of what you're doing and not delegate anything to the framework, or if you need to do some manipulation on the gradients themselves-- like some form of gradient normalization that's not covered by the optimizer API-- or you want, maybe, to actually have your gradients be generated by a different model. Yeah. These types of advanced use cases involve manipulating the gradients. AUDIENCE: Could you go back to the slide where you showed the subclass model? FRANCOIS CHOLLET: Which one? This subclass model? Let's see. This one? AUDIENCE: Sure. So this doesn't actually define build? FRANCOIS CHOLLET: So you don't need to, because build would only be there to create variables, for a model that needs to know about the shape of the inputs in order to create them. If you don't need to know about the shape of the inputs, you can just create them in the constructor. And in this case, this model has no variables of its own. The only variables of this model come from the underlying layers. And when you call the model for the first time, which is going to happen in fit, you're going to be calling this call method, which in turn is going to be calling the underlying layers. And when they are called, internally they're going to be calling their build method.
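The subclass model under discussion is along these lines (a minimal sketch; the slide's exact MLP isn't reproduced here):

```python
import tensorflow as tf

class MLP(tf.keras.Model):
    """Sketch of a subclass model with no variables of its own."""

    def __init__(self):
        super().__init__()
        # No build method needed: all variables live in the sublayers,
        # which build themselves the first time they are called.
        self.dense1 = tf.keras.layers.Dense(32, activation="relu")
        self.dense2 = tf.keras.layers.Dense(1)

    def call(self, inputs):
        return self.dense2(self.dense1(inputs))

mlp = MLP()
# Calling the model once on real data builds every sublayer.
_ = mlp(tf.random.normal((2, 8)))
```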
AUDIENCE: I had this impression that-- Igor and I came up with this case where build was non-trivial, even for a subclass model? AUDIENCE: I think you're referring to, like-- I think there is a place where you call the model with, like, default inputs, if those inputs are of such a type that they can have default values-- like a dummy call-- [INTERPOSING VOICES] AUDIENCE: --and a build graphing? FRANCOIS CHOLLET: Yeah, so that happens specifically when you call fit on a subclass model that has never been built before. In that case, the framework is going to make some assumptions, some inference, based on the data you pass to fit. If you want to avoid that, you can either explicitly implement a build method and call it, or you could call your MLP instance here on an EagerTensor once. And that's going to be sufficient to build the model, because it's going to run this call method. Well, first of all, it's going to call the build method if it's implemented-- here, it doesn't do anything. Then it's going to call the plain call method, and in doing so, it's going to call [INAUDIBLE] call for all the layers. Each layer is going to call build, then call the plain call. AUDIENCE: Do [INAUDIBLE] users use things like add_loss and add_metric even if they're doing a custom training loop? FRANCOIS CHOLLET: Yes, that's totally fine. So using add_loss with custom training loops works like this. After your forward pass, model.losses has been populated. And so you can just add the sum of these losses to your main loss value. For metrics, you're going to have to query model.metrics and look at the names. And so for metrics, if you're writing a custom training loop, it's actually easier to do every step manually. So start by instantiating your metrics outside of the loop. Then inside the loop, for each batch, you call update_state. Then at the end, or at the end of an epoch, you call result, and you log that. add_metric is usable with custom training loops, but it's not very ergonomic. add_loss is very ergonomic with custom training loops. AUDIENCE: Are all the values in model.losses scalars, or per batch? Or-- FRANCOIS CHOLLET: They're all scalars. We don't support non-scalar losses. Correct. AUDIENCE: On the TensorFlow [INAUDIBLE], how many can I access through Keras core layers? And then, [INAUDIBLE] to be developing new core layers, too? FRANCOIS CHOLLET: What would be an example of-- AUDIENCE: For example, this [INAUDIBLE] in TensorFlow. FRANCOIS CHOLLET: Yeah. AUDIENCE: They're not used by Keras? FRANCOIS CHOLLET: So Keras is meant to support a number of sparse operations. For instance, the Dense layer is meant to support SparseTensors. I'm not sure if it's true today, but it's definitely supposed to work. Other layers, maybe-- yeah. I think it's mainly the Dense layer. Maybe Activation is one. But yeah, the large majority of layers-- like LSTM, Conv, pooling, and so on-- are only for dense data. Dense float data.
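Returning to the add_loss and metrics discussion above: a minimal sketch of the custom-training-loop flow that was described (the model, data, and metric choices here are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu",
                          kernel_regularizer="l2"),  # populates model.losses
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.SGD()
loss_fn = tf.keras.losses.MeanSquaredError()
# Instantiate metrics outside of the loop.
mae = tf.keras.metrics.MeanAbsoluteError()

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((64, 8)), tf.random.normal((64, 1)))).batch(16)

for epoch in range(2):
    for x, y in dataset:
        with tf.GradientTape() as tape:
            preds = model(x, training=True)
            loss = loss_fn(y, preds)
            # After the forward pass, model.losses has been populated;
            # add the sum of these losses to the main loss value.
            loss += tf.add_n(model.losses)
        grads = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
        # Inside the loop, update the metric state for each batch.
        mae.update_state(y, preds)
    # At the end of an epoch, read and then reset the metric.
    print(f"epoch {epoch}: mae={mae.result().numpy():.4f}")
    mae.reset_states()
```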