
  • FRANCOIS CHOLLET: So last time, we

  • talked about a bunch of things.

  • We talked about the functional API

  • for building graphs of layers.

  • We talked about features that are specific to the functional

  • API--

  • things like static input compatibility checks

  • across layers every time you call

  • a layer, whole-model saving, model plotting,

  • and visualization.

  • We talked about how masking works in the functional API

  • and about how masks are propagated.

  • So for instance in this example, the Embedding layer is going

  • to be generating a mask here because you passed this

  • argument mask_zero=True.

  • And this mask is going to be passed

  • to every subsequent layer.

  • And in particular, to layers that consume the mask,

  • like this LSTM layer here.

  • However, this LSTM layer, because it's

  • a sequence-reduction layer, is not

  • going to return the full sequence, but only

  • the last output.

  • This is going to destroy the mask,

  • and so the next layer is not going to see a mask anymore.

  • So it's really a way to handle masking

  • that works basically magically.
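
A minimal sketch of the model being described, with made-up layer sizes (the exact numbers from the slide aren't shown here):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(None,), dtype="int32")
# mask_zero=True: the Embedding layer generates a mask from the inputs.
x = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)(inputs)
# The LSTM consumes the mask; since it returns only the last output
# (return_sequences=False), the mask is not propagated past this point.
x = layers.LSTM(32)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
```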

  • Masking is pretty advanced, so most people who need it

  • are not actually going to understand

  • very well how it works.

  • And the idea is to enable them to benefit

  • from masking by just doing, like, hey, this Embedding

  • layer, I want the zeros to mean this is masked,

  • and then everything in their network

  • is going to magically know about this as long as they

  • are using built-in layers.

  • If you're an implementer of layers,

  • you actually need to understand how it works.

  • First of all, you need to understand

  • what you should be doing if you're

  • writing a layer that consumes a mask, like an LSTM layer,

  • for instance.

  • It's very simple.

  • You just make sure you have this mask argument in the signature,

  • and it expects a structure of tensors

  • that's going to match the structure of your inputs.

  • So if your input is a single tensor, it's going to be a single tensor.

  • And the single tensor is going to be a Boolean tensor, where

  • you have one mask entry per timestep per sample.

  • So it's typically a 2D Boolean tensor.
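
A hypothetical mask-consuming layer along these lines (the layer name and pooling behavior are illustrative, not from the slides):

```python
import tensorflow as tf
from tensorflow import keras

class MaskedTemporalMean(keras.layers.Layer):
    """Averages over timesteps, ignoring masked (padded) positions."""

    def call(self, inputs, mask=None):
        # `mask` is a Boolean tensor with one entry per timestep
        # per sample, i.e. shape (batch, timesteps).
        if mask is None:
            return tf.reduce_mean(inputs, axis=1)
        mask = tf.cast(mask, inputs.dtype)[:, :, tf.newaxis]
        total = tf.reduce_sum(inputs * mask, axis=1)
        count = tf.reduce_sum(mask, axis=1)
        return total / tf.maximum(count, 1.0)
```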

  • If you have a layer that can safely pass through a mask,

  • for instance, a dense layer or in general any layer that

  • does not affect the time dimension of its inputs,

  • you can just enable your layer to pass through its mask

  • by setting supports_masking to True.

  • It is opt-in because a lot of the time,

  • layers might be affecting the time dimension

  • of the inputs, in which case the meaning of the mask

  • would be changed.
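
A sketch of opting in to pass-through masking, with a made-up layer that doesn't touch the time dimension:

```python
from tensorflow import keras

class ScaleLayer(keras.layers.Layer):
    """Multiplies its inputs by a constant; the time dimension is
    untouched, so the incoming mask can pass through unchanged."""

    def __init__(self, factor=2.0, **kwargs):
        super().__init__(**kwargs)
        self.factor = factor
        self.supports_masking = True  # opt in to default mask pass-through

    def call(self, inputs):
        return inputs * self.factor
```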

  • And if you do have a layer that changes the time

  • dimension or otherwise a layer that creates a mask from input

  • values, it's going to need to implement this compute_mask

  • method, which receives the inputs and the mask.

  • If, for instance, you have an Embedding layer,

  • it's going to be doing this--

  • tf.not_equal(inputs, 0).

  • So it's going to be using the input values

  • to generate a Boolean mask.

  • If you have a concatenate layer, for instance,

  • it's not going to be looking at the input values,

  • but it needs to look at the masks--

  • the two masks that are being passed--

  • and concatenate them.

  • And if one of the masks is None, for instance,

  • we're going to have to generate a mask of 1's.
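
Sketches of both kinds of compute_mask implementations described here (class names are hypothetical; the built-in Embedding and Concatenate layers do something along these lines):

```python
import tensorflow as tf
from tensorflow import keras

class MaskFromZeros(keras.layers.Layer):
    """Embedding-style: generate a Boolean mask from the input values."""

    def call(self, inputs):
        return inputs  # the actual transformation is omitted here

    def compute_mask(self, inputs, mask=None):
        return tf.not_equal(inputs, 0)

class ConcatWithMask(keras.layers.Layer):
    """Concatenate-style: look at the incoming masks, not the values,
    substituting all-ones for any input whose mask is None."""

    def call(self, inputs):
        return tf.concat(inputs, axis=1)

    def compute_mask(self, inputs, mask=None):
        if mask is None:
            return None
        masks = [m if m is not None
                 else tf.ones_like(x[..., 0], dtype=tf.bool)
                 for x, m in zip(inputs, mask)]
        return tf.concat(masks, axis=1)
```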

  • OK, so that's the very detailed version.

  • Yeah?

  • AUDIENCE: So just maybe a little bit more detail about masking--

  • so if you say supports_masking is true like

  • in the lower-left-hand corner, is that just using some default

  • version of--

  • FRANCOIS CHOLLET: Yes, and the default is pass-through.

  • Yeah, so it is.

  • If you set this, it enables your layer

  • to use a default implementation of compute_mask, which

  • just says return mask.

  • So it gets the inputs and a mask, and just

  • returns the mask unchanged.

  • AUDIENCE: So that assumes that the mask

  • is like the first or second dimension gets masked?

  • FRANCOIS CHOLLET: The first dimension

  • gets masked, if zero is the batch dimension.

  • AUDIENCE: And then where does this mask argument come from?

  • Like if I look at the previous slide,

  • it's not clear to me at all how this mask is being [INAUDIBLE]..

  • FRANCOIS CHOLLET: So it is generated

  • by the Embedding layer from the values of the integer inputs.

  • AUDIENCE: Right, so Embedding layers has a compute_mask

  • function?

  • FRANCOIS CHOLLET: Yeah, which is exactly this one actually.

  • AUDIENCE: And it returns a mask.

  • FRANCOIS CHOLLET: Yes.

  • AUDIENCE: So and somehow, the infrastructure

  • knows to call-- because you enabled masking,

  • it knows to call the compute_mask [INAUDIBLE]..

  • FRANCOIS CHOLLET: Yes.

  • AUDIENCE: The mask gets generated,

  • but I don't know where it gets put.

  • FRANCOIS CHOLLET: Where it gets put--

  • so that's actually something we're

  • going to see in the next slide, which is

  • a deep dive into what happens.

  • When you're in the functional API, you have some inputs.

  • You've created them with the Keras Input call.

  • And now, you're calling a layer on that.

  • Well, the first thing we do is check

  • whether all the inputs are actually symbolic inputs,

  • like coming from this Input call.

  • Because there's two ways you could use a layer.

  • You could call it on the actual value tensors,

  • like EagerTensors, in which case you're just

  • going to run the layer like a function

  • and return the outputs.

  • Or you could call it symbolically,

  • which is what happens in the functional API.

  • Then you run pretty extensive checks

  • about the shape and the type of your inputs

  • to raise helpful error messages

  • in case of a mistake made by the user.

  • Then you check if the layer is built. So the layer being built

  • means its weights are already created.

  • If the layer was not built, you're

  • going to use the shape of the inputs to build the layer,

  • so you call the build method.

  • And after you've done that, you'll

  • actually do a second round of input compatibility checks,

  • because the input spec of the layer

  • is quite likely to have changed during the build process.

  • For instance, if you have a dense layer, when

  • you instantiate it, before it knows its input shape,

  • its input spec is just the fact that its inputs

  • should have rank at least two.

  • But after you've built the layer then

  • you have an additional restriction,

  • which is that now the last dimension of the inputs

  • should have a specific value.
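
A sketch of how a dense-like layer might refine its input spec during build (a simplified illustration, not the built-in Dense implementation):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import InputSpec

class MyDense(keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        # Before build: all we know is "rank at least 2".
        self.input_spec = InputSpec(min_ndim=2)

    def build(self, input_shape):
        last_dim = int(input_shape[-1])
        self.kernel = self.add_weight(
            name="kernel", shape=(last_dim, self.units),
            initializer="glorot_uniform")
        # After build: the last dimension is pinned to a specific value,
        # so the second compatibility check is stricter.
        self.input_spec = InputSpec(min_ndim=2, axes={-1: last_dim})

    def call(self, inputs):
        return tf.matmul(inputs, self.kernel)
```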

  • Then the next step is you are going

  • to check whether this layer expects a mask argument.

  • And if it does, you're going to be fetching the masks generated

  • by the parent layer,

  • which has this compute_mask method.

  • AUDIENCE: So--

  • FRANCOIS CHOLLET: And--

  • AUDIENCE: And so where does that come from?

  • I mean, like, somehow there is a secret communication generated.

  • FRANCOIS CHOLLET: Yes, which

  • is a bit of metadata set on the tensor itself.

  • There's a _keras_mask property,

  • which is the mask information.

  • So it is the least error-prone to co-locate the mask

  • information with the tensor that it refers to.

  • AUDIENCE: So if you were to do something like--

  • I don't know, like a [INAUDIBLE] or something

  • that's not a layer.

  • FRANCOIS CHOLLET: Yes?

  • AUDIENCE: Would-- you-- so you get a tensor that doesn't

  • have a _keras_mask attribute.

  • But then you say-- but I guess you could also wrap it

  • into a lambda layer.

  • FRANCOIS CHOLLET: So what happens

  • when you call ops that are not layers

  • is that they get retroactively cast into layers, essentially.

  • Like, we construct objects that are layers,

  • that they are going to be internally calling these ops.

  • But these automatically generated layers in the general case

  • do not support masking.

  • So if you do this, you are destroying the mask,

  • and you're going to be passing your mask

  • to a layer that does not support it, which is an error.

  • So it's not going to--

  • AUDIENCE: It's not a silent--

  • FRANCOIS CHOLLET: It's not a silent failure.

  • It's an error.

  • If you pass a mask to a layer which

  • is one of these automatic layers,

  • in this case, that does not support masking,

  • it's going to yell at you.

  • AUDIENCE: So, but wait a minute.

  • I feel like lots of layers are going to be

  • like trivial pass-throughs.

  • Like, if there's a mask, we want to pass it through,

  • but if there's not a mask, that's fine.

  • FRANCOIS CHOLLET: Yeah.

  • So it has to be opt in, again, because any change to the time

  • dimension of the inputs would need a smarter mask

  • computation.

  • And we cannot just always implicitly pass through

  • the mask, because you don't actually know what the layer is

  • doing.

  • AUDIENCE: If you-- couldn't you choose to implicitly pass

  • through the mask if the shape of the outputs matches

  • the shape of the input?

  • AUDIENCE: But what about something like a [INAUDIBLE]??

  • AUDIENCE: That has the same mask.

  • That should actually respect the mask.

  • FRANCOIS CHOLLET: That-- I think it's a reasonable default

  • behavior.

  • It's not going to work all the time, actually.

  • You can think of adversarial counterexamples.

  • AUDIENCE: [INAUDIBLE]

  • FRANCOIS CHOLLET: But they're not like,

  • common counterexamples.

  • But yeah.

  • So currently masking with the functional API

  • is something that's pretty much only useful with built-in layers.

  • So it's not really an issue we've run into before.

  • I like the fact that currently it's

  • opt-in, because this actually saves us

  • from precisely people generating a mask in some layer

  • and then passing it to a custom layer

  • or an automatically generated layer that does not support it.

  • And that could potentially do things you don't expect, right?

  • So it's better to--

  • AUDIENCE: And is the mask supported

  • with an eager execution?

  • FRANCOIS CHOLLET: Yes, it is.

  • So in eager execution, essentially

  • what happens is that the call method, which

  • is programmatically generated for one of these functional API

  • models, is basically going to call both

  • the call method of each layer and its compute_mask

  • method, and is going to call the next layer

  • with these arguments.

  • So essentially, very much what you

  • would be doing if you were to use

  • masking using subclassing, which is basically,

  • you call your layer.

  • You get the outputs of your sublayer.

  • I don't have a good example in here.

  • So you call a layer, get these outputs.

  • Then you generate the mask by calling compute_mask explicitly.

  • And for the next layer, you're going to be explicitly passing

  • these [INAUDIBLE].
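
A sketch of explicit mask handling with subclassing, assuming made-up layer sizes:

```python
import tensorflow as tf
from tensorflow import keras

class MaskedClassifier(keras.Model):
    def __init__(self):
        super().__init__()
        self.embed = keras.layers.Embedding(5000, 16, mask_zero=True)
        self.lstm = keras.layers.LSTM(32)
        self.out = keras.layers.Dense(1)

    def call(self, inputs):
        x = self.embed(inputs)
        # Generate the mask explicitly with compute_mask...
        mask = self.embed.compute_mask(inputs)
        # ...and pass it explicitly to the next layer.
        x = self.lstm(x, mask=mask)
        return self.out(x)
```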

  • AUDIENCE: And is it just the keyword argument mask equals?

  • FRANCOIS CHOLLET: Yes, that's right.

  • AUDIENCE: So all these layers [INAUDIBLE]..

  • So if I feel like a mask is going to [INAUDIBLE]

  • they can always fix it by passing an explicit one?

  • FRANCOIS CHOLLET: Yes, that's correct.

  • You can pass arbitrary Boolean tensors-- well, they do have to be

  • Boolean tensors-- as the mask argument.

  • So, in general, if you are doing subclassing,

  • nothing is implicit, and you have

  • freedom to do whatever you want, basically.

  • AUDIENCE: Question about masking in the loss--

  • if you have a sequence output case,

  • like a 3D tensor, or time-distributed data,

  • is the mask propagated to the loss?

  • FRANCOIS CHOLLET: Yes, that's right.

  • So in this example, for instance,

  • our sequence-reduction LSTM layer

  • would have been destroying the mask.

  • But if you were to remove these last two layers

  • and just say this-- this LSTM

  • layer that returns sequences is your last layer,

  • and then you apply a loss function that is sequence-aware, then

  • yes.

  • The model, during fit, is going to generate a sample_weight

  • argument for the loss, which will incorporate the mask,

  • meaning that any timestep that is masked

  • is going to receive a per-timestep sample weight of 0.

  • So yes.

  • So if you have something like, for instance,

  • per time step classification, and you

  • have timesteps that are padded and masked,

  • they will be completely ignored by the loss.

  • So resuming, what happens when you call

  • a layer on symbolic inputs?

  • So after you've retrieved the mask,

  • you've run the input compatibility checks,

  • you've built the layer optionally,

  • you're going to build a graph for the operations done

  • by this layer.

  • And here you have two possibilities.

  • Either the layer can be converted to a graph, which

  • is the case 99% of the time.

  • In that case, you're going to use autograph

  • to generate an autographed version of the call method.

  • This is useful in case the layer implementer is

  • using Python control flow, like if statements and for loops,

  • and so on.

  • And then you're going to be calling this autographed

  • version of call on your symbolic inputs.

  • And in this call, you're going to be incorporating the mask

  • argument, if it's present, and the training

  • argument, if applicable.

  • If the user has not passed an explicit training Boolean

  • argument in this layer call, then it

  • will default to the Keras learning phase tensor, which

  • is a global symbolic tensor whose value we set when

  • you call fit or evaluate.

  • Then there are cases where the layer is declared

  • to be dynamic, meaning that it cannot be converted to a graph.

  • This is the case, for instance, for a Tree-LSTM layer

  • and so on.

  • So these are very niche cases.

  • In that case, you would expect the layer implementer

  • to have implemented static shape inference--

  • the compute_output_shape method.

  • It takes input shapes and returns output shapes for this layer.

  • And you're going to use this static shape inference

  • method to generate new symbolic tensors with the correct shape

  • and dtype.

  • And you're going to be returning that.

  • So once you've created the sub graph corresponding

  • to this layer, you're going to create a node object, which

  • is, again, a node in the graph of layer objects

  • that the functional API builds.

  • And you're going to set metadata on the outputs corresponding

  • to the node, so essentially the layer call that created this--

  • these outputs, and also mask metadata.

  • So that's it.

  • It's actually a lot of features in one.

  • So we talked about dynamic layers,

  • layers that can be graphed and layers that cannot be graphed.

  • So last week we saw an example of a very simple batch

  • normalization layer implemented with subclassing.

  • And in the call method, you had this if training conditional.

  • And in training, you're just going

  • to compute the mean and variance of the current batch.

  • You're going to normalize the current batch

  • with its own statistics.

  • And you're going to be updating the moving variance

  • and moving mean for your layer.

  • In inference mode, however, you're

  • going to be normalizing the current batch with the moving

  • mean and moving variance.
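
A rough sketch of the batch normalization layer being described (simplified, with an assumed momentum; not the built-in implementation):

```python
import tensorflow as tf
from tensorflow import keras

class SimpleBatchNorm(keras.layers.Layer):
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.moving_mean = self.add_weight(
            name="moving_mean", shape=(dim,), initializer="zeros",
            trainable=False)
        self.moving_var = self.add_weight(
            name="moving_var", shape=(dim,), initializer="ones",
            trainable=False)

    def call(self, inputs, training=None):
        if training:
            # Training: normalize the batch with its own statistics.
            mean = tf.reduce_mean(inputs, axis=0)
            var = tf.math.reduce_variance(inputs, axis=0)
            # Variable assignments inside one branch of the conditional:
            # this is the subtlety discussed just below.
            self.moving_mean.assign(0.99 * self.moving_mean + 0.01 * mean)
            self.moving_var.assign(0.99 * self.moving_var + 0.01 * var)
        else:
            # Inference: normalize with the moving statistics.
            mean, var = self.moving_mean, self.moving_var
        return (inputs - mean) / tf.sqrt(var + 1e-3)
```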

  • So there's actually a small subtlety

  • with this layer, which makes it, unfortunately, ungraphable,

  • which is basically doing assignments

  • of variables, so updates, in a branch of the conditional.

  • This would be relatively easy to avoid--

  • AUDIENCE: Why is that a problem?

  • FRANCOIS CHOLLET: So it's a problem

  • because each branch in a TensorFlow control flow V2

  • is going to be run in different FuncGraph.

  • And so--

  • AUDIENCE: [INAUDIBLE] control flow

  • V2 is perfectly fine as is.

  • AUDIENCE: It's because you only have an assign in one branch.

  • If you had a corresponding [INAUDIBLE]----

  • AUDIENCE: No, no.

  • Look at how [? the sign-on-- ?]

  • AUDIENCE: There might be a bug, but it's a bug, if this--

  • AUDIENCE: Yeah.

  • FRANCOIS CHOLLET: So, anyway, yeah, it--

  • AUDIENCE: It sounds--

  • FRANCOIS CHOLLET: --it's a bug.

  • So--

  • AUDIENCE: If you replace the assign with self.add_update,

  • then you have a problem

  • because you could be looking at the tensor from a control flow

  • branch.

  • But assign is safe.

  • It's [INAUDIBLE]--

  • FRANCOIS CHOLLET: So last time I tried, this wasn't working.

  • If you moved the assign outside of the if statement,

  • it would be working.

  • So, but anyway, let's say you have a Tree-LSTM layer.

  • You cannot graph it.

  • What do you do?

  • You pass in the constructor.

  • When you call super, you pass dynamic=True.

  • This tells the framework that this layer cannot be graphed.

  • And when using the functional API,

  • it's never going to be used to build a graph.

  • And when you call fit or evaluate,

  • we are always going to run everything eagerly,

  • even if you don't explicitly set run_eagerly.

  • One thing to keep in mind if you have dynamic layers

  • is that if you want to use them in the functional API,

  • you will have to implement a compute_output_shape

  • method to tell the framework how to do static shape

  • inference with this layer.

  • If you cannot do static shape inference,

  • you cannot use your layer in the functional API, unfortunately.

  • So if you try to use your layer in a symbolic way,

  • it's going to raise an error.

  • So that's it.

  • And yeah.

  • And so when you call fit, it's automatically

  • going to be run eagerly.

  • So of course, if you're not using the functional API

  • and you're not using fit, then this dynamic argument

  • is irrelevant to you.
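
A sketch of a dynamic layer, with a placeholder call body standing in for eager-only logic like a Tree-LSTM:

```python
from tensorflow import keras

class MyDynamicLayer(keras.layers.Layer):
    def __init__(self, units, **kwargs):
        # dynamic=True tells the framework this layer cannot be graphed;
        # it will always run eagerly.
        super().__init__(dynamic=True, **kwargs)
        self.units = units

    def call(self, inputs):
        # Imagine arbitrary eager-only Python logic here,
        # e.g. recursion over a tree structure.
        return inputs[..., :self.units]

    def compute_output_shape(self, input_shape):
        # Static shape inference, required to use this layer
        # in the functional API.
        return tuple(input_shape)[:-1] + (self.units,)
```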

  • So let's talk a little bit about training and inference.

  • So as we said last week, so compile

  • is about basically configuring the training procedure.

  • So it's all about building, essentially,

  • an execution function, an execution graph.

  • Meanwhile, fit is about running this execution function

  • with new data over a data set.

  • So what happens when you call compile?

  • Of course, checking the values of the arguments passed by the user,

  • because it's always important to do sanity checks

  • and raise good error messages if the user has made a mistake.

  • Then we are going to look at the loss argument.

  • We're going to map it to the different outputs.

  • And not all outputs might be passed to loss.

  • You're going to do the same for metrics as well.

  • You're going to compute the total loss, which

  • is the sum of the per output losses,

  • and any loss that's being added during the forward pass.

  • You're going to retrieve the trainable weights

  • of the model and use the total loss

  • and trainable weights and the optimizer

  • that you passed to compute gradients.

  • Then you're going to prepare the inputs, outputs, and updates

  • for different execution functions, which

  • are basically FuncGraphs.

  • We have three different execution functions.

  • You have the trained function, which

  • takes input data and input targets

  • and is going to run the backprop updates and the forward pass

  • updates, and is going to return the loss and metric values.

  • Then you have the eval function, which does the same,

  • except without running any updates.

  • So it's not doing backprop.

  • It just takes input data, targets,

  • and it returns loss and metrics.

  • No updates.

  • And finally, you have the predict function,

  • which is like eval, except with different outputs.

  • It's going to-- so it's not going to run any updates.

  • It's taking just the input data,

  • and it's going to be returning the outputs of the model.

  • It's not going to be returning the loss or the metrics.

  • And the way these execution function

  • are implemented in TensorFlow V2 is as FuncGraphs.

  • When you are calling a layer symbolically,

  • so in the functional API, that happens in a global Keras

  • graph.

  • And when you're creating an execution function,

  • you're going to be creating a new scratch FuncGraph,

  • and you're going to be doing a copy of the relevant subgraph

  • from the global Keras graph to your new graph.

  • And then that's this copy that you're going to be running.

  • So that's-- if you're creating different models,

  • the execution functions are all separate,

  • living in different FuncGraphs.

  • AUDIENCE: So here we have--

  • so it seems like there's some weird things where--

  • like for example, an optimizer is not used for eval

  • and predict?

  • What happens if you-- can you just not set an optimizer?

  • FRANCOIS CHOLLET: So, yes, you can just not do compile,

  • and then do predict.

  • That works, because you don't need to compile information

  • in order to run predict.

  • However, you do need to compile your model

  • if you want to run eval, because eval

  • needs to compute loss and metrics, which

  • are passed in compile.

  • AUDIENCE: Well, I guess-- but what if you don't--

  • like, you're only doing eval.

  • Do you need an optimizer?

  • FRANCOIS CHOLLET: Technically, yes.

  • I think it's possible you could pass None as the optimizer,

  • and that would work.

  • I don't recall.

  • AUDIENCE: I think it's required.

  • FRANCOIS CHOLLET: It's required?

  • Yes, so in that case you can just pass a default optimizer,

  • like the string 'sgd'.

  • It's going to work fine.

  • AUDIENCE: OK, but I guess then it's

  • going to create a bunch of graphs

  • that I'm never going to use, but--

  • FRANCOIS CHOLLET: So the execution functions

  • are actually created lazily.

  • So they're not actually created in compile.

  • It's a good mental model to think

  • about them as created in compile, if you want to think about it,

  • but actually--

  • for instance, it's when you call fit that the train function is

  • going to be created.

  • It's when you call evaluate that the eval function is

  • going to be created, and so on.

  • So if you just instantiate your model,

  • compile it, and call evaluate, you're

  • not going to be creating any graphs involving

  • the optimizer, because you're only

  • creating the eval function, which does not use it.

  • AUDIENCE: I think it'd be [INAUDIBLE]

  • optimizer equal to None, specifically.

  • FRANCOIS CHOLLET: But it's kind of a niche use-case.

  • So anyway, you can definitely instantiate a model

  • and call predict without having called compile,

  • and that totally works.

  • AUDIENCE: Because-- are any of the--

  • so none of the arguments from compile are used for--

  • FRANCOIS CHOLLET: Are useful for predict.

  • That's right.

  • AUDIENCE: OK.

  • FRANCOIS CHOLLET: Because predict is just

  • input-to-output mapping.

  • AUDIENCE: But aren't outputs also prepared in compile?

  • So just prepared [INAUDIBLE]?

  • [INAUDIBLE] predict?

  • AUDIENCE: [INAUDIBLE] when should people--

  • let's say they're using model [INAUDIBLE]

  • predict versus just calling the model and then evaluate.

  • FRANCOIS CHOLLET: So model.predict

  • is going to iterate over the data you passed in mini-batches

  • and is going to return numpy arrays.

  • Meanwhile, calling the model on an EagerTensor

  • is the same as calling a layer, so it returns, directly,

  • this value.

  • If you have a single batch, there is no difference.

  • AUDIENCE: In a lot of cases, if you call model on something,

  • it goes through the eager path, whereas if you call

  • predict it goes through this function path.

  • And so if you're sensitive, essentially,

  • to the graph execution versus the eager time,

  • predict can be much faster.

  • FRANCOIS CHOLLET: Yeah.

  • But in terms of end output, if you have a single batch,

  • there's no difference.

  • But the big difference is that to predict, you could pass,

  • for instance, a data set, right?

  • Which you cannot pass in call.

  • Yeah.

  • So what happens now when you call fit?

  • There's pretty extensive checking

  • of the user-provided data, as usual,

  • checking that the correct data is being passed,

  • that there's correct shapes or rank and so on.

  • So optionally, we can set aside the validation splits.

  • That's only true if the input data is numpy arrays

  • or EagerTensors.

  • That's not possible if you pass a dataset or a Python

  • generator.

  • Then we prepare the callbacks.

  • So importantly, everything that happens dynamically

  • during training, apart from executing the graph functions,

  • is structured as a callback.

  • In particular, the logging that we do internally

  • and the display that we do of the progress bar,

  • these are all callbacks.

  • And if the user also wants to do some action

  • dynamically during training, they

  • have to implement a callback.

  • AUDIENCE: But how often are callbacks called?

  • FRANCOIS CHOLLET: So they're called at different point

  • during training.

  • So a callback implements the methods

  • on_train_begin and on_train_end, which

  • are called at the very beginning before training

  • and after training is over.

  • Then for each epoch you have a method

  • on_epoch_begin and on_epoch_end, so called

  • before the start of an epoch and after the epoch is finished.

  • And finally, there are the batch-level methods:

  • on_batch_begin and on_batch_end.
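
A sketch of a custom callback implementing these hooks (the print statements are illustrative):

```python
from tensorflow import keras

class VerboseCallback(keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        print("Training is starting.")

    def on_epoch_begin(self, epoch, logs=None):
        print(f"Starting epoch {epoch}.")

    def on_batch_end(self, batch, logs=None):
        # `logs` holds the loss of the last batch and the moving
        # values of the metrics.
        print(f"Batch {batch}: {logs}")

    def on_train_end(self, logs=None):
        print("Training is over.")

# Usage: model.fit(x, y, callbacks=[VerboseCallback()])
```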

  • AUDIENCE: But is the infrastructure

  • able to know whether a particular callback

  • implements on_batch_begin or on_batch_end,

  • to avoid maybe a per-batch overhead?

  • FRANCOIS CHOLLET: So if the method is not implemented,

  • there is virtually no overhead to calling it.

  • And we need to call it, anyway, for things like the [INAUDIBLE]

  • aggregator and so on.

  • AUDIENCE: For what?

  • FRANCOIS CHOLLET: The logging.

  • The logging callbacks, basically.

  • AUDIENCE: I guess in cases where we're

  • trying to put the whole training on a device, for example,

  • or we're trying to do, say, remote execution--

  • FRANCOIS CHOLLET: Yeah.

  • AUDIENCE: Where per batch, execution might be--

  • FRANCOIS CHOLLET: Yeah.

  • So one thing we've considered-- this

  • is a topic that has come before with TPUs and so on.

  • One thing we've considered is having a global--

  • well, an argument in fit, or something,

  • that specifies how often batch-level callbacks should

  • be called, like for instance every 30 batches and so on.

  • AUDIENCE: So the batch--

  • if a batch callback is only called

  • every 30 batches, or something like that, is that going to--

  • I mean, how does it work?

  • Does the-- are callbacks expecting

  • to see something every batch?

  • Are they going to still work if they're

  • called every 30 batches?

  • FRANCOIS CHOLLET: Typically that's still going to work.

  • The way it's going to work is that from the perspective

  • of your callback, it's going to be

  • the same as if your batches were 30 times bigger.

  • AUDIENCE: So the batch callback could

  • be called [INAUDIBLE] batches?

  • AUDIENCE: Is it--

  • FRANCOIS CHOLLET: Yes.

  • AUDIENCE: Does it get the output?

  • AUDIENCE: When?

  • FRANCOIS CHOLLET: This is a speculative API, right?

  • We don't have this implemented,

  • but it is one way we've considered

  • handling the fact that you're probably

  • going to be wanting, at some point,

  • to do processing of multiple batches

  • at once on device with no contact with the host.

  • AUDIENCE: So I guess, could you just

  • tell us what on_batch_begin arguments are?

  • Like, what sort of information is passed at this point?

  • FRANCOIS CHOLLET: Right, so it receives a log dictionary.

  • It receives the index of the batch as well.

  • The log dictionary contains the current value of the total loss

  • and the metrics.

  • AUDIENCE: So that-- the loss from the last batch,

  • or the loss [INAUDIBLE]?

  • FRANCOIS CHOLLET: So that's the loss from the last batch,

  • and for the metrics, the moving

  • value of the metrics.

  • AUDIENCE: But even if we change the callbacks, I think--

  • and as long as the loop itself is in Python,

  • then it doesn't really help, right?

  • Even if you're not calling a callback,

  • as long as a loop is still in Python, it doesn't really--

  • AUDIENCE: [INAUDIBLE] try to turn the loop

  • into a tf.function?

  • AUDIENCE: Yeah, I think that would be required as well.

  • AUDIENCE: [INAUDIBLE] the expectation

  • is that the callbacks are operating numpy [INAUDIBLE] not

  • tensors.

  • AUDIENCE: And I think we need to change the APIs so

  • that the callbacks are operating [INAUDIBLE] so

  • that we-- that would work.

  • AUDIENCE: I mean, I think in a perfect world,

  • we would use a mixture of Python and py function to essentially

  • only run in Python the parts that we want,

  • while still keeping the outer part.

  • But I mean, we're not there yet.

  • AUDIENCE: And since callbacks are passed down through this,

  • can I rely on the sequence in which they're passed?

  • Is that the sequence in which they're going to be called?

  • FRANCOIS CHOLLET: Yes, so the sequence

  • in which they are called matches the order in which they're

  • passed in the callbacks list that you pass to fit,

  • meaning that it is possible for one of your callbacks

  • to add some values to the log dictionary.

  • It's the same log dictionary that's

  • going to be seen by the callbacks after that.

  • It's possible to do cascading processing methods.

  • AUDIENCE: And the default callback,

  • the progress bar and stuff are called at the end?

  • FRANCOIS CHOLLET: The progress bar and logging stuff

  • is called at the very end, meaning

  • that if you add stuff to the log dictionary,

  • it's going to be displayed.

  • But the very first callback that's passed

  • is actually the callback that starts populating the logs.

  • AUDIENCE: There's also a slight bit of nuance

  • in that the model itself is set as an attribute on callbacks,

  • and so any attempt to, essentially, optimize this

  • or put this in a tf.function scope

  • would have to be aware of the fact

  • that it's a legitimate use case for a callback

  • to start accessing arbitrary parts of the model.

  • AUDIENCE: Right.

  • I guess my question is to more about running callbacks

  • less frequently, not about trying

  • to run them inside a function.

  • FRANCOIS CHOLLET: OK.

  • So our last topic is losses and metrics.

  • The losses are very simple.

  • So they're just subclasses of this loss base class.

  • They just have to implement one method, which

  • is call, just like layers.

  • The signature would be different, though.

  • Yeah, the signature is y_true, y_pred.

  • Optionally, you can also have sample weights.

  • Yeah.

  • And you just return a loss.

  • So importantly, as a convention, the loss return

  • should be one loss value per sample.

  • So here, for instance, we are reducing on the last axis,

  • but we are not returning a scalar.
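
A sketch of such a loss subclass, following the per-sample convention:

```python
import tensorflow as tf
from tensorflow import keras

class PerSampleMSE(keras.losses.Loss):
    def call(self, y_true, y_pred):
        # Reduce over the last axis only: by convention we return
        # one loss value per sample, not a single scalar.
        return tf.reduce_mean(tf.square(y_pred - y_true), axis=-1)
```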

  • Metrics are a bit more involved.

  • There are three different methods you should be

  • implementing in your metrics.

  • You have this update_state method,

  • which is very similar to the call method for losses,

  • so y_true, y_pred, sample weight in the signature.

  • And the difference is that you're not returning a value.

  • You're just updating the internal state of your metric.

  • So in this case, your internal state

  • is just this one scalar weight called true positives.

  • So here, you are just adding to this value.

  • Then the second method you have to implement

  • is result, which, as the name indicates,

  • just returns the current value for this metric.

  • And finally, you need a method to reinitialize

  • the state of your metric.

  • And what happens, for instance, when

  • you call fit is that at every batch,

  • we're going to be updating the state of the metrics

  • by calling this update_state method.

  • When you want to report a value, which--

  • well, typically it's also after every batch,

  • we call this result method.

  • And at the end of an epoch, we want to reset the metrics,

  • so we call reset_states.

  • And the way you specify metrics in the compile API

  • is basically you just have this metrics argument

  • in compile that takes a list of these metric instances.
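
A sketch of a metric with these three methods, modeled on a true-positive counter (simplified), plus the compile usage just described:

```python
import tensorflow as tf
from tensorflow import keras

class BinaryTruePositives(keras.metrics.Metric):
    def __init__(self, name="true_positives", **kwargs):
        super().__init__(name=name, **kwargs)
        # The internal state: a single scalar weight.
        self.true_positives = self.add_weight(name="tp", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Same signature as a loss's call, but we update internal
        # state instead of returning a value.
        values = tf.logical_and(
            tf.cast(y_true, tf.bool), tf.cast(y_pred, tf.bool))
        values = tf.cast(values, self.dtype)
        if sample_weight is not None:
            values *= tf.cast(sample_weight, self.dtype)
        self.true_positives.assign_add(tf.reduce_sum(values))

    def result(self):
        return self.true_positives

    def reset_states(self):
        self.true_positives.assign(0.0)

# Usage in compile:
# model.compile(optimizer="sgd", loss="binary_crossentropy",
#               metrics=[BinaryTruePositives()])
```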

  • So what if you have metrics with signatures or requirements

  • that do not match this API-- the assumption that you can update the state

  • based on y_true, y_pred, and some sample weights,

  • for instance?

  • One thing you can do is write a layer that will,

  • inside its call method, call self.add_metric

  • on a tensor.

  • And that enables you--

  • there are two arguments to pass.

  • You pass a name, because that's

  • what's going to be reported with the progress bar and your logs,

  • and so on, and then an aggregation argument,

  • which tells the framework how the--

  • so you can assume that this is basically

  • called for every batch.

  • So how do these different values for every batch

  • get aggregated into a single scalar, right?

  • And then in the functional API, you

  • can use this layer like this.

  • You insert it at some point, and it just returns

  • its input tensors unchanged, which you can keep using.

  • And when you do that, you get a model

  • with a forward pass that's going to be calling

  • this add_metric at every batch.
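
A sketch of such a pass-through metric-logging layer (the metric itself is a made-up example):

```python
import tensorflow as tf
from tensorflow import keras

class ActivationLogger(keras.layers.Layer):
    def call(self, inputs):
        # `name` is what gets reported in the progress bar and logs;
        # `aggregation` says how the per-batch values are reduced
        # to a single scalar (here, a mean over batches).
        self.add_metric(tf.reduce_mean(tf.abs(inputs)),
                        name="mean_abs_activation",
                        aggregation="mean")
        return inputs  # pass-through: keep using the same tensor

# In the functional API:
# x = ActivationLogger()(x)
```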

  • AUDIENCE: Are there any standard, predefined versions

  • of metric log layers?

  • FRANCOIS CHOLLET: No, but it looks just like this.

  • So there's nothing special you have

  • to do, apart from implementing call with add_metric in it.

  • AUDIENCE: How does that work in eager execution?

  • FRANCOIS CHOLLET: In eager execution,

  • this is literally called at every batch.

  • AUDIENCE: So add a new metric at every batch,

  • or if two people created metrics at different--

  • with different metrics the same name?

  • Can you cache that?

  • AUDIENCE: We keep track of the metric, by name,

  • and just call the same metric again.

  • So--

  • AUDIENCE: Actually, so if you have two layers

  • and they call that metric, and they both use the same name,

  • then you're going to have a collision?

  • AUDIENCE: We raise an error in that case.

  • AUDIENCE: Oh, so like--

  • AUDIENCE: We can detect.

  • AUDIENCE: You've just detected the layer?

  • But if the layer has two metrics of the same name--

  • AUDIENCE: Same--

  • AUDIENCE: --assume that's intentional?

  • FRANCOIS CHOLLET: So essentially,

  • in one forward pass, you cannot have the same name twice.

  • But across different forward passes, so

  • across different calls of your metric--

  • AUDIENCE: [INAUDIBLE]

  • FRANCOIS CHOLLET: --of your model,

  • you're going to be aggregating the metrics based on the name.

  • AUDIENCE: So there is some state that gets reset?

  • FRANCOIS CHOLLET: Yes, that's right.

  • AUDIENCE: So is this similar to the thing

  • where layers know if they're being called by another layer?

  • FRANCOIS CHOLLET: Yes, it's very similar.

  • It's basically call context.

  • And at the end of the call context, you reset the state.

  • So you reset the losses that were created

  • during the forward pass, and you reset

  • the state of the metric aggregation thing.

  • Right.

  • But what if this is not actually enough for you,

  • because you want a metric that not only sees

  • the input of your model, but also the targets?

  • Which is actually relatively common.

  • One thing you could use--

  • so, of course, if you're just doing a model

  • subclassing or writing your own custom training loops,

  • you have no restrictions whatsoever,

  • so it's not relevant to you.

  • But what if you really want to use fit with these very

  • arbitrary metrics?

  • Well, one thing you can do is the endpoint layer pattern.

  • So how does it work?

  • It's basically a layer that, in the functional

  • API, you would put at the very end of your model,

  • so it computes predictions, right?

  • And it's going to take as input whatever you want.

  • In our case, it's going to take the targets of the model

  • and logits generated by this dense layer here.

  • And the targets are an input, right, created here.

  • And then what is it going to do with inputs and targets?

  • It's going to compute a loss value--

  • so it returns a scalar, in this case, because we call it

  • within this __call__, which

  • is different from the plain call method.

  • So it's automatically reduced in this case.

  • You can add whatever metrics you want.

  • Note that if you add the training argument here,

  • you could be logging different metrics in training

  • and inference.

  • And finally, you return what you would want the predict method

  • to return, so a softmax.
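
A sketch of the endpoint layer pattern being described, along the lines of the slide (the loss choice and layer sizes are assumptions):

```python
import tensorflow as tf
from tensorflow import keras

class LogisticEndpoint(keras.layers.Layer):
    def __init__(self, name=None):
        super().__init__(name=name)
        self.loss_fn = keras.losses.BinaryCrossentropy(from_logits=True)
        self.accuracy_fn = keras.metrics.BinaryAccuracy()

    def call(self, targets, logits, sample_weights=None):
        # Compute the training-time loss and add it to the layer.
        self.add_loss(self.loss_fn(targets, logits, sample_weights))
        # Log the accuracy as a metric.
        self.add_metric(self.accuracy_fn(targets, logits, sample_weights),
                        name="accuracy")
        # Return what predict() should return: the probabilities.
        return tf.nn.softmax(logits)

# Wiring it up in the functional API; the Input names are what the
# data dictionary keys must match.
inputs = keras.Input(shape=(3,), name="inputs")
targets = keras.Input(shape=(10,), name="targets")
logits = keras.layers.Dense(10)(inputs)
predictions = LogisticEndpoint(name="predictions")(targets, logits)
model = keras.Model(inputs=[inputs, targets], outputs=predictions)
```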

  • AUDIENCE: Well, it seems to me like you can't

  • use this with inference at all.

  • FRANCOIS CHOLLET: So in inference,

  • you would be using the same layer, for instance.

  • But what you can do is you have to rewire your model,

  • because you're not going to have these targets inputs.

  • But you can use the same layer, and the only thing

  • you need to be mindful of is that you

  • should do a conditional check on the inputs that

  • were passed to see if there is a target key in there or not.

  • Or you could rely on the training argument.

  • Also works.

  • AUDIENCE: So you're saying that if you create this model

  • like you show it in there, and then I call

  • and model that predict on it--

  • FRANCOIS CHOLLET: So if you just reuse this model,

  • this is a model that's already functioning.

  • If you reuse it for predict, you're

  • going to have to pass some value--

  • some dummy value, for the targets.

  • So if you want to not have to pass dummy value for targets,

  • you need to redefine a new model that is going

  • to be using the same layers.

  • The only difference is that it's not

  • going to instantiate this target's input object

  • and it's not going to be passing the targets

  • to the LogisticEndpoints.

  • And the LogisticEndpoint is going to ignore--

  • it's not going to attempt to access the targets key.

  • Yeah.

  • So when you instantiate this model like that, you say

  • it starts from inputs and targets.

  • It returns predictions.

  • And when you fit it, you fit it with a dictionary, or a dataset

  • that returns a dictionary, and you include the targets data

  • in that dictionary.

  • So like this, right?

  • And when you compile it, you are not

  • going to specify any loss in compile,

  • because the loss is added entirely

  • inside that endpoint layer.

  • So you just specify the optimizer.
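
Putting it together, a sketch of the compile and fit calls just described (the data here is random placeholder data):

```python
import numpy as np

data = {
    "inputs": np.random.random((3, 3)),
    "targets": np.random.random((3, 10)),
}
model.compile(optimizer="adam")  # no loss: it's added inside the layer
model.fit(data, epochs=1)
```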

  • AUDIENCE: So is the name in, like-- when

  • you define the inputs and the Input for targets,

  • is the name in there supposed to match

  • the key in the dictionary?

  • FRANCOIS CHOLLET: Yes.

  • So when you call this layer, you're passing this dict.

  • This dict is--

  • AUDIENCE: No, not the LogisticEndpoint.

  • I mean the input layer.

  • How does it know, of inputs and targets,

  • which is the target, instead of the other way around?

  • FRANCOIS CHOLLET: So, OK.

  • So it's a bit confusing because we have a naming collision

  • here, but in your data dictionary,

  • the keys are supposed to match the names

  • that you give to your inputs.

  • AUDIENCE: OK.

  • FRANCOIS CHOLLET: So here, you're

  • creating inputs that's named inputs

  • and you'd create targets that's named targets.

  • And this is what these keys here are referring to.

  • So yeah, to deconfuse, we could have

  • chosen another name for the inputs

  • to the LogisticEndpoint layer.

  • AUDIENCE: Question.

  • FRANCOIS CHOLLET: Yeah.

  • AUDIENCE: Is it possible-- like, in here,

  • you are declaring the loss and adding loss on the layer level.

  • Is it possible to just declare the loss

  • and then use the loss as add_metric, and then don't

  • do softmax, and then just return the y_true and y_pred,

  • just pass these through?

  • Like, later on, like, you still can declare a loss

  • on the model level?

  • AUDIENCE: So that--

  • I don't think there's a way to tell Keras

  • that one of these outputs of the model is the labels.

  • FRANCOIS CHOLLET: Yeah.

  • So if you want to pass your loss in compile,

  • it means that the targets are going

  • to be passed separately in fit.

  • And it means that your loss should match the signature

  • y_true, y_pred, sample weight.

  • So if you have anything that's outside of this template,

  • you actually do need to use add_loss

  • and to use this pattern, which is very general.

  • Like, there's literally nothing you cannot implement with this

  • pattern.

  • And again, the purpose of this pattern

  • is to make it possible to use fit, the plain call to fit,

  • with complicated loss and metric setups.

  • Yeah.

  • So what if you have a use case that's even more complicated

  • and you still want to use fit?

  • Well, the answer is don't, because fit is really

  • this built-in thing that's not meant for every use case.

  • It's meant for common use cases.

  • If you need more flexibility than what's

  • provided by this endpoint pattern, for instance,

  • you should really write your own loops,

  • which is not difficult at all.

  • So that's it for the overview of Keras internals.

  • And I see we actually spent like 45 minutes.

  • [LAUGHTER]

  • We did.

  • So it's nice that we had the two full sessions.

  • Yeah, so thank you very much for listening

  • and for the interesting questions.

  • AUDIENCE: Thank you.

  • This was very informative.

  • AUDIENCE: Yeah, thank you.

  • [APPLAUSE]

  • AUDIENCE: So could you highlight--

  • you talked a lot about how everything

  • works in functional models.

  • FRANCOIS CHOLLET: Yeah.

  • AUDIENCE: Could you highlight just any differences

  • with subclass models or sequential models or anything

  • like that?

  • Because I know for sequential models, for example--

  • FRANCOIS CHOLLET: So--

  • AUDIENCE: --build happens at a different point than--

  • FRANCOIS CHOLLET: Yes.

  • So I covered writing things from scratch towards the beginning,

  • but basically, there is very, very little to know about it,

  • because everything is explicit.

  • And you are in charge of doing everything.

  • So because of that, there's not much happening under the hood.

  • There is virtually no hood.

  • But yeah, basically the first 10 slides in this presentation

  • covered pretty much everything you need to know.

  • AUDIENCE: I guess it felt like it

  • covered the difference in [INAUDIBLE] perspective,

  • but is there anything, like things that we might expect

  • to be true about models, like that they're already

  • built by the time they get to fit or something

  • like that that may not always be true for a sequential model

  • [? or something? ?]

  • FRANCOIS CHOLLET: So if you're using a subclass model

  • and you're using fit, one thing to keep in mind,

  • actually, is that when you call fit, the model is not built,

  • so the framework is going to be looking at the input data you

  • pass to fit and going to assume that you made no mistake,

  • and that the model expects exactly the structure of input

  • data, and is going to use that to build a model.

  • AUDIENCE: OK.

  • So it calls build?

  • AUDIENCE: So build happens in fit in, for example, class--

  • AUDIENCE: And--

  • AUDIENCE: --models?

  • FRANCOIS CHOLLET: So that's only if you're using

  • a subclass model plus fit.

  • If you're using a subclass model plus a custom training loop

  • like, yeah, this one, for instance,

  • there's really nothing special in it, you know?

  • AUDIENCE: OK.

  • AUDIENCE: Is the role of train_on_batch, test_on_batch,

  • and predict_on_batch related to the last slide, where

  • you said that if you do something complicated,

  • write your own loops?

  • FRANCOIS CHOLLET: Yeah.

  • AUDIENCE: Is that why they exist?

  • FRANCOIS CHOLLET: Yes.

  • So these are ways to run the train, eval, and predict

  • execution functions for a single batch.

  • And they are useful if you want to customize your training

  • loop, but you don't actually need the level of flexibility

  • that's provided by the GradientTape, for instance.

  • Essentially, you're going to be using that if you're

  • doing GANs, for instance, or if you're

  • doing reinforcement learning.

  • Manually computing the gradients with the GradientTape and so

  • on is mostly useful if either you

  • want to understand every step of what you're doing

  • and not delegate anything to the framework,

  • or if you need to do some manipulation on the gradients

  • themselves, like some form of gradient normalization

  • that's not covered by the optimizer API,

  • or you want, maybe, to actually have your gradients be

  • generated by a different model, like synthetic gradients.

  • Yeah.

  • These type of advanced use cases involve

  • manipulating the gradients.
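
Sketches of both options, assuming model, x_batch, and y_batch exist (the clipping is just one example of a gradient manipulation):

```python
import tensorflow as tf
from tensorflow import keras

# Per-batch entry point: one step of the train execution function.
logs = model.train_on_batch(x_batch, y_batch)

# Manual gradients with a GradientTape, e.g. to transform the
# gradients in a way the optimizer API doesn't cover:
loss_fn = keras.losses.MeanSquaredError()
optimizer = keras.optimizers.SGD()
with tf.GradientTape() as tape:
    loss = loss_fn(y_batch, model(x_batch, training=True))
grads = tape.gradient(loss, model.trainable_weights)
grads = [tf.clip_by_norm(g, 1.0) for g in grads]  # custom normalization
optimizer.apply_gradients(zip(grads, model.trainable_weights))
```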

  • AUDIENCE: Could you go back to the slide

  • where you showed the subclass model?

  • FRANCOIS CHOLLET: Which one?

  • This subclass model?

  • Let's see.

  • This one?

  • AUDIENCE: Sure.

  • So this doesn't actually define the build?

  • FRANCOIS CHOLLET: So you don't need to, because build would only

  • be there to create variables--

  • for this model, variables that need to know

  • about the shape of the inputs.

  • If you don't need to know about the shape of the inputs,

  • you can just create them in the constructor.

  • And in this case, this model has no variables of its own.

  • The only variables of this model come

  • from the underlying layers.

  • And when you call the layer for the first time, which

  • is going to happen in fit, you're

  • going to be calling this call method, which

  • in turn is going to be calling the underlying layers.

  • And that's when they are called, that internally they're going

  • to be calling the build method.

  • AUDIENCE: I had this impression that--

  • Igor and I came up with this case

  • where build was non-trivial, even for a subclass model?

  • AUDIENCE: I think you're referring to, like--

  • I think there is a place where you

  • call that model with like default inputs,

  • if those inputs are of such type that can have default values,

  • like a dummy call--

  • [INTERPOSING VOICES]

  • AUDIENCE: --and a build graphing?

  • FRANCOIS CHOLLET: Yeah, so that's

  • that specifically when you call fit on the subclass model that

  • has never been built before.

  • In that case, the framework is going

  • to make some assumptions, some inference,

  • based on the data you pass to fit.

  • If you want to avoid that, you can either explicitly

  • implement a build method and call it,

  • or you could call your MLP instance here

  • on one EagerTensor once.

  • And that's going to be sufficient to build the model,

  • because it's going to run this call method.

  • Well, first of all, it's going to call the build method

  • if it's implemented, so yeah, it doesn't do anything.

  • Then it's going to call the plain call method,

  • and in doing so it's going to call--

  • [INAUDIBLE] call for all the layers.

  • Each layer is going to call build, then

  • call the plain call.
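
A sketch of building a subclassed model with one dummy call, assuming the MLP class from the slide and an input width of 32:

```python
import tensorflow as tf

mlp = MLP()  # the subclassed model from the slide (hypothetical)
# One dummy eager call is enough to build every sublayer:
_ = mlp(tf.zeros((1, 32)))  # the input shape here is an assumption
```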

  • AUDIENCE: Do [INAUDIBLE] users use things

  • like add_loss and add_metric even if they're

  • doing a custom training loop?

  • FRANCOIS CHOLLET: Yes, that's totally fine.

  • So using add_loss with custom training loops works like this.

  • So after your forward pass,

  • model.losses has been populated.

  • And so you can just add the sum of these losses

  • to your main loss value.

  • For metrics, you're going to have

  • to query model.metrics and look at the names.

  • And so for metrics, if you're writing a custom training loop,

  • it's actually easier to do every step manually.

  • So start by instantiating your metrics outside of the loop.

  • Then inside the loop, for each batch, you call update.

  • Then at the end, or at the end of an epoch, you call results,

  • and you log that.

  • Add_metric is usable with custom training loops,

  • but it's not very ergonomic.

  • Add_loss is very ergonomic with custom training loops.
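
A sketch of both patterns in a custom training loop, assuming model, dataset, and loss_fn exist, and that the model's forward pass calls add_loss:

```python
import tensorflow as tf
from tensorflow import keras

optimizer = keras.optimizers.SGD()
# Instantiate the metric outside the loop.
accuracy = keras.metrics.SparseCategoricalAccuracy()

for x, y in dataset:
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = loss_fn(y, logits)
        # model.losses was populated during the forward pass by add_loss;
        # add the sum of these losses to the main loss value.
        loss += tf.add_n(model.losses)
    grads = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    accuracy.update_state(y, logits)  # update for each batch

print("epoch accuracy:", float(accuracy.result()))  # report
accuracy.reset_states()  # reset at the end of the epoch
```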

  • AUDIENCE: Are all the values in model.losses scalars

  • or per batch?

  • Or--

  • FRANCOIS CHOLLET: They're all scalars.

  • We don't support non-scalar losses.

  • Correct.

  • AUDIENCE: On the TensorFlow [INAUDIBLE]

  • how many can I access through Keras core layers?

  • And then, [INAUDIBLE] to be developing new core layers,

  • too?

  • FRANCOIS CHOLLET: What would be an example of--

  • AUDIENCE: For example, this [INAUDIBLE] in TensorFlow.

  • FRANCOIS CHOLLET: Yeah.

  • AUDIENCE: They're not used by Keras?

  • FRANCOIS CHOLLET: So Keras is meant

  • to support a number of sparse operations.

  • For instance, the dense layer is meant to support SparseTensors.

  • I'm not sure if it's true today, but it's definitely

  • supposed to work.

  • Other layers, maybe-- yeah.

  • I think it's mainly the dense layer.

  • Maybe activation is one.

  • But yeah, the large majority of layers, like LSTM,

  • conv, pooling, and so on, they're only for dense

  • data.

  • Dense float data.
