SPEAKER: We've been fairly ambitious here in terms of all the different kinds of distribution we want to support. In particular, I hear a lot from people who think of tf.distribute.Strategy as a way to get access to an all-reduce programming model. But really we are trying to do something that's much broader in scope. This introduces a number of challenges and is a bit more restrictive in terms of the programming model that users of tf.distribute.Strategy can use, and I'll go into that a little bit more. But I believe it's the right move, because it will allow not just a variety of different use cases, but a broad range of future work. What I am going to talk about is how easy it is to use and how we've made it so that you can switch between distribution strategies, because the strategy code is not really entangled with user code, with a few exceptions, which I'll go into. We've also done a lot of work to get very good performance, but I am not going to talk about that today.

It's very helpful to go through what these different training architectures look like, so you have some idea of the range of use cases we're trying to accommodate on the distribution side. Right now, we are doing only data parallelism in tf.distribute.Strategy. That means we are dividing up our input into a bunch of different pieces. The model itself is replicated across maybe 100 devices or workers, and there is some shared view of the model variables that all of those model copies will update via some mechanism that is specific to the particular strategy.

The oldest approach here is parameter servers and workers, which is what was used by DistBelief prior to TensorFlow's creation and was the model assumed by TensorFlow libraries prior to tf.distribute.Strategy. For example, Estimator pretty much assumes a parameter-server sort of model. What's going on here is that each worker task is going to be communicating with the parameter server tasks. Each variable lives on a single parameter server task. You might have multiple parameter server tasks if you have lots of variables or they're big; you would need more of them so that no single one becomes a bottleneck. This model was very well suited to the early uses of TensorFlow at Google, which were characterized by the fact that we had huge quantities of CPUs in a data center, we had large embedding matrices, and we would do very sparse lookups in those large embedding matrices.

The main thing that's different about the tf.distribute version, ParameterServerStrategy, is that it has much better support for multiple GPUs on a machine. We call this a between-graph strategy, which is a term from TensorFlow 1, where we talked a lot more about graphs. In TensorFlow 2 land, we don't really have a better term for this, but the idea is that each worker runs its own main function that operates independently and schedules operations on itself and the parameter server tasks it's communicating with. This allows the workers to run asynchronously, and that combination of between-graph and async makes it easy to tolerate failures. So you can run preemptible workers at lower priority, or whatever. As long as your parameter servers stay up, you can keep going without really any interruption.
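As a rough sketch of how a cluster like that gets configured in code-- the TF_CONFIG layout and the tf.distribute.experimental.ParameterServerStrategy constructor reflect the TF 2.0-era experimental API, and the hostnames are hypothetical-- each task in the cluster would run something like the following, typically handing the strategy to an Estimator rather than a custom loop at the time of this talk:

    import json
    import os
    import tensorflow as tf

    # Every task declares the full cluster spec plus its own role via
    # TF_CONFIG; variables are placed on the ps tasks, while the
    # computation runs on the workers.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
            "ps": ["ps0.example.com:2222"],
        },
        "task": {"type": "worker", "index": 0},  # differs on every task
    })

    strategy = tf.distribute.experimental.ParameterServerStrategy()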
There is a lot of RPC communication here, and this sometimes shows up as a performance blip at the beginning of a step, when it's reading variables-- particularly if the runtime isn't clever about the order in which it reads the variables. That can happen because there are some operations it can do on all the variables, but it really needs the variables for the first layers first. So we've seen some performance hiccups as a result of that.

Moving on to the new central storage strategy, which I think is still in review-- but basically I'm going to be talking about the vision of where we want to be. This is ParameterServerStrategy writ small: it restricts everything to a single machine, and now we're distributing not across machines so much as across the accelerators within a machine. But again, the implementation is almost identical; it's just the configuration that's different, because all of the variables are stored as a single copy on the CPU. This is also known as PSCPU in the TF CNN benchmarks terminology. On one machine, there's no need to have multiple clients or worry about asynchrony-- this can all run synchronously, so that's a little bit different. You might use this even on one machine if you have large embeddings that won't fit on your GPU, or if your particular model or hardware turns out to be particularly well suited to it. I think the TF CNN benchmarks have this giant matrix of what's the best configuration for any particular model and hardware combination. Something to experiment with.

Probably the thing tf.distribute.Strategy is most well known for, though, is all-reduce sync training. This is great where you have a good connection between all of your devices, because they're going to be operating basically in lockstep. We're going to do one training step across the whole job, and it's all going to be coordinated via a single client. This is used both for TPUs and for multiple GPUs within one machine, using MirroredStrategy or TPUStrategy. In this strategy, we mirror the variables of the model onto each of the devices. So let's say you had a variable-- call it A-- in your model. We're going to replicate it by having each device keep its own copy of the variable locally, so that it can just read it with no delay. Together, these form a conceptual mirrored variable-- there's actually a MirroredVariable object that we return to the user instead of the component variables, to store as their model member variables. Then we keep these in sync by applying identical updates. This is where we need some way of communicating between the different devices to make sure those updates are all the same, which brings us to all-reduce, a standard CS algorithm for communicating between devices in a network-efficient manner. Here, "all" means we're going to be communicating from every device to every device, and "reduce" means we're going to do basically a sum or a mean-- there are some other things, like max and min, that you can also do. This is really where the synchronization between the devices comes from, and where we've spent a significant amount of work adding new capabilities to the runtime. So how does synchronous training actually look?
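Before the step-by-step walkthrough, here is a minimal sketch of what mirroring and reducing look like in code, assuming the TF 2.0-era experimental_run_v2 name (later renamed Strategy.run); the replica function is just an illustrative toy:

    import tensorflow as tf

    # Uses whatever local GPUs are visible; with one device there is simply
    # one replica.
    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        # Created inside the scope, this becomes a MirroredVariable: one
        # component copy per device, kept in sync by identical updates.
        v = tf.Variable(1.0)

    def replica_fn():
        ctx = tf.distribute.get_replica_context()
        # Each replica reads its local copy of v with no cross-device delay.
        return v * tf.cast(ctx.replica_id_in_sync_group + 1, tf.float32)

    # Run once per replica, then reduce (sum) the per-replica results.
    per_replica = strategy.experimental_run_v2(replica_fn)
    total = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)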
So let's say we have a model with two layers on two devices, and each of those layers has a variable. There are really four variable components here, because we're keeping separate copies of each of the two variables on each of the two devices. Each device gets a subset of the training data and does a forward pass using just those local copies of the variables. Then, in the backward pass, we again compute gradients using those local variable values. We take those gradients and aggregate them using an all-reduce that communicates a single aggregated gradient value out to all the replicas, and each replica then applies that gradient to its local variable. Since the all-reduce produces the same value on all the replicas, the updates are all the same, and the values stay in sync across all the different replicas. So the next forward pass can start immediately-- there's no delay waiting for values, because by the end of the step we have local copies of all the updated values for all variables. Furthermore, we can get some parallelism here by overlapping the all-reduce of one layer's gradients with the computation of the gradients of other layers. That keeps both the network communication and the computation parts of whatever hardware you have busy at the same time, which is great for throughput and performance. We observe that this performs well on many, many models.

We want to scale this up to multiple machines. Members of our team-- [INAUDIBLE] and [INAUDIBLE]-- have built collective ops and a strategy for doing this on multiple machines using those new collective ops. So we have MultiWorkerMirroredStrategy, which implements this. It's a little bit experimental-- we're working on it-- but it works today, and you can use it. It employs these new collective ops and is, again, between-graph, so that each worker is only scheduling the ops that run on that worker, which is good for scalability. But it's the same model as mirrored strategy, where everything runs in lockstep, synchronizing the gradients on every step. We have a couple of different all-reduce implementations. I think it can use NCCL, but you can also do ring all-reduce within a machine and across machines, very similar to the multi-GPU situation; or you can aggregate within each machine, communicate across machines, and then broadcast, in a sort of hierarchical way. In different situations, one will perform better than the other.

The last strategy is OneDeviceStrategy, which we currently don't expose. But it's good for when you want to be able to supply a strategy to a function that's going to open that strategy scope, and you want it to work in a non-distributed context as well.

Now, I'm going to talk a little bit about things we don't have today but that may be coming, depending upon interest. If you would like us to prioritize some of this, let us know-- we have GitHub issues where we collect use cases and interest in these things. Once upon a time there was the now-deprecated SyncReplicasOptimizer, which combined the parameter-server style of variable storage with synchronized variable updates. You might also want a sort of hybrid strategy, where you have mirrored variables and all-reduce for most of your variables.
But if you have a large embedding that won't fit on your GPU, maybe you just put that on the CPU, as in the central storage strategy. We have GitHub issues tracking both of those possible features.

Similarly, model parallelism is something where we have some infrastructure in place so that we can eventually add it-- but it is not there yet. The idea is that you would specify a number of logical devices and then manually place particular ops or layers onto those logical devices. That computation would then be spread across those logical devices times the number of replicas of actual physical devices. Today, if you want to do model parallelism, I'm going to recommend Mesh-TensorFlow. It actually works and is out there. It has a different model of doing model parallelism, where instead it splits operations across lots of devices. In general, I think that's a good fit for backprop training, because you have to keep all these intermediate values from the forward step around in order to do the backward step, and if you just work it out, that is, I think, a better fit for that type of training. However, there is one case where the other sort of parallelism is really natural, which is doing input pre-processing as a separate job. That is a good fit because there are no gradients involved in input processing, so there's no need to hold on to intermediate values from the forward pass. If you're interested in this, there's another GitHub issue-- express your interest, explain your constraints. There are some questions when we implement this: for example, do we need to support complete reproducibility or deterministic allocation? Or could we just have a bunch of queues and threads running as fast as they can, and you get the records in the order that they come? It would be good to know what you need if you're interested in this.

OK. So that brings us to the next part: how do these strategies actually work under the covers? When we were designing this, we took a very simple view. We basically figured out what the code looked like written with MirroredStrategy and with ParameterServerStrategy, we saw what changed, and anything that changed became the responsibility of the strategy. What you learn doing this exercise is that keeping state and updating state are the things that change. So when you switch strategies, things like variables, batch norm updates, and metrics all need some little bit of support to work well in a distributed setting. We need a little bit of help to say, OK, we're about to compute an update for a variable; this is something you're going to need to do specially.

So what we've done is make the TF library, in general, responsible for including any changes needed to identify what has to change when you distribute. We can't really, at this point, take the sea of ops that you get from, say, saving your model to disk in a GraphDef and magically understand the intent behind all those ops in order to efficiently distribute it. We really need the hints that the library provides by delegating to strategy APIs. But the good news is that almost all of that is restricted to TensorFlow libraries-- and in many cases, just base classes, not subclasses. I'll get more into that later.
So the simplest case is if you're using something like Keras or Estimator, where we control the training loop in the TensorFlow library. In that case, we can make the experience very easy. But we are very interested in enabling new use cases where the user wants to control the training loop. In that case, you might need to do a little bit more. But hopefully we also give you new capabilities and lots of good performance, so you'll be happy to add a few things to your code in order to distribute your custom training loop. I will also talk a bit about what to do if you're a TensorFlow developer and you want to make your library work with distribution strategy because it somehow interacts with state that needs to be distributed. I'm not going to talk directly about making a new strategy. If you want to make a new strategy, you're going to need, at this time, pretty heavy involvement from the [INAUDIBLE] team, so talk to us-- we'll help you. Right now, we don't expect there to be a huge number of them. We have reached out to the Horovod team, and we have had some initial work on making a strategy with them.

So there are different API surfaces for each of these use cases. If you're making a new strategy, you have the whole of the library; in the other cases, the slices are much smaller. In the simplest case of Keras or Estimator, you basically just need to know the constructor of these strategies and this scope thing. If you just add these two lines to your Keras training loop using compile and fit, then-- modulo bugs and feature requests-- the intent is that it just works. You get mirrored for free. You get a big performance win. Basically, when you put everything inside the strategy scope, you're selecting that strategy, saying it's the current strategy that should be used for everything. The most important part of that is taking control of variable creation. So when you call, say, tf.keras.applications.ResNet50, it creates a bunch of variables-- or maybe it waits until you call model.compile or fit-- but whenever it creates the variables, it's inside the strategy scope. So we control variable creation, and we can use mirrored variables instead of regular variables. And all of these Keras libraries and library functions have been made distribute-aware. So really, there's not much else for the user to do.

With Estimator, it's again about two lines-- most of the API is just the constructor. All we really need to do is pass that distribution strategy to the Estimator's RunConfig, either for training or eval or both. When the Estimator calls the user's model function, it will call it once per replica inside the strategy scope automatically, and as a result it will use mirrored variables. It also uses a special variable creator: because we're going to call this function once per replica, we want to make sure it uses the same variables each time-- maybe different components of the mirrored variables, but the same mirrored variables. And Estimator has been extended to know how to merge the results of all of these model function calls into a single, coherent answer.
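Here is a sketch of those two integration points as they looked around TF 2.0-- the two-line Keras change and the Estimator RunConfig. The model choice, optimizer, and my_model_fn are placeholders, not anything prescribed by the talk:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    # Keras: create and compile the model inside the strategy scope, so
    # variable creation is intercepted and mirrored variables are used.
    with strategy.scope():
        model = tf.keras.applications.ResNet50(weights=None)
        model.compile(optimizer="sgd",
                      loss="sparse_categorical_crossentropy")
    # model.fit(train_dataset) then runs the distributed training loop.

    # Estimator: pass the strategy via RunConfig for training and/or eval.
    config = tf.estimator.RunConfig(train_distribute=strategy,
                                    eval_distribute=strategy)
    # estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=config)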
So this is a good time to talk about one of the concepts we have inside distribution strategy, which is mirrored versus per-replica values. Mirrored values are the same across all replicas. Mirrored variables are an example of this, where we go through a lot of effort to make sure they stay in sync. But there are other things that can be mirrored. For example, the output of a reduction-- when we aggregate the gradients, the result will also be mirrored across all the replicas. And in general, when we update a mirrored variable, we need to update it with mirrored values, so we know that the updates are all the same. Per-replica values are almost everything else-- basically, all the things that are going to be different on every replica, like the different inputs, the different activations, and so forth. We collect these values, mirrored or per-replica, into aggregate containers that hold a single value per replica. Mirrored variables are an example of a container type we actually hand to the user-- for example, the mirrored variables will be stored in the model as the model's member variables. And we've done some operator overloading so that, in the cases where an operation is safe to do on a mirrored variable, we will do it for you. Not all operations are safe, and I'll go into that later. Then there's this shared variable creator: in addition to intercepting variable creation in order to create mirrored variables instead of regular variables, we want to make sure that each call to the model function produces the same variables. There are a number of heuristics that we use to make sure that when you create a variable in each model call, we return the same variable that was created in the first model call.

Going on to custom training loops, we're now going to get a little more into the programming model that distribution strategy expects all callers to conform to-- it's a little more exposed when you're writing a custom training loop. You start from data sources, which are typically variables or datasets. I guess they could also be constants, which is not the most interesting example. Then each replica-- again, this is data parallelism-- is going to run a computation on that data. Then we have to somehow combine the results, and we have basically one tool here, which is a reduction (or potentially concatenation too, but that turns out not to be the common case). Now, what do you do with that reduction? If you are using an optimizer, you add up the gradients you get from all the different replicas and then broadcast that reduced value to where all the variables are. In the mirrored strategy case, the variables are in the same place as the computation, so that just becomes an all-reduce, and we use the result to update the variables. Another really common thing is that people want, say, the average loss. So we just take the reduced value, return it to the user, print it out. Great. A sort of new capability when you're using distribution strategy is to communicate the aggregate value from all the replicas back to all the replicas and then do some more computation.
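As a small illustration of that last point-- reducing across replicas and handing the aggregate back to every replica for further computation-- here is a sketch using the replica context's all_reduce (TF 2.0-era names; the arithmetic is just a toy):

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    @tf.function
    def step(x):
        def replica_fn(x):
            ctx = tf.distribute.get_replica_context()
            # Every replica contributes its value; the aggregated (mirrored)
            # sum comes back to every replica, which keeps computing with it.
            total = ctx.all_reduce(tf.distribute.ReduceOp.SUM, x)
            return x / total
        normalized = strategy.experimental_run_v2(replica_fn, args=(x,))
        # Reduce again just to return a single ordinary tensor.
        return strategy.reduce(tf.distribute.ReduceOp.SUM, normalized, axis=None)

    result = step(tf.constant(3.0))  # the per-replica shares sum to 1.0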
So hopefully, this is going to unlock some doors by allowing more complicated distributed algorithms from researchers. We'll hopefully see the MPI style of distributed computation a lot more now that distribution strategy is available. This is just a pictorial representation of what I was just talking about. The dashed arrows are per-replica values-- activations and so forth-- that can be the input to a reduce in order to become a mirrored value, which, as I said, can be used to update a variable or just returned to the user.

Now, I'm going to have several slides where I go through an example of using a custom training loop in detail. There's going to be a little bit here that is future work, but it's all in the plan. This example is going to take a few slides, so I can go into some detail. I'm going to show you how to make a custom training loop. Like in the Keras example before, we create a strategy and open its scope. Not every operation actually has to be inside the scope, but it's much simpler if we just put everything inside the scope, since that works, and it's a simple rule. In the future, you won't have to worry about what goes in and what goes out. Just put everything inside-- it works great.

The first thing we do in this example is create a dataset, and we're going to pass it the global batch size, just like in the Keras case. It's the strategy's job to split that across replicas. For now, we need users to explicitly wrap their datasets using a strategy method we call experimental_distribute_dataset. In the future, we'll do this automatically for any dataset iterated inside a strategy scope. If the automatic splitting algorithm is inappropriate for whatever reason, you can manually specify how to split your dataset, using a function that takes an input context and returns a dataset with a per-replica batch size.

Just like in the Keras case, the scope controls variable creation. So whenever you create your model, it's best to do that inside the scope, so that any variables will be created using the policy dictated by the strategy. Originally, we tried making the Keras loss classes automatically scale the loss values according to the number of replicas. We found that led to some user confusion, so for now we've switched to requiring users to explicitly specify a NONE reduction and do the reduction as part of a later step that you'll see in a future slide. Alternatively, you can just use any tensor-to-tensor function directly. In addition, optimizers have been made distribute-aware; I'll talk about that in detail later.

Here, we define the function with the main computation of our training loop that we're going to perform every step. This function will be called once per replica, at least in the mirrored strategy case. The model function may create variables, at least on the first call. That's important, because that's frequently something Keras will do if the input shape was unavailable at the time the model was created. But this is fine, since we're going to be running this inside the strategy scope, and variables will still use the strategy's policy. Here's where we're going to average the loss using the global batch size, which is a good policy independent of how many replicas you have.
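To ground the walkthrough so far, here is a sketch of the pieces described up to this point-- the globally batched dataset wrapped with experimental_distribute_dataset, the model and optimizer created under the scope, and a NONE-reduction loss. The toy data and one-layer model are placeholders, and the API names are the TF 2.0-era ones this talk uses; the per-replica step itself is sketched a little further below:

    import tensorflow as tf

    GLOBAL_BATCH_SIZE = 64
    strategy = tf.distribute.MirroredStrategy()

    # Toy data standing in for a real input pipeline.
    features = tf.random.uniform([256, 10])
    labels = tf.random.uniform([256, 1])

    with strategy.scope():
        # Batch with the *global* batch size; the strategy splits it across
        # replicas when we wrap the dataset below.
        dataset = tf.data.Dataset.from_tensor_slices(
            (features, labels)).batch(GLOBAL_BATCH_SIZE)
        dist_dataset = strategy.experimental_distribute_dataset(dataset)

        # Variables for the model and optimizer follow the strategy's policy.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(
                1, kernel_regularizer=tf.keras.regularizers.l2(1e-4))])
        optimizer = tf.keras.optimizers.SGD()

        # Reduction is NONE; we average over the global batch ourselves later.
        loss_obj = tf.keras.losses.MeanSquaredError(
            reduction=tf.keras.losses.Reduction.NONE)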
For regularization losses, we use the scale_regularization_loss API, which divides by the number of replicas so that, when you add up across all the replicas, you get something that scales with just the variables, not with how many replicas you have. By having an explicit call, we hope to reduce the confusion we saw with our earlier approach, where we tried to automatically scale losses by the number of replicas on the user's behalf. We compute the gradient using ordinary TensorFlow 2 APIs. This gradient is local to each replica and is then passed to the optimizer, which is distribute-aware and is going to deal with aggregating gradients, among other things, which I will go into in detail later.

Those first two slides of the custom training loop were demonstrating computation in cross-replica mode, and this last slide was computation in replica mode. In replica mode, we're operating on ordinary tensors, and we can use the full TensorFlow API to specify the computation that is going to be repeated on each replica device. In cross-replica mode, instead, you are operating on aggregate values, which are maps from the replica to the tensor or variable on that particular replica. In the future, we're going to add a logical device component in order to support model parallelism, where you actually split a model across multiple logical devices. We also have an update mode that is used inside the optimizer to update each variable. It runs code on whatever devices that variable resides on. In ParameterServerStrategy, this will be a single device, though maybe a different device for each variable, whereas in mirrored strategy, this will run on all the same devices as the computation.

Moving on with our example, here we're going to train a whole epoch. We currently recommend running this in graph mode, which we get with the tf.function decorator there at the top. We have tests, though, that verify that our API works in eager mode. You'll likely want the performance of running different replicas in parallel. If you want to do some per-step processing that requires eager, we recommend using tf.py_function. So we're going to iterate over our dataset. Right now, we are depending upon the call to experimental_distribute_dataset from the earlier slide to split the dataset across all the replicas. The plan is to do this automatically whenever you iterate inside a strategy scope. Note that this is particularly tricky in the multi-worker case. In the one-machine case, this is just splitting each batch, but with multiple machines, we want to do some sort of decentralized splitting so you're not getting the same input on different workers.

In the body of the loop, we transition between cross-replica and replica mode, which involves explicitly using strategy APIs. The replica step function from the earlier slide will be called in replica mode, once per replica, on different input shards. There's a tricky situation here when we're at the end of a dataset and don't have enough data to fill the batches on all replicas. In that situation, we need to pad the input with batch-size-zero inputs to make sure all replicas perform the same number of steps. This way, all-reduce doesn't freak out waiting for something from a replica that isn't running a step at all.
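Continuing that sketch (same placeholder model, optimizer, loss_obj, and dist_dataset), here is roughly what the per-replica step and the step loop look like. The talk puts a whole epoch inside tf.function; this variant wraps just the distributed step, and experimental_run_v2 is again the TF 2.0-era name for what later became Strategy.run:

    def replica_step(inputs):
        x, y = inputs
        with tf.GradientTape() as tape:
            predictions = model(x, training=True)
            per_example_loss = loss_obj(y, predictions)
            # Average by the global batch size, independent of replica count.
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
            # Regularization terms are scaled by the number of replicas.
            loss += tf.nn.scale_regularization_loss(model.losses)
        grads = tape.gradient(loss, model.trainable_variables)
        # The distribute-aware optimizer aggregates gradients across replicas.
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return per_example_loss

    @tf.function
    def distributed_step(inputs):
        per_replica_losses = strategy.experimental_run_v2(
            replica_step, args=(inputs,))
        # axis=0 averages over the batch dimension as well as the replicas,
        # turning a global batch of per-example losses into one scalar.
        return strategy.reduce(
            tf.distribute.ReduceOp.MEAN, per_replica_losses, axis=0)

    for inputs in dist_dataset:  # one epoch
        batch_loss = distributed_step(inputs)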
Note that the all-reduce operations we're doing inside the step are on gradients, and those gradients have the same shape as the variables, not a dimension that depends on the batch size. On those replicas with a batch-size-zero input, we're going to have a zero gradient, but at least it'll be the right shape. experimental_run_v2 returns a per-replica value combining the return value of the replica step from each replica. In this case, each replica returns a vector of per-example loss values with size equal to the per-replica batch size, and we then use the reduce API to average the loss into an ordinary tensor. By specifying axis=0, it averages across the batch dimension as well as across all the replicas, converting a global batch of loss values into a single scalar.

Lastly, here is a simple, standard outer loop. We iterate through all the epochs we're executing. It runs outside of the tf.function, in eager mode, so you have a lot of flexibility to run whatever logic you want. For example, you could put early stopping logic here. You can also checkpoint after each epoch, or maybe run an eval. This is completely straightforward, since mirrored variables implement the checkpoint saving protocol, so they save the same way as normal variables. We have tests that verify that the resulting checkpoints can be loaded by a non-distributed model, and vice versa.

I talked about using strategy.reduce at the end, after the experimental run call. There are some alternatives. strategy.concat-- not quite implemented yet-- is another way of getting values out in a way that doesn't really depend upon how the data was split up into different pieces for the data-parallel computation. You might also want to just get the results on the local worker. That's really important if you're going to do some further computation and you're in one of these between-graph settings where you have multiple main loops, and you don't want two main loops using the data from any other worker.

The last thing I'm going to cover is making a library that needs to be distribution-aware because it operates on [INAUDIBLE]-- what APIs you might use. The first, easiest thing to know about is tf.distribute.get_strategy, which is how you get a strategy. The important thing to know about it is that it always returns something implementing the strategy API, even if you're not inside a strategy scope, because there is a default strategy that does something moderately sensible even when you're not in some strategy scope with a specific configuration. So distribution-strategy-aware code is just code that uses this get_strategy API and does its work via those APIs. And most of the work is already done for you for the normal cases: as long as you're just implementing a new metric, optimizer, or loss, you just have to inherit from the base class, which has done all of the work to be distribution-enabled. There are new capabilities available to you, though, if you want to be distribution-aware, and I'll talk a little bit about those new APIs and options. But first, I'm just going to explain the implementation. For losses, we provide helpers for per-example and regularization losses that are distribute-aware.
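These helpers appear to correspond to tf.nn.compute_average_loss and tf.nn.scale_regularization_loss in the public API; a small sketch with made-up values:

    import tensorflow as tf

    per_example_loss = tf.constant([0.3, 0.5, 0.2, 0.4])  # hypothetical values
    regularization_losses = [tf.constant(0.01)]           # e.g. model.losses

    # With the global batch size supplied, the scaling is a constant and
    # weighs every example equally, even in a final partial batch.
    loss = tf.nn.compute_average_loss(per_example_loss, global_batch_size=64)

    # Without it, the helper uses per-replica batch size * number of replicas,
    # which slightly over-weights a final partial batch.
    loss_estimate = tf.nn.compute_average_loss(per_example_loss)

    # Regularization terms are divided by the number of replicas, so the sum
    # across replicas does not grow with the replica count.
    reg = tf.nn.scale_regularization_loss(regularization_losses)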
If you can supply the global batch size, there is no actual need to do anything distribute-specific: we can just scale the loss by a value that is constant for all batches, including the last partial batch, and weigh each example equally. Otherwise, we compute the per-replica batch size from the tensor shape and scale it by the number of replicas from the current strategy to get the global batch size. This might be slightly wrong in that it will weight the last partial batch slightly more, but it's the best we can do without knowing the global batch size.

Now going into Optimizer-- for Optimizer, we made much bigger changes. We're going to look at the apply_gradients call, which is passed parallel lists of gradients and variables. Again, we get the strategy, and then we do this thing called a merge_call. Now, merge_call is the sort of secret weapon we developed at the beginning of creating distribution strategy. It's basically our main tool for doing things that cross replicas while inside a per-replica call, so we can do synchronization or communication inside a merge_call. The way it works is that when the mirrored strategy is running something on each replica, it's actually running each of those functions in a separate thread. Each thread corresponds to one replica, but we only run one replica thread at a time. We use [INAUDIBLE] and so forth, so that the first replica runs until completion or until it gets to a merge_call. If it gets to a merge_call, we say, OK, pause that thread; run the next replica thread until it gets up to the same point in the code and reaches that merge_call. Then we repeat that until we've gotten all of the replica threads up to that merge_call point. We have args from each thread, and now we aggregate all the args produced on all those threads into per-replica values. Then we call the function that was passed to merge_call once, with these aggregate values from all the merge_calls. That function can now do things like reductions across all those replicas, and whatever it returns is then returned as the return value of all of the merge_calls. Then we resume the first replica thread until it finishes, and so on, and so forth.

So this distributed_apply function is the argument-- the thing that's inside the merge_call. It's only executed once, even though apply_gradients is executed for each replica. The grads-and-vars value here is now, instead of a list of variables and a list of gradients, a list of mirrored variables and a list of per-replica gradients. We want those per-replica gradients to be aggregated across all the replicas, so we do a reduction where we add them up. This batch reduce_to will reduce all of the gradients at once, but it will put each gradient, after aggregation, on the devices where the corresponding variable lives. So in the mirrored case, this is an all-reduce, but in the parameter server case, each gradient might be going to a variable living on a different parameter server. It takes the variable as the destination, so it knows where to put the aggregated gradient value. Then, with those reduced gradients, we call update, which calls this apply-gradient-to-update-variable function once per device where the variable is.
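A stripped-down sketch of the merge_call pattern just described-- not the real optimizer code-- in which a per-replica gradient is aggregated with reduce_to and then applied once per variable device via update. The fixed learning rate and the apply_update name are made up for illustration:

    import tensorflow as tf

    def apply_update(grad, var):
        # Called in replica mode, e.g. from inside strategy.experimental_run_v2.
        def merge_fn(strategy, per_replica_grad):
            # Runs once, in cross-replica mode, with the gradients from all
            # replica threads bundled into a per-replica value.
            reduced = strategy.extended.reduce_to(
                tf.distribute.ReduceOp.SUM, per_replica_grad,
                destinations=var)
            # update() runs the assignment once per device holding the
            # variable, with a mirrored (identical) gradient value.
            return strategy.extended.update(
                var, lambda v, g: v.assign_sub(0.1 * g), args=(reduced,))

        # merge_call pauses the replica threads and calls merge_fn once with
        # the aggregated arguments from every replica.
        return tf.distribute.get_replica_context().merge_call(
            merge_fn, args=(grad,))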
The gradient values here are now mirrored values, and update validates that they are mirrored values, so that we can be sure the update is the same across all copies of the variable. So this is that subset of the programming model-- the state transition diagram you can see here is restricted to just normal variable updates. This introduces another concept, which is the locality of the devices where all these things are running. Replica devices are the devices where we do most of the computation. Variable devices are the devices where a variable lives, which may be one or many, depending on whether it's mirrored or not. The reduce_to API is the bridge that gets us from one to the other. We also have a non-slot concept that's needed in order to match the behavior of the Adam optimizer. So we have the standard pattern for how we generally update state: merge_call takes tensors from each replica and produces per-replica containers; then reduce_to turns those per-replica containers into aggregate values that are mirrored; and then update takes the mirrored values and updates the variables. And we know, because they have the mirrored type, that the updates are identical.

I see that we're a little bit low on time, so I'm just going to breeze through this. The point here is that we've overloaded the operations on mirrored variables so that, for example, assign operations will do all of those steps for you, as long as you set an aggregation. That aggregation ends up being the reduce operation that we do. One of the new things you can do with distribution strategy, though, is opt into a different model for how we update the variables. Instead of syncing on write, like we do for mirrored variables, we can sync on read, which means these variables are totally independent at write time. We keep writing, writing, writing to them, assuming reads are rare. Then when we read, that's when we do the reduction to aggregate the value across all the different copies on the different replicas. These aren't trainable, but they're really great for at least a couple of cases we've seen: one is metrics, and the other is batch norm statistics that are used only in eval. So we have the synchronization and aggregation arguments, which we've added as new APIs in order to access these new capabilities. You can set, for example, synchronization to ON_READ to get this behavior for metrics and batch norm statistics. And the aggregation can be set even for mirrored variables-- really for both-- and it lets you say what reduction to do when you do operations. You don't need to set the aggregation, though, if you're only using optimizers; optimizers don't rely on this.

I'm just going to breeze through this. OK, so here's an example. Metrics just change the add_weight method to change the defaults, so that SUM is the default aggregation [INAUDIBLE] and ON_READ is the default synchronization. Then, when you implement a subclass, you'll see that we just compute the numerator and denominator in the update method, and those use variables created with add_weight. This just updates the per-replica values within each replica, and we may be doing a parallel eval on a bunch of different replicas.
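Here is a sketch of that sync-on-read pattern using the new synchronization and aggregation arguments, in the shape of a metric-style accumulator; the variable names and the toy update are illustrative, not the actual Keras metric implementation:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        # Sync-on-read: each replica writes its own copy independently, and
        # the components are only aggregated (summed here) when read in
        # cross-replica mode -- the pattern used by metrics and by batch
        # norm statistics that are only consumed at eval time.
        total = tf.Variable(
            0.0, trainable=False,
            synchronization=tf.VariableSynchronization.ON_READ,
            aggregation=tf.VariableAggregation.SUM)
        count = tf.Variable(
            0.0, trainable=False,
            synchronization=tf.VariableSynchronization.ON_READ,
            aggregation=tf.VariableAggregation.SUM)

    @tf.function
    def update(values):
        def replica_fn(v):
            total.assign_add(tf.reduce_sum(v))               # per-replica write
            count.assign_add(tf.cast(tf.size(v), tf.float32))
        strategy.experimental_run_v2(replica_fn, args=(values,))

    update(tf.constant([1.0, 2.0, 3.0]))
    # Reading outside the replica context triggers the cross-replica
    # aggregation; only then is the mean computed.
    mean = total / count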
And then at the end, we have this result function, and only then do we aggregate across replicas. So we get a total numerator and a total denominator, divide, and we get a value. You'll notice that we didn't have to use anything strategy-specific-- it's purely the operator overloading; all the logic needed for distribution is in the parent class. You could imagine layers also wanting to use all-reduce. Here's some code that will do it. This does introduce synchronization between replicas and could be used, optionally, for batch norm layers. Batch norm layers are one of the very few cases where you have interaction between batch elements, so if you don't have enough examples on each replica, you want to communicate across replicas to get good enough batch norm statistics. They also use sync-on-read for eval, as mentioned earlier. A few more obscure methods-- we generally hide these in strategy.extended so users don't have to see them: variable_created_in_scope is used for error checking by Keras, and colocate_vars_with is used for slot variables in optimizers. That's it. Sorry I don't have a great conclusion, except that if you have more to ask about this, we have a component label on GitHub called dist-strat. Ask away. [APPLAUSE]

AUDIENCE: I have a question. Say I want to implement a custom layer. When I call add_weight, what do I need to care about to make it work with distribution strategy?

SPEAKER: Usually nothing. The defaults are all set up in the base class, so unless there's something weird about your layer where it has interaction between different batch elements, like in the batch norm case, your layer probably just doesn't have to care.

AUDIENCE: I see.

SPEAKER: Because all that logic is in the base class.

AUDIENCE: I see. And does mirrored strategy only have mirrored variables?

SPEAKER: Right now, you will get mirrored variables by default if you use mirrored strategy, but you can opt into these weird sync-on-read variables by setting options away from their default values.

AUDIENCE: OK, is optimizer.iterations [INAUDIBLE] sync-on-read variable?

SPEAKER: I think it's supposed to be a mirrored variable.

AUDIENCE: It is a mirrored variable.

SPEAKER: It should be. It should be a mirrored variable, unless there's a bug.

AUDIENCE: I see.

AUDIENCE: It is a mirrored variable.

SPEAKER: OK, great. We have our expert here to tell us.

AUDIENCE: Cool.

AUDIENCE: Do parameter servers work with accelerators? Because you can imagine, in the limit, that your accelerators have really, really high interconnect-- which I know a lot of them do and are moving towards-- such that mirrored would be too conservative, and you'd like to, say, round-robin your convolution variables around the accelerators. Like, can you do that, or is that planned?

SPEAKER: If you have exotic ideas, come talk to us. Right now, we are more focused on more basic use cases. For accelerators with good interconnect, we see that the all-reduce style has been more efficient. The problem you might run into, though, is either running out of memory or, if your steps take widely different times-- maybe because you're doing an RNN with widely varying step lengths-- the synchronization overhead of doing everything in lockstep might be a high cost.

AUDIENCE: In particular, I was thinking about the memory of having [INAUDIBLE] copy on [INAUDIBLE].
SPEAKER: Yeah, it's sort of hard to avoid that, because you need to-- well, you can use model parallelism to reduce the memory. But Mesh-TensorFlow is probably a better path there.