OSCAR RAMIREZ: All right. Well, thank you, everyone. So I'm Oscar Ramirez. This is Sergio. And we're from the TF-Agents team. And we'll talk to you guys about our project. So for those of you that don't know, TF-Agents is our reinforcement learning library built in TensorFlow. And it's hopefully a reliable, scalable, and easy-to-use library. We packaged it with a lot of Colabs, examples, and documentation to try and make it easy for people to jump into reinforcement learning. And we use it internally to actually solve a lot of difficult tasks with reinforcement learning. In our experience, it's been pretty easy to develop new RL algorithms. And we have a whole bunch of tests, making it easy to configure and reproduce results. A lot of this wouldn't be possible without everyone's contribution, so I just want to make it clear, this has been a team effort. There have been a lot of 20 percenters and external contributors. People have come and gone within the team, as well. And so this is right now the biggest chunk of the current team that is working on TF-Agents. With that, I'll let Sergio talk a bit more about RL in general. SERGIO GUADARRAMA: Thank you, Oscar. Hi, everyone. So we're going to focus a little more on reinforcement learning and how this is different from other kinds of machine learning-- unsupervised learning, supervised learning, and other flavors. Here are three examples. One is robotics, one is games, and the other one is a recommendation system. Those are clear examples where you can apply reinforcement learning. So the basic idea is-- if you were to try to teach someone how to walk, it's very difficult, because it's really difficult for me to explain to you what you need to do to be able to walk-- coordinate your legs, in this case, of the robot-- or even for a kid. How you teach someone how to walk is really difficult. They need to figure it out themselves. How? Trial and error. You try a bunch of times. You fall down. You get up, and then you learn as you're falling. And that's basically-- you can think of it like the reward function. You get a positive reward or a negative reward every time you try. So here, you can see also, even with the neural algorithms, this thing is still hopping, no? After a few trials of learning, this robot is able to move around, wobble a little bit, and then fall. But now it can control the legs a little more. Not quite walk, but doing better than before. After it's fully trained, the robot is able to walk from one place to another, basically go to a specific location, and all those things. So how this happens is basically summarized in this code. Well, there's a lot of code, and the presentation will go over the details. Basically, this summarizes all the pieces you will need to be able to train a model like this, and we will go into the details. So what is reinforcement learning, and how is that different, basically, from other cases? The idea is we have an agent that is trying to play, in this case, or interact with an environment. In this case, it's like Breakout. So basically, the idea is you need to move the paddle to the left or to the right to hit the ball and break the bricks on the top. So the environment generates some observations that the agent can observe. It can basically process those observations and generate a new action-- like whether to move the paddle to the left or to the right. And then, based on that, it will get some reward. In this case, it will be the score.
And then, using that information, it will learn from this environment how to play. So one thing with this that I think is critical for people who have done a lot of supervised learning is the main difference between supervised learning and reinforcement learning. For supervised learning, you can think of it as, for every action that you take, they give you a label. An expert will have labeled that case. That is simple. It'll give you the right answer. For this specific image, this is an image of a dog. This is an image of a cat. So you know what is the right answer, so every time you make a mistake, I will tell you what is the right answer to that question. In reinforcement learning, that doesn't happen. Basically, you are playing this game. You are interacting with the game. You [? bench ?] a batch of actions, and you don't know which one was the right action, what was the correct action, and what was the wrong one. You only have this reward function that tells you, OK, you are doing kind of OK, or you are not doing that well. And based on that, you need to infer, basically, what other possible actions you could have taken to improve your reward, or maybe you're doing well now, but maybe later you do worse. So it's also a dynamic process going on over here. AUDIENCE: How is the reward function different from the label? SERGIO GUADARRAMA: So I think the main difference is this. The reward function is only an indicator that you are doing well or doing wrong, but it doesn't tell you what is the precise action you need to take. The label is more like the precise outcome of the model. You can think, in supervised learning, I tell you what is the right action. I tell you the right answer. If I give you a mathematical problem, I'm going to say, x is equal to 2. That is the right answer. If I tell you, you are doing well, you don't know what was the actual answer. You don't know if it was x equal 2 or x equal 3. If I tell you it's the wrong answer, you're still not going to know what was the right answer. So basically that's the main difference between having a reward function that only indicates-- it gives you some indication about whether you are doing well or not, but doesn't give you the proper answer-- or the optimal answer, let's say. AUDIENCE: Is the reward better to be very general instead of very specific? SERGIO GUADARRAMA: Mhm. AUDIENCE: Like, "you are doing well," instead of "what you are moving is the right direction to go." OSCAR RAMIREZ: It depends quite a bit on the environment. And there is this whole problem of credit assignment. So trying to figure out what part of your actions were the ones that actually led to you receiving this reward. So if you think about the robot hopping, you could give it a reward that might be its current forward velocity. And you're trying to maximize that, and so the robot should learn to run as fast as possible. But maybe bending the legs down so you can push yourself forward will help you move forward a lot faster, but maybe that action will actually move you backwards a little bit. And you might even get punished instantaneously for that action, but it's part of the whole set of actions during an episode that will lead you to moving forward. And so the credit assignment problem is like, all right, there is a set of actions for which we might have even gotten negative reward, but we need to figure out that those actions led to positive reward down the line. And the objective is to maximize the discounted return. So a sum of rewards over a number of time steps.
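(As an aside, here is a minimal sketch of that objective in plain Python, with an assumed discount factor gamma -- this is not code from the talk, just an illustration of a discounted sum of rewards.)

```python
def discounted_return(rewards, gamma=0.99):
    # Sum of rewards, each discounted by gamma per time step into the future.
    return sum(reward * gamma**t for t, reward in enumerate(rewards))

# A short episode where early negative rewards are outweighed by later positive ones.
print(discounted_return([-0.1, -0.1, 1.0, 1.0, 1.0]))  # ~2.71
```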
SERGIO GUADARRAMA: Yeah, that's a thing that's a critical part. We care about long-term value. It's not totally immediate reward. It's not only you telling me plus 1. It's not so important, because I want to know not if I'm playing the game right now. If I'm going to win the game at the end, that's where I really care. I am going to be able to move the robot to that position. What we're in the middle, sometimes those things are OK. Some things are not bad. But sometimes, I make an action. Maybe I move one leg and I fall. And then I could not recover. But then maybe it was a movement I did 10 steps ago would make my leg wobble. And now how do I connect which action made me fall. Sometimes it's not very clear. Because it's multiple actions-- in some cases, even thousands of actions-- before you get to the end of the game, basically. You can think also that, in the games, now that I've gotten all those things, is this [? stone ?] really going to make you lose? Probably there's no single [? stone ?] that's going to make you lose, but 200 positions down the line, that [? stone ?] was actually very critical. Because that has a ripple effect on other actions that happen later. And then to you need to be able to estimate this credit assignment for which actions I need to change to improve my reward, basically, overall. So I think this is to illustrate also a little farther, different models of learning. What we said before, supervised learning is more about the classical classroom. There's a teacher telling you the right answer, memorize the answer, memorize that. And that's what we do in supervised learning. We almost memorize the answers with some generalization. Mostly that's what we do. And then in reinforcement learning, it's not so much about memorize the answer. Because even if I do the same actions, in a different setting, if I say, OK, go to the kitchen in my house, and I say, oh, go to the left, second door to the right. And then I say, OK, now go to [? Kate's ?] house and go to the kitchen. If you apply the same algorithm, you will get maybe into the bathroom. Like, you'll go two doors to the right, and then you go to the wrong place. So even memorizing the answer is not good enough. You know what I mean? You need to adapt to the environment. So that's what makes reinforcement learning a little more challenging, but also more applicable to many other problems in reality. You need to play around. You need to interact with the environment. There's no such a thing as, I can think about what's going to be the best plan ahead of time and never play with the environment. We tried to write down some of these things that we just mentioned, about that you need to interact with the environment to be able to learn. This is very critical. If you don't interact, if you don't try to walk, if the robot doesn't try to move, it cannot learn how to walk. So you need to interact with the environment. Also it will put you in weird positions and weird places, because you maybe ended up in the end of a corridor, or in a [INAUDIBLE] position, or maybe even unsafe cases. There's another research also going on about safe RL. How do I explore the world such as I don't break my robot? Like, you make a really strong force, you may break the robot. But probably you don't want to do that. Because you need to keep interacting with the world. And also, we collect data while we're training. So as we're learning, we're collecting new data, fresh data all the time. 
So the data set is not fixed like in supervised learning. We typically assume in supervised learning that we have an initial data set at the beginning, and then you just iterate over it over and over. And here, as you learn, you get fresh data, and then the data changes. The distribution of the data changes as you get more data. And you can see that also, for example, in a labyrinth. You don't know where you're going. At the beginning, you're probably lost all the time. And you maybe end up always in the same places. And maybe there are different parts of the labyrinth you never explore. So you don't even know about that. So you cannot learn about it, because you have never explored it. So exploration is very critical in RL. It's not only that you want to optimize and exploit the model that you have. You also need to explore. Sometimes, you actually need to do what you think is the wrong thing, which is basically go to the left here, because you've never been there, just to basically explore new areas. Another thing is, like what we said before, nobody's going to tell you what is the right answer. And actually, in many cases there's not a right answer. There are multiple ways to solve the problem. The reward only gives you an indication whether you are going down the right path or not. But it doesn't tell you what is the right answer. To train these models, we use a lot of different surrogate losses, which means also they are not actually correlated with performance. Usually, it's very common, and you will see in a moment-- when the model is learning, the loss goes up. When the model is not learning, the loss goes down. So basically, the loss going down is usually a bad sign. If your losses stay at zero, you are learning nothing. So you will see in a second how debugging these models becomes much trickier than in supervised learning. In supervised learning, we look at the loss, our losses go down. Beautiful. And you take [INAUDIBLE] and the loss always goes down. You do something wrong, the loss goes up. Otherwise, the loss keeps going down. In RL, that's not the case. First, we have multiple losses to train. And many of them actually don't correlate with performance. They will go up and down-- it looks almost random. So it's very hard to debug or tune these algorithms because of that. You actually need to evaluate the model. The losses are not enough to give you a good sense of whether you are doing well or not. In addition to that, that means we require multiple optimizers, multiple networks, multiple ways to update the variables, and all those things. Which means the typical supervised learning training loop or model fit doesn't fit for RL, basically. There are many ways we need to update the variables. Not all of them use optimizers; for some of them we have multiple optimizers with different frequencies, in different ways. Sometimes we optimize one model against a different model and things like that. So basically, how we update the models is very different from the typical way of supervised learning, even though we use some supervised learning losses, basically. Some of the losses are basically supervised-- basically regression losses, something like that, or cross-entropy. So we use some of those losses, but in different ways, basically. So probably, this graph is not very surprising to most people who have used supervised learning in the last few years. It used to be different in the past. But now with neural networks, it usually always looks like this. You start training your model, your classification loss goes down.
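(As an aside before continuing with the loss curves: the talk doesn't show this code, but here is a hedged sketch of what "multiple networks with multiple optimizers, one model optimized against another" can look like -- a generic actor/critic-style update with two separate optimizers. The networks and losses are illustrative placeholders, not TF-Agents internals.)

```python
import tensorflow as tf

# Hypothetical actor and critic networks, just small Keras models for illustration;
# in TF-Agents these would be tf_agents.networks instances.
obs_dim, act_dim = 4, 2
actor_net = tf.keras.Sequential([tf.keras.layers.Dense(32, activation='relu'),
                                 tf.keras.layers.Dense(act_dim)])
critic_net = tf.keras.Sequential([tf.keras.layers.Dense(32, activation='relu'),
                                  tf.keras.layers.Dense(1)])

actor_optimizer = tf.keras.optimizers.Adam(1e-4)    # separate optimizers, possibly
critic_optimizer = tf.keras.optimizers.Adam(1e-3)   # updated at different frequencies

def train_step(observations, actions, td_targets):
  # Critic update: regression of Q(obs, action) toward a TD-style target.
  with tf.GradientTape() as tape:
    q_values = critic_net(tf.concat([observations, actions], axis=-1))
    critic_loss = tf.reduce_mean(tf.square(q_values - td_targets))
  grads = tape.gradient(critic_loss, critic_net.trainable_variables)
  critic_optimizer.apply_gradients(zip(grads, critic_net.trainable_variables))

  # Actor update: one model optimized against another (the critic), with its own optimizer.
  with tf.GradientTape() as tape:
    new_actions = actor_net(observations)
    actor_loss = -tf.reduce_mean(
        critic_net(tf.concat([observations, new_actions], axis=-1)))
  grads = tape.gradient(actor_loss, actor_net.trainable_variables)
  actor_optimizer.apply_gradients(zip(grads, actor_net.trainable_variables))
  return critic_loss, actor_loss

# Example call with random batch data of the assumed shapes:
# train_step(tf.random.normal([8, obs_dim]), tf.random.normal([8, act_dim]), tf.random.normal([8, 1]))
```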
Usually your regularization goes up, because your [INAUDIBLE] is actually learning something, so they are moving. But your total loss-- the overall total loss-- still goes down. The regularization loss tends to stabilize, usually, or go down after learning. But basically, usually your cross-entropy loss or total loss is a really good guide that your model is learning. And if the loss doesn't go down, then your model is not learning, basically. You know that. I still remember when I was outside Google and trying to train a neural net, my first neural net. And I couldn't get the loss down. The loss was stable, and initialization couldn't get it back down. And then I needed to ask Christian [? Szegedy, ?] like, what do you do? How did you do it? He's like, oh, you need to initialize the variables this way. You have to do all these extra tricks. And when I did all the tricks he told me, all of a sudden the loss started going down. And once the loss started going down, the model started learning very quickly. This is what it looks like in many cases in RL. We have the actor loss that's going up. In this case, it's actually good, because it's learning something. We have this alpha loss, which is almost like noise around zero; it fluctuates quite a bit. And the critic loss in this case just collapsed, basically. At the beginning, it was very high, and all of a sudden it got very small, and then it doesn't move from there. But this model is actually good. This model is learning well. [CHUCKLES] So you see all these-- and there's not like a total loss. You cannot aggregate these losses, because each one of these losses is optimizing a different part of the model. So we optimize each one of them individually. But in other cases you'll see the losses, and usually the loss will go up, especially sometimes when you're learning something, because you can think about it this way. You are trying to go through the environment, and you see a new room you've never seen. It's going to be very surprising for the model. So the model is going to try to fit this new data, and it's basically going to be out of the distribution. So the model is going to say, I don't know, this looks really different from everything I've seen before. So the loss goes up. When it basically learns about this new data, then the loss will go down again. So it's very common that we have many patterns where the loss goes up and down as the model starts learning and discovers more rooms and more spaces in the environment. AUDIENCE: But how do we know the model is doing well if we don't-- SERGIO GUADARRAMA: So we need to look basically at the reward. So the other function that we said-- we actually compute the [INAUDIBLE] reward. And then basically we take the model, run it through the environment, and compute how well it's performing. So the loss itself doesn't tell us that. AUDIENCE: You're talking about not the rewards during the training, but a separate reward where you-- SERGIO GUADARRAMA: You can do both. You can do both. You can compute the reward during training. And that already gives you a very good signal. AUDIENCE: But during the training, it would be misleading. Because if you haven't explored something, then you won't see that it wasn't really good. SERGIO GUADARRAMA: It's still misleading, exactly. Yeah. So we usually do both.
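(A minimal sketch of that kind of separate evaluation -- averaging the return over many episodes by running the policy through the environment, following the pattern used in the TF-Agents tutorials. The environment and policy objects are assumed to already exist, e.g. a TFPyEnvironment and a greedy policy.)

```python
def compute_avg_return(environment, policy, num_episodes=100):
  """Run the policy for several episodes and average the total (undiscounted) return."""
  total_return = 0.0
  for _ in range(num_episodes):
    time_step = environment.reset()
    episode_return = 0.0
    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return
  return total_return / num_episodes
```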
OSCAR RAMIREZ: And it's even more deceiving, because when you have a policy that you're using to collect data to train on, you will, most of the time, have some form of exploration within that. Every 10 steps you'll do a random action, and that will lead to wildly different rewards over time. AUDIENCE: But why is it not misleading even if you do it separately from training? Because ultimately, if your policy is such that it doesn't really explore much, it will always-- when you throw that policy into a test environment, and you no longer modify it, whatever, but it might still-- if the policy is just very naive and doesn't want to explore much, it would look great, because it does everything fine. But how would you know that it actually hasn't left-- OSCAR RAMIREZ: So when we're doing evaluations, we want to exploit what we've learned. So at that point, we're trying to use this to complete the task that we're trying to accomplish by training these models. And so there, we don't need to explore. Now we're just trying to exploit what we've learned. AUDIENCE: But if it's not ready to react to certain things that-- like, if it hasn't explored the space so that in common situations it would still do well, but it hasn't explored it enough that if it encounters some issues it doesn't know what to do, then that would not be really reflected by the reward. OSCAR RAMIREZ: Yeah. So you need to evaluate over a certain number of episodes. And Sergio has a slide on like-- SERGIO GUADARRAMA: Like, maybe-- probably what you say. Like, actually, evaluating once is not enough. We usually evaluate it maybe 100 times, from different initial conditions, all of that. And then we average. Because it's true. It could be, you evaluate once, maybe it looks fine. You never went to the wrong place. You never fell off the cliff. You're totally fine. You evaluate 100 times, one of them will go off the cliff. Because there was going to be one situation [INAUDIBLE] as well. AUDIENCE: Also, do you mind clarifying the three types of losses, what they correspond to? SERGIO GUADARRAMA: So basically, the actor loss here corresponds to this policy that is acting in the environment. Like, I need to make a decision about which action to take. So we have a model which is basically saying, which action am I going to take right now? Am I going to move the paddle to the left or to the right? So that will be your actor. And we have a loss to train that model. Then the critic loss is slightly different. It's going to say, OK, if I'm in this situation and I were to perform this action, how good will that be? So I can decide, should I go right, or should I go left? So it's trying to give me a sense, is this action good in this state? And then basically, that's what we call the critic. And then usually, the critic is used to train the actor. So the actor will say, oh, I'm going to go to the right. And the critic will say, oh, if you go to the right, that's really bad-- because I know, I give you a negative score-- so you should go to the left. But then the critic itself will learn, basically, by seeing these rewards that we [? observe ?] during training. That gives us basically this [? better ?] reward that the critic can learn from. So the critic is basically regressing to those values. So that's the loss for the critic. And in this case, this alpha loss is basically how much exploration versus exploitation I should do. It's like, how much entropy do I need to add to my model in the actor?
And usually, you want to have quite a bit at the beginning of learning. And then when you have a really good model, you don't want to explore that much. So this alpha loss is basically trying to modulate how much entropy I want to add in my model. AUDIENCE: So I have often seen the entropy going up during training. But why is the actor loss in your example also constantly going up during training? SERGIO GUADARRAMA: In the actor loss? AUDIENCE: The actor loss. OSCAR RAMIREZ: Yeah. So basically, what happens is, as I mentioned, the actor loss is trained based on the critic also. So basically, the actor is trying to predict which actions it should take. And the critic is trying to criticize, this is good, this is bad. So the critic is also moving. So as the critic learns and gets better at scoring whether this is a good action or not, the actor needs to adapt to that. So you can think of this as a little bit like a game going on. You know, it's not exactly a game, because they don't compete against each other. But it's like a moving target. And sometimes, the better the critic, the less the actor needs to move around. Usually it stabilizes. The actor loss tends to stabilize way more than the critic loss. The critic loss I have seen in other cases-- this one is very stable. But in many other cases, the critic loss goes up and down much more substantially. And going back to the question that you asked before about how we know we're doing well-- because what I told you so far is, there are all these losses that don't correlate. Until we evaluate, we actually don't know how well we are doing. And even more profound: if you look at the graph on the left, there are actually two curves, the same algorithm trying to solve the same task-- the orange and the blue. Higher is better-- higher return like this means you are getting better performance-- and the orange one is actually statistically much better than the blue one. But the only difference between these two runs is the random seed. Everything else is the same. It's the same code, the same task. Everything is the same. The only thing that changed is the random seed. It's basically how the model was initialized. AUDIENCE: The random seed for the training, or the random seed for the evaluation? OSCAR RAMIREZ: The random seed for the training. Yeah. And then for the evaluation, we will usually run probably-- I don't remember-- probably 100 different random seeds, so for every time that you're evaluating here, you would run 100. So to tackle this, what we did-- this is work with Stephanie-- we were like, can we actually measure how reliable an algorithm is? Because RL algorithms are not very reliable, and it's really hard to compare one algorithm to another, one task to another, and all those things. So we basically did a lot of work. And we have a paper and the code available to basically measure these things. Like, can I statistically measure, is this algorithm better than this one? And not only is it better, is it reliable? Because if I train 10 times and I get 10 different answers, maybe one of them is good. But it's not very reliable. I cannot apply it to a real problem, because every time I train, I get a very different answer. So basically, the broader these curves are, the less reliable it is, because of what I will get every time I train-- I think for this one we trained 30 different times-- and then you see some algorithms will have broader bands, and some others will have narrow bands.
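(The reliability paper and its code are the real reference here; purely as a rough stand-in for the idea, here is a hedged NumPy sketch of measuring the across-run spread of training curves from different seeds. The data below is synthetic.)

```python
import numpy as np

# Hypothetical data: returns for 30 training runs (seeds), evaluated at 100 points each.
returns = np.random.RandomState(0).normal(size=(30, 100)).cumsum(axis=1)

# Across-run dispersion at each evaluation point: the inter-quartile range over seeds.
# Narrower bands (smaller IQR) suggest a more reliable algorithm on this task.
iqr_across_runs = np.percentile(returns, 75, axis=0) - np.percentile(returns, 25, axis=0)

# Median training curve across seeds, often reported alongside the dispersion band.
median_curve = np.median(returns, axis=0)
print(iqr_across_runs[-1], median_curve[-1])
```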
So the algorithm that have narrow bands are more reliable. So we have ways to measure those, different metrics. AUDIENCE: But don't you only care about the final point? Why would you care about the intermediate points? SERGIO GUADARRAMA: You care about both, because let's think about it like, for example, if you cannot reliably get the final point, it's not good. If one algorithm say-- we have some algorithms that do that. It's not here, because they are so bad. Like only one in 100 that will get a really high number. You train 100 times, one of them will be really good, 99 will be really bad. So the question of which algorithm do you want to use for your model? One that 1 in 100 times you run will give you a good answer, and it would be really good? Or some one which is maybe not as good, but consistently will give me maybe 90% of the other one? So basically, we provide different metrics so you can measure all those different things. But be mindful of what you choose. The final score is not the only thing that you care, usually, for comparing algorithms. You just want a policy, like you just want to solve this problem, yeah, the final score is the only thing you care. But if we want to compare algorithms, I want to compare, can I apply this algorithm to a new task? If I need to run it 100 times every time I change the task, it's not going to be a very good, very reliable algorithm. OK. I think we're back to Oscar. OSCAR RAMIREZ: Cool. So now that we saw all the problems, let's see what we actually do in TF-Agents, try and address and make it possible to play with these things. So to look at a bigger picture of the components that we have within TF-Agents, we have a very strict separation of how we do our data collection versus how we do our training. And this has been mostly out of necessity, where we need to be able to do data collection in a whole bunch of different types of environments, be it in some production system, or on actual real robots, or in simulations. And so we need to be able to somehow deploy these policies that were being trained by these agents, interact with this environment, and then store all this data so that we can then sample it for training. And so we started looking first at what do the environments actually look like. And if you look at RL and a lot of the research, there is OpenAI Gym and a lot of other environments available through that. And so for TF-Agents, we make all these available and easy to use within the library. This is just a sample of the environments that are available. And so defining the environments, we have this API, where we can define the environment. Let's for example think, what happens if we want to define Breakout? The first thing that you need to do is define what your observations and actions are going to look like. This comes a little bit back from when we started when we were still in TF 1, and we really needed these information for building the computation graph. But it's still very useful today. And so what these specs, they're basically nested structures of TensorFlow specs that fully define the shapes and types of what the observations will look like and what the actions will look like. And so we think, specifically for Breakout, maybe the observation will be the image of the game screen. And the actions will probably be moving the paddle left, moving it right, and maybe firing, so that you can actually launch the ball. 
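(As a hedged sketch of what those specs could look like for a Breakout-style environment -- the exact shapes and action meanings here are illustrative assumptions, not the actual Atari suite configuration.)

```python
import numpy as np
from tf_agents.specs import array_spec

# Observation: the RGB game-screen image (the shape is an illustrative assumption).
observation_spec = array_spec.BoundedArraySpec(
    shape=(210, 160, 3), dtype=np.uint8, minimum=0, maximum=255, name='observation')

# Action: a single integer, e.g. 0 = fire/launch, 1 = move left, 2 = move right
# (also an assumption for illustration).
action_spec = array_spec.BoundedArraySpec(
    shape=(), dtype=np.int32, minimum=0, maximum=2, name='action')
```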
So once you've defined what your data is going to look like, there's two main methods and environments that you have to define as a user-- how the environment gets reset, and how the environment gets stepped. And so a reset will basically initialize the state of the environment and give you the initial observation. And when you're stepping, you'll receive some action. If the state of the environment is that we reach the final state, it will automatically reset the environment. Otherwise, it will use that action to transition from your current state to a next state. And this will give you the next state's observation and some reward. And we encapsulate this into a time step that includes that kind of information. And so if we're wanting to play Breakout, we would create an instance of this environment. We'll get some policy, either scripted or from some agent that we're training. And then we would simply iterate to try and figure out, all right, how well are we doing over an episode? This is basically a simplification of what the code would look like if we were trying to evaluate how good a specific policy is on some environment. In order to actually scale and train this, it means that we actually have to be collecting a lot of data to be able to train on these environments and with these methods. And so we provide the tooling to be able to parallelize this. And so you can create multiple instances of this environment and collect data in a batch setting, where we have this TensorFlow wrapper around the environment that will internally use NumPy functions to interact with the Python environment, and will then batch all of these instances and give us batched time steps whenever we do the reset. And then we can use the policy to evaluate and generate actions for every single instance of this environment at the same time. And so normally when training, we'll deploy several jobs that are doing collection in a bunch of environments at the same time. And so once we know how to interact with the environment, you can think of the driver and the observer. These are basically like a For loop. There's an example down the line. But all of that data will be collected somewhere. And in order to do training, what we do is we rely on the data set APIs to be able to sample experience out of the data sets that we're collecting. And the agent will be consuming this experience and will be training the model that it has. In most situations, it's a neural network. In some of the algorithms, it's not even a neural network, in examples like bandits. And so we're trying to train this learnable policy based purely on the experience, that is, mostly the observations that we've done in the past. And what this policy needs to do is, it's a function that maps from some form of an observation to an action. And that's what we're trying to train in order to maximize our long-term rewards over some episode. And so how are these policies built? Well, first we'll have to define some form of network to back it or to generate the model. In this case, we inherit from the Keras networks and add a couple of utility things, especially to be able to generate copies of these networks. And here will basically define, all right, we'll have a sequential model with some conv layers, some fully connected layers. 
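(Before the network details, here is a minimal sketch of the environment API just described -- a py_environment.PyEnvironment subclass with _reset and _step. The toy dynamics are made up purely for illustration, and wrapping it in a TFPyEnvironment gives the batched TensorFlow version mentioned above.)

```python
import numpy as np
from tf_agents.environments import py_environment, tf_py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts

class ToyEnv(py_environment.PyEnvironment):
  """Toy environment: the 'state' is a counter, and the episode ends after 10 steps."""

  def __init__(self):
    super().__init__()
    self._action_spec = array_spec.BoundedArraySpec(
        shape=(), dtype=np.int32, minimum=0, maximum=1, name='action')
    self._observation_spec = array_spec.BoundedArraySpec(
        shape=(1,), dtype=np.float32, minimum=0, name='observation')
    self._state = 0

  def action_spec(self):
    return self._action_spec

  def observation_spec(self):
    return self._observation_spec

  def _reset(self):
    # Initialize the environment state and return the first observation.
    self._state = 0
    return ts.restart(np.array([self._state], dtype=np.float32))

  def _step(self, action):
    # Transition to the next state given the action; return observation and reward.
    self._state += 1
    observation = np.array([self._state], dtype=np.float32)
    if self._state >= 10:
      return ts.termination(observation, reward=1.0)
    return ts.transition(observation, reward=0.0)

# Wrapping it gives the batched, TensorFlow-friendly version described above.
tf_env = tf_py_environment.TFPyEnvironment(ToyEnv())
```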
And then if this was, for example, for DQN, we would have a last layer that would give us a predicted Q value, which is basically predicting how good this action is at a given state, and would tell us with what probabilities we should be sampling the different kinds of actions that we have. And then within the call method, we'll be taking some observation. We'll iterate over our layers and generate some predictions that we want to use to generate actions. And then we have this concept of a policy. And the policy, what it will do is, it will know-- given whatever algorithm we're trying to train, the type of network that you're training might be different. And so in order to be able to generalize across the different algorithms or agents that we're implementing, the concept of the policy will know, given some set of networks, how to actually use these to take observations and generate actions. And normally, the way we do this is that we have a distribution method that will take this time step and maybe some policy state-- when you're training some recurrent models, for example-- and we'll be able to apply this network and then know how to use the output of the network in order to generate some form of distribution-- in some agents, this might be a deterministic distribution-- that we can then sample from. And then when doing data collection, we might be sampling from this distribution. We might add some randomness to it. When we're doing evaluations, we'd be doing a greedy version of this policy, where we'll take the mode of this distribution in order to try to exploit the knowledge that we've gathered, and try to maximize our return over the episodes when evaluating. And so one of the big things with 2.0 is that we can now rely on saved models to export all these policies. And this made it a lot easier to generalize and be able to say, oh, hey, now it doesn't matter what agent you used to train. It doesn't matter how you generated your network. You just have the saved model that you can call action on. And you can deploy it onto your robots, production, wherever, and collect data for training, for example, or for serving the trained model. And so within the saved model, we generate all these concrete functions, and save and expose an action method and a way of getting an initial state-- again, for the case where we have recurrent models. And we also get the training step, which can be used for annotating the data that we're collecting. And right now, the one thing that we're still working on, or that we need to work on, is that we rely on TensorFlow Probability for a lot of the distribution stuff that we use. But this is not part of core TensorFlow. And so saved models can't return distributions easily. And so we need to work on that a little bit. The other thing that we do is that we generate different versions of the saved model. Depending on whether this policy will be used for data collection versus for evaluation, it'll have baked in whatever exploration strategy we have within the saved model. And right now, I'm working on making it so that we can easily load checkpoints into the saved model and update the variables. Because for a lot of these methods, when we're generating the saved models, we have to do this very frequently. But for the saved model, the computation graph that it needs to generate is the same every step.
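(A hedged sketch of that export path using tf_agents.policies.policy_saver, along the lines of the public tutorials -- the agent object and the directory names here are assumed placeholders.)

```python
import tensorflow as tf
from tf_agents.policies import policy_saver

# Export the greedy evaluation policy and the collect policy as separate saved models.
eval_saver = policy_saver.PolicySaver(agent.policy)
collect_saver = policy_saver.PolicySaver(agent.collect_policy)
eval_saver.save('exported_policies/eval')
collect_saver.save('exported_policies/collect')

# Later -- e.g. on a collection job -- reload and use it without knowing which
# agent or network produced it.
loaded_policy = tf.saved_model.load('exported_policies/collect')
policy_state = loaded_policy.get_initial_state(batch_size=1)
# time_step would come from the environment being driven:
# action_step = loaded_policy.action(time_step, policy_state)
```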
And so right now we're saving a lot of extra stuff that we don't need to, and so just being able to update it on the fly-- but overall, this is much easier than what we had to do in TF 1, where we were stashing placeholders in collections, and then being able to rewire how we were feeding data into the saved models. AUDIENCE: So one question about-- you talk about distribution part in saved model. So if your function fit into saved model, the save is already a distributed function, then it should be able to support-- like, you can dump-- OSCAR RAMIREZ: So we can have the distributions within it. But we can't easily look at those distributions and modify them when we deploy it. Like, the return of a saved model function cannot be a distribution object. It can only be the output of it. SERGIO GUADARRAMA: It can only be a tensor, basically. The only outputs that the concrete functions take in and out are tensors. It cannot be an actual distribution, not yet. Because the other thing, sometimes we need to do sampling logics. We need to do functions that belong to the distribution object. AUDIENCE: I see. SERGIO GUADARRAMA: So we do some tricks in replay buffer and everything, basically, that it's stored information that we need to reconstruct the distribution back. I know this object is going to be a categorical distribution, and because I know that then I can basically get the parameters of the categorical distribution, rebuild the object again with these parameters. And now I can sample, I can do all these other things from the distribution. Through the saved model, it's still tricky. I mean, we can still save that information. But it's not very clear how much information should be part of the saved models, or it's part of us basically monkey patching the thing to basically get what we need. OSCAR RAMIREZ: And the other problem with it is that, as we export all these different saved models to do data collection or evaluation, we want to be able to be general to what agent trained this, what kind of policy it really is, and what kind of network is backing it. And so then trying to stash all that information in there can be tricky as well to generalize over. And so if we go back circle now, we have all these saved models, and all these are basically being used for data collection. And so collecting experience, basically, we'll have, again, some environment. Now we have an instance of this replay buffer, where we'll be putting all this data that we're collecting on. And we have this concept of a driver that will basically utilize some policy. This could be either directly from the agent, or it could be a saved model that's been loaded when we're doing it on a distributed fashion. And we define this concept of an observer, which will-- as the driver is evaluating this policy with the environment, every observer that's passed to the driver will be able to take a look at the trajectory that was generated at that time step and use it to do whatever. And so in this case, we're adding it to the replay buffer. If we're doing evaluation, we would be computing some metrics based on the trajectories that we're observing, for example. And so once you have that, you can actually just run the driver and do the data collection. And so if we look at the agents, we have a whole bunch of agents that are readily available in the open-source setup. All of these have a whole bunch of tests, both quality and speed regression tests, as well. 
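(Before the discussion of the agents themselves continues, here is a minimal sketch of the collection loop just described -- a driver running a collect policy against a TensorFlow-wrapped environment, with the replay buffer's add_batch method as one of the observers. It follows the pattern in the TF-Agents tutorials and assumes tf_env and agent already exist.)

```python
from tf_agents.drivers import dynamic_step_driver
from tf_agents.metrics import tf_metrics
from tf_agents.replay_buffers import tf_uniform_replay_buffer

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,   # spec of the trajectories the agent trains on
    batch_size=tf_env.batch_size,        # number of parallel environment instances
    max_length=100000)

# Observers are called with every trajectory the driver generates; here we store
# the data and also track a couple of collection metrics.
observers = [replay_buffer.add_batch,
             tf_metrics.NumberOfEpisodes(),
             tf_metrics.AverageReturnMetric()]

collect_driver = dynamic_step_driver.DynamicStepDriver(
    tf_env, agent.collect_policy, observers=observers, num_steps=1)

# Running the driver steps the environment with the policy and feeds every observer.
time_step, policy_state = collect_driver.run()
```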
And we've been fairly selective to make sure that we pick state-of-the-art agents or methods within RL that have proven to be relevant over longer periods of time. Because maintaining these agents is a lot of effort, and so we have limited manpower to actually maintain these. So we try to be conservative on what we expose publicly. And so looking at how agents are defined in their API, the main things that we want to do with an agent is be able to access different kinds of policies that we'll be using, and then being able to train given some experience. And so we have a collection policy that you would use to gather all the experience that you want to train on. We have a train method that you feed in experience, and you actually get some losses out, and that will do the updates to the model. And then you have the actual policy that you want to use to actually exploit things. In most agents, this ends up being a greedy policy, like I mentioned, where in the distribution method we would just call them out to actually get the best action that we can. And so putting it together with a network, we instantiate some form of network that the agent expects. We give that and some optimizer. And there's a whole bunch of other parameters for the agent. And then from the replay buffer, we can generate a data set. In this case, for DQN, we need to train with transitions. So we need like a time step, an action, and then time step that happened afterwards. And so we have this num_steps parameter equal to 2. And then we simply sample the data set and do some training. And yeah. And so normally, if you want to do this sequentially, where you're actually doing some collection and some training, the way that it would look is that you have the same components, but now we alternate between collecting some data with the driver and the environment, and training on sampling the data that we've collected. So this can sometimes have a lot of different challenges where this driver is actually executing a policy and interacting with a Python environment outside of the TensorFlow context. And so a lot of the [? eager ?] utilities have come in really, really handy for doing a lot of these things. And so mapping a lot of these APIs back into the overview, if we start with the replay buffer and go clockwise, we'll have some replay buffer that we can sample through data sets. We'll have the concept of an agent, for example DqnAgent, that we can train based on this data. This is training some form of network that were defined. And the network is being used by the policies that the agents can create. We can then deploy these, either through saved models or in the same job, and utilize the drivers to interact with the environment, and collect experience through these observers back into the replay buffer. And then we can iterate between doing data collection and training. And then recently, we had a lot of help with getting things to work with TPUs, and accelerators, and distribution strategies. And so the biggest thing here is that, in order to keep all these accelerators actually busy, we really need to scale up the data collection rate. And so depending on the environments-- for example, in some cases in the robotics use cases, you might be able to get one or two time steps a second of data collection. And so then you need a couple of thousand jobs just to do enough data collection to be able to do the training. 
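(Tying that agent API together, a hedged sketch of the DQN setup just walked through -- a Q-network, a DqnAgent, and a dataset sampled from the replay buffer with num_steps=2 so that training sees transitions. Hyperparameters are illustrative and the pattern follows the public DQN Colab; collection itself would use a driver as in the earlier sketch.)

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer

tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

# Network: fully connected layers ending in one Q value per action.
q_net = q_network.QNetwork(
    tf_env.observation_spec(), tf_env.action_spec(), fc_layer_params=(100,))

agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(), tf_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    train_step_counter=tf.Variable(0))
agent.initialize()

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec, batch_size=tf_env.batch_size,
    max_length=100000)

# num_steps=2: DQN trains on transitions (time step, action, next time step).
dataset = replay_buffer.as_dataset(
    sample_batch_size=64, num_steps=2, num_parallel_calls=3).prefetch(3)
iterator = iter(dataset)

# Training alternates with collection; once the buffer has data, one update is:
# experience, _ = next(iterator)
# loss_info = agent.train(experience)
```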
In some other scenarios, you might be collecting data based on user interactions, and then you might only get one sample per user per day. And so then you have to be able to scale that up. And then on the distributed side, all the data that's being collected will be captured into some replay buffer. And then we can just use distribution strategies to be able to sample that and pull it in, and then distribute it across the GPUs or TPUs to do all the training. And then I'll give it to Sergio for a quick intro into bandits. SERGIO GUADARRAMA: So as we have been talking about, RL can be challenging in many cases. So now we will go a little bit into this subset of RL, what is called multi-armed bandits. This simplifies some of the assumptions, and it can be applied to a [INAUDIBLE] set of problems. But they are much easier to train and much, much easier to understand. So I want to cover this because, for many people who are new to RL, I recommend them to start with bandits first. And then if bandits still don't work for your problem, then you go and look into a full RL algorithm. And basically, the main difference between multi-armed bandits and RL is, here you make a decision every time, but it's like every time you make a decision, the game starts again. So one action doesn't influence the others. So basically, there's no such thing as long-term consequences. So you can make a decision every single time, and that will not influence the state of the environment in the future, which means a lot of things you can assume are simplified in your models. And with this one, basically, you don't need to worry about what actions I took in the past or how to do credit assignment, because now it's very clear. If I make this action and I get some reward, it's because of this action, because there are no more sequential [? patterns ?] anymore. And also, here you don't need to plan ahead. So basically, I don't need to think about what's going to happen after I make this action because it's going to have some consequences later. In the bandits case, we assume all the things are independent, basically. We assume every time you make an action, you can start playing the game again from scratch every single time. This used to be done more commonly with A/B testing, for people who know what A/B testing does. It's like, imagine you have four different flavors of your, I don't know, site, or problem, or four different options you can offer to the user. Which one is the best? You offer all of them to different users, and then you compute which one is the best. And then after you figure out which one is the best, then you serve that option to everyone. So basically, what happens is, during the time that you're offering these four options to everyone, some people are not getting the optimal option. During the time you are exploring, figuring out which is the best option, some of the people are not getting the best possible answer. So that is called regret-- how much better I could have done that I didn't, because I didn't give you the best answer from the beginning. So with multi-armed bandits, what it tries to do is, as you go, adapt how much exploration I need to do based on how confident I am that my model is good. So basically, it will start the same as A/B testing. At the beginning, it will give a random answer to every user.
But as soon as some users say, oh, this is better, I like it, it will start shifting and say, OK, I should probably go to that option everybody seems to be liking. So as soon as you start figuring out-- you are very confident your model is getting better-- then you basically start shifting and maybe serving everyone the same answer. So basically, the amount of regret, how much time you have given the wrong answer, decreases faster. So basically, the multi-armed bandit tries to estimate how confident I am about my model. When I'm not very confident, I explore. When I become very confident, then I don't explore anymore, I start exploiting. One example that is typically used for understanding multi-armed bandits is recommending movies. You have a bunch of movies I could recommend you. There's some probability that you may like this movie or not. And then I have to figure out which movie to recommend you. And then to make it even more personalized, you can use context. You can use user information. You can use previous things as your context. But the main thing is, if I make a recommendation today, that doesn't influence the recommendation I make tomorrow. And so basically, if I knew this was the probability that you like "Star Wars," I probably should recommend you "Star Wars." What happens is, before I start recommending you things, I don't know what you like. Only when I start recommending you things and you like some things and don't like other things, then I learn about your taste, and then I can update my model based on that. So here, there are different algorithms in this experiment-- here, lower is better. This is the regret. It's like, how far am I from offering you the optimal solution? Some of them are basically very random, and it takes forever, they don't learn much. Some of them just do this epsilon-greedy-- basically randomly give you something sometimes, and otherwise the best. And then there are other methods that use more fancy algorithms, like Thompson sampling or dropout Thompson sampling, which are more advanced algorithms that basically give you a better trade-off between exploration and exploitation. So for all those things, we have tutorials, we have a page on everything, so you can actually play with all these algorithms and learn. And I usually recommend, try to apply a bandit algorithm to your problem first. It makes more assumptions, but if it works, it's better. It's easier to train and easier to use. If it doesn't work, then go back to the RL algorithms. And these are some of the ones that are available currently within TF-Agents. Some of them I already mentioned. Some of them use neural networks. Some of them are more like linear models. Some of them use upper confidence bounds. So they try to estimate how confident I am about my model and all those things to basically get this exploration/exploitation trade-off right. As I mentioned, you can apply it to many of the recommender systems. You can imagine, I want to make a recommendation, and I never know what you like. I try different things, and then based on that, I improve my model. And this model gets very complicated when you start giving personalized recommendations. And finally, I want to talk about a couple of things. Some of them are about the roadmap, like where TF-Agents is going forward. Some of the things we already hit, but for example, adding new algorithms and new agents.
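(To make that concrete before the roadmap discussion continues, here is a toy illustration of the epsilon-greedy flavor of bandit just described. This is plain NumPy with made-up "like" probabilities, not the TF-Agents bandit agents.)

```python
import numpy as np

rng = np.random.default_rng(0)
true_like_prob = np.array([0.2, 0.5, 0.8])   # hypothetical per-movie like probabilities
estimates, counts = np.zeros(3), np.zeros(3)
epsilon, regret = 0.1, 0.0

for t in range(10000):
  # Explore with probability epsilon, otherwise exploit the current best estimate.
  arm = rng.integers(3) if rng.random() < epsilon else int(np.argmax(estimates))
  reward = float(rng.random() < true_like_prob[arm])          # did the user like it?
  counts[arm] += 1
  estimates[arm] += (reward - estimates[arm]) / counts[arm]   # running mean per movie
  regret += true_like_prob.max() - true_like_prob[arm]        # how much better we could have done

print('estimated like probabilities:', estimates.round(2), 'total regret:', round(regret, 1))
```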
We are working on that-- for example, bootstrapped DQN, I think, is almost ready to be open-sourced. Before we open-source any of these algorithms, what we do is we verify them. We make sure they are correct, we get the right numbers. And we also add them to the continuous testing, so they stay correct over time. Because in the past, it would happen to us also like, oh, we are good, it's ready, we put it out. One week later, it doesn't work anymore. Something changed somewhere-- who knows-- in our code base, in the TensorFlow code base, in TensorFlow Probability. Somewhere, something changed, and now the performance is not the same. So now we have this continuous testing to make sure they stay working. So we plan to have this leaderboard and pre-trained model releases, and to add more distributed support, especially for replay buffers, distributed collection, and distributed training. As Oscar was mentioning at the beginning, maybe in the future we're thinking of adding other new environments, like Unity or other environments that people are interested in. This is a graph that I think is relevant for people who are like, OK, how much time do you actually spend doing the core algorithm? You can think of this as the blue box. Basically, that's the algorithm itself, not the agent. And I would say probably 25% of total time goes into developing the actual algorithm and all those things. All the other time is spent on other things within the team. The replay buffer is quite time-consuming. TF 2-- when we did the migration from TF 1 to TF 2, it took a really good chunk of our time to make that migration. Right now, you can run our library in both TF 1 and TF 2. So we spent quite a bit of time to make sure that is possible. All the core of the library you can run in both; only the binaries are different. And usability also-- we spent quite a bit of time on things like how to refine the APIs. Do I need to change this, how easy is it to use, all those things. And we still have a lot of work to do. So we are not done with that. And tooling. All this testing, all this benchmarking, all the continuous evaluation, all those things-- we had to build this tooling around it to basically make it successful. And finally, I think, for those of you who didn't get the link at the beginning, you can go to GitHub, tensorflow/agents. You can get the package by pip install. You can start learning by using our Colabs or tutorials, with DQN on Cartpole. The Minitaur that we saw at the beginning, you can go and train yourself. And the Colab is really good. And you can solve important problems. That's the other part we really care about-- making sure we are production quality. The code base, the tests, everything we do-- we can deploy these models and everything so you can actually use them to solve important problems. We usually use games as examples, because they're easy to understand and easy to play around in. But in many other cases, we really apply it to more real problems. And actually, it's designed with that in mind. We welcome contributions and pull requests. And we try to review as best as we can new environments, new algorithms, or other new contributions to the library. [MUSIC PLAYING]