[MUSIC PLAYING]

SERGIO GUADARRAMA: Today, we are going to talk about reinforcement learning and how you can apply it to many different problems. Hopefully, by the end of the talk, you will know how to use reinforcement learning for your own problems and applications, and what we are doing at Google with this new technology.

So let me go back a little bit. Do you remember when you tried to do something difficult, something hard that you needed to try many times? For example, when you learned how to walk, do you remember? I don't remember. But it's pretty hard, because nobody tells you exactly how to do it. You just keep trying. And eventually, you're able to stand up, keep your balance, wobble around, and start walking.

So what if we want to teach this cute little robot how to walk? Imagine: how would you do that? How would you tell this robot how to walk? What we are going to do today is learn how we can do that with machine learning. And the reason is that if we wanted to do this by coding a set of rules, it would be really hard. What kind of rules could we put in code that would actually make this robot walk? We would have to handle coordination and balance. It's really difficult. The robot would probably just fall over, and we wouldn't know what to change in the code. Instead, we're going to use machine learning so the robot can learn on its own.

So the agenda for today is this. We are going to cover very quickly what supervised learning and reinforcement learning are, and what TF-Agents is. And we will go through multiple examples, so you can see how we can build up the different pieces to actually go and solve this problem and teach this robot how to walk. Finally, we will have some take-home messages that you can take with you today.

So how many of you know what supervised learning is? OK, that's pretty good. For those of you who don't know, let's go through a very simple example. We're going to have some input, in this case an image. We're going to pass it through our model, and it's going to produce some output. In this case, it's going to be a cat or a dog. And then we're going to tell the model what the right answer is. That's the key aspect: in supervised learning, we give you the label, the right answer, so you can modify your model and learn from your mistakes. In this case, you might use a neural net. We have a lot of weights that we can learn, and you can modify those connections to basically learn over time what the right answer is.

The thing supervised learning needs is a lot of labels. Many of you have probably heard about ImageNet. It's a data set collected by Stanford. It took over two years and $1 million to gather all this data, and they annotated millions of images with labels: in this image, there's a container ship; there's a motor scooter; there's a leopard. You label all these images so your model can learn from them. And that works really well when you have all these labels and can train your model on them.

The question is, how would you provide the labels for this robot? What are the right actions? I don't know. It's not that clear what the right answer would be in this case. So we are going to take a different approach, which is reinforcement learning. Instead of trying to provide the right answer-- like in a classical setting, where you go to class and they tell you the right answers.
You study, and this is the answer for this problem; we already know what the right answer is. In reinforcement learning, we assume we don't know the right answer. We need to figure it out ourselves. It's more like a kid playing around, putting blocks together, and eventually they're able to stack them up and stand up. And that gives you some reward: you feel proud of it, and then you keep doing it. Which exact actions you took? Not so relevant.

So let's formalize a little more what reinforcement learning is and make it concrete. Let's take a simpler example, like this little game. You want to bounce the ball around, move the paddle at the bottom left or right, hit all these bricks, clear them, and win the game. We're going to have this notion of an agent, a program that gets some observation. In this case, it looks at the frames of the game-- where the ball is, where the bricks are, where the paddle is-- and takes an action: I'm going to move to the left or I'm going to move to the right. Depending on where you move, the ball will drop, or you keep the ball bouncing back. And we're going to have this notion of reward: when you do well, you get positive reward, so you reinforce that behavior, and when you do poorly, you get negative reward.

So we can define simple rules to encode this behavior as a reward function. Every time you hit a brick, you get 10 points. Which actions do you need to take to hit the brick? I don't tell you. That's what you need to learn. But if you do it, I'm going to give you 10 points. And if you clear all the bricks, I'm going to give you a hundred points, to encourage you to play this game very well. Every time the ball drops, you lose 50 points, which means it's probably not a good idea to do that. And if you let the ball drop three times, the game is over; you need to stop the game.

The good thing about reinforcement learning is that you can apply it to many different problems. Here are some examples where, over the last years, people have been applying reinforcement learning. It goes from recommender systems in YouTube, to data center cooling, to real robots. You can apply it to math, chemistry, the cute little robot in the middle, and things as complex as Go-- DeepMind applied it in AlphaGo and beat the best player in the world using reinforcement learning.

Now, let me switch a little bit to TF-Agents and what it is. The main motivation for TF-Agents is that doing reinforcement learning is not very easy. It requires a lot of tools and a lot of things that you would otherwise need to build on your own. So we built this library that we use at Google, and we open sourced it so everybody can use it, to make reinforcement learning a lot easier. We made it very robust and scalable, and it's good for beginners: if you are new to RL, we have a lot of notebooks, examples, and documentation that you can start working from. It also works for complex problems; you can apply it to real, complex, realistic cases. For people who want to create their own algorithms, we also make it easy to add new algorithms. It's well tested and easy to configure. And furthermore, we built it on top of TensorFlow 2.0, which you have probably heard about at Google I/O before, and we made it in such a way that developing and debugging are a lot easier.
You can use TF eager mode, Keras, and TF functions to make things a lot easier to build. It's very modular and very extensible.

Let me cover the main pieces of the software, so that when we go through the examples, you have a better sense of them. On the left side, we have all the data collection. When we play this game, we are going to collect data-- we play the game and collect data so we can learn from it. And on the right side, we have a training pipeline. Once we have the data-- a data set, or logs, or games we played-- we're going to train and improve our model, in this case the neural net, then deploy it, collect more data, and repeat. So now, let me hand it over to Eugene, who is going to go over the CartPole example.

EUGENE BREVDO: Thanks, Sergio. Yeah, so the first example we're going to go over is a problem called CartPole. This is one of the classical control problems. Imagine that you have a pole in your hand, and it wants to fall over because of gravity, and you have to move your hand left and right to keep it upright. If it falls over, then game over. If you move off the screen by accident, then game over.

So let's make that a little bit more concrete. In this environment, the observation is not the images that you see here. Instead, it's a 4-vector containing the angles and velocities of the pole and the cart. The actions are the values 0 and 1, representing moving left or right. And the reward is the value 1.0 for every time step, or frame, that the pole is up and hasn't fallen over more than 15 degrees from vertical. Once it has, the episode ends.

OK, so if you were to implement this problem, or environment, yourself, you would subclass the TF-Agents PyEnvironment class, and you would provide two properties. One is called the observation spec, and that defines what the observations are. And you would implement the action spec property, which describes what actions the environment allows. And there are two major methods. One is reset, which resets the environment and brings the pole back to the center and vertical. And the step method, which accepts the action, updates any internal state, and emits the observation and the reward for that time step.

Now, for this particular problem, you don't have to do that. We support OpenAI Gym, which is a very popular framework for environments in Python, and you can simply load CartPole from that. That's the first line. And now you can perform some introspection. You can interrogate the environment and ask what its observation spec is. Here, you can see that it's a 4-vector of floating point values, again describing the angles and velocities of the pole. And the action spec is a scalar integer taking on the values 0 and 1, representing left and right.

So if you had your own policy that you had built, maybe a scripted policy, you would be able to interact with the environment by loading it, building your policy object, resetting the environment to get an initial state, and then iterating over and over again: passing the observation, or the state, to the policy, getting an action from that, passing the action back to the environment, and maybe calculating your return, which is the sum of the rewards over all steps.
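For reference, here is a minimal sketch of that interaction loop in TF-Agents: load CartPole through the Gym suite, inspect its specs, and run a simple (here, random) policy for one episode. It follows the library's public API, but exact argument names may differ between versions, and the random policy stands in for whatever scripted policy you might build.

```python
from tf_agents.environments import suite_gym
from tf_agents.policies import random_py_policy

# Load CartPole from OpenAI Gym via the TF-Agents suite.
env = suite_gym.load('CartPole-v0')
print(env.observation_spec())   # 4-vector of floats: angles and velocities
print(env.action_spec())        # scalar int in {0, 1}: move left or right

# A stand-in policy; a scripted policy would expose the same action() method.
policy = random_py_policy.RandomPyPolicy(
    time_step_spec=env.time_step_spec(), action_spec=env.action_spec())

time_step = env.reset()
episode_return = 0.0
while not time_step.is_last():
    action_step = policy.action(time_step)        # observation in, action out
    time_step = env.step(action_step.action)      # action in, next observation and reward out
    episode_return += time_step.reward
print('Return:', episode_return)
```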
Now, the interesting part comes when you want a trainable policy and you want it to learn from its successes in the environment. To do that, we put a neural network in the loop. The neural network takes in the observations. In this case-- and the algorithm we're talking about is called policy gradients, also known as REINFORCE-- it's going to emit probabilities over the actions that can be taken. So here, it's going to emit a probability of taking a left or a probability of taking a right, and that's parameterized by the weights of the neural network, called theta.

Ultimately, the goal of this algorithm is to modify the neural network over time to maximize what's called the expected return. As I mentioned, the return is the sum of the rewards over the duration of the episode. This expectation is difficult to calculate analytically. So what we're going to do is sample episodes by playing, get trajectories, and store those trajectories-- these are observation-action pairs over the episode-- and add them up. That's our Monte Carlo estimate of the return. And we're going to use a couple of tricks to convert that optimization over an expectation into a sum that we can optimize using gradient descent. I'm going to skip over some of the math, but basically, we use something called the log trick to convert this gradient into a gradient over the outputs of the neural network. That's the log pi theta right there-- that's the output of the network. And we're going to multiply that by the Monte Carlo estimate of the returns, and average over the time steps within the episode and over many batches of episodes.

Putting this into code-- and by the way, we implement this for you, but this is roughly the pseudocode-- you get this experience when you're training, you extract its rewards, and you do a cumulative-sum type operation to calculate the returns. Then you take the observations over all the time steps, and you calculate the logits, the log probabilities coming out of the neural network. You pass those to a distribution object-- this is a TensorFlow Probability distribution object-- to get the distributions over the actions. And then you can calculate the full log probability of the actions that were taken in your trajectories, in your logs, calculate this approximation of the expectation, and take its gradient.

OK, so as an end user, you don't need to worry about that too much. What you want to do is load your environment and wrap it in something called a TFPyEnvironment. That eases the interaction between the Python environment on one side and the neural network, which is executed by the TensorFlow runtime, on the other. Now, you can also create your neural network. Here, you can write your own; basically, it's a sequence of Keras layers. Those of you who are familiar with Keras know that makes it very easy to describe your own architecture for the network. We also provide a number of neural networks. This one accepts a number of parameters that configure the architecture-- here, there are two fully connected layers with sizes 32 and 64. You pass this network and the specs associated with the environment to the agent class, and now you're ready to collect data and to train.

So to collect data, you need a place to store it. And Sergio will talk about this more in the second example, but basically, we use something called replay buffers to store these trajectories. And we provide a number of utilities that will collect the data for you; they're called drivers.
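As a reference, here is a rough sketch of how those pieces-- the actor network, the REINFORCE agent, a replay buffer, and a driver-- fit together for CartPole. It loosely follows the TF-Agents tutorials; exact class and argument names may differ between versions, and the optimizer and iteration counts are illustrative choices, not values from the talk.

```python
import tensorflow as tf
from tf_agents.agents.reinforce import reinforce_agent
from tf_agents.drivers import dynamic_episode_driver
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import actor_distribution_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer

# Wrap the Python environment so it can talk to the TensorFlow runtime.
train_env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

# Actor network with two fully connected layers, as in the example architecture.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    train_env.observation_spec(), train_env.action_spec(),
    fc_layer_params=(32, 64))

agent = reinforce_agent.ReinforceAgent(
    train_env.time_step_spec(), train_env.action_spec(),
    actor_network=actor_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))
agent.initialize()

# A replay buffer to hold collected trajectories, and a driver that fills it
# by running the agent's collect policy in the environment.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=10000)

collect_driver = dynamic_episode_driver.DynamicEpisodeDriver(
    train_env, agent.collect_policy,
    observers=[replay_buffer.add_batch],   # every step is stored in the buffer
    num_episodes=2)

# REINFORCE is on-policy: collect a couple of episodes, train on them, discard.
for _ in range(400):
    collect_driver.run()
    experience = replay_buffer.gather_all()
    agent.train(experience)
    replay_buffer.clear()
```

This mirrors the loop described next: run the driver, pull everything out with gather_all, train, and throw the data away because REINFORCE is on-policy.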
So this driver takes the environment, the policy exposed by the agent, and a number of callbacks. What it's going to do is iterate, collecting data: interacting with the environment, sending it actions, collecting observations, and sending those to the policy-- it does all of that for you. And at every time step, it stores the result in the replay buffer. So to train, you iterate calling driver.run, which populates the replay buffer. Then you pull all of the trajectories out of the replay buffer with gather_all and pass them to agent.train, which updates the underlying neural networks. And because policy gradients is something called an on-policy algorithm, all that hard-earned data you've collected, you have to throw away and collect more.

OK, so that said, CartPole is a fairly straightforward classical problem, as I mentioned, and policy gradients is a fairly standard, somewhat simple algorithm. After about 400 iterations of playing the game, you can see that, whereas you started with a random policy that can't keep the pole up at all, you basically end up with a perfect policy. And if you were to look at your TensorBoard while you're training, you'd see a plot like this, which shows that as the number of episodes being collected increases, the total return-- the sum of the rewards over the episode-- goes up pretty consistently. And at around 400 to 500 episodes, we have a perfect policy that runs for 200 steps, at which point the environment says, all right, you're good, you win, and you're done. OK, so I'm going to hand it back over to Sergio to talk about Atari and deep Q-learning.

SERGIO GUADARRAMA: Thank you, again. So now we're going back to this example that I talked about at the beginning, how to play this game. And now we're going to go through more details of how this actually works, and how deep Q-learning helps us in this case. Let's go back to our setting. We have our environment where we're going to be playing. We're going to get some observations, in this case frames. The agent's role is to produce different actions, like go left with the paddle or go right, get some rewards in the process, and then improve over time by incorporating those rewards into the model.

Let's take a little step back. While I'm playing Breakout, I have seen so far what I've been doing-- the ball is going somewhere, I'm moving in a certain direction-- and then, what should I do now? Should I go to the right or should I go to the left? If I knew what was going to happen, it would be very easy: if I knew the ball was going to go this way and things were going to end up like that, it would be easy. But we don't know what's going to happen in the future. Instead, we are going to try to estimate it. If I move to the right, maybe the ball will drop-- it's more likely that the ball drops, because I'm moving in the opposite direction from where the ball is going. And if I move to the left, on the contrary, I'm going to hit the ball, I'm going to hit some bricks, I'm getting closer to clearing all the bricks. So the idea is that I want to learn a model that can estimate that: whether this action is going to make things better in the future or worse. And that's something that we call the expected return.
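As a point of reference-- this is the standard textbook formulation, not something shown verbatim on the slides-- the expected return for taking action a in state s and then following the policy π is usually written as a Q function:

```latex
Q_{\pi}(s, a) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \;\middle|\; s_{0}=s,\ a_{0}=a\right]
```

Here r_t is the reward at step t, and γ ∈ [0, 1] is a discount factor that weights near-term rewards more heavily than distant ones; these symbols are standard notation rather than terms taken from the talk.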
So this is the notion Eugene was talking about before. Before, we were just computing the sum of the rewards. Here, we're going to say: for this action, I want to estimate how much reward it's going to give me in the future, and then I choose the action that, according to my estimate, is the best one. We can formulate this with math: it's basically an expectation over the sum of the rewards into the future. And that's what we call the Q function, or the critic. It's a critic because, given some state and the possible actions, it tells us which action is actually better. It criticizes, in a way: if you take this action, my expectation of the return is very high; if you take a different action, my expectation is low.

And then what we're going to do is learn this Q function. Because we don't know-- we don't know what's going to happen. But by playing, we can learn what the expected return is, by comparing our expectation with the actual returns. So we are going to use our Q function-- in this case, a neural net-- to learn this model. And once we have a learned model, we can just take the best action according to our model and play the game.

Conceptually, this looks similar to what we saw before. We're going to have another neural net, and in this case, the output is going to be the Q values, this expectation of our future returns. The idea is that we get an observation, in this case the frames-- maybe with some history-- and then we produce some Q values: my current expectation if I move to the left, and my current expectation if I move to the right. Then I compare my expectation with what actually happens. If my expectation is too high, I lower it, and if my expectation is too low, I increase it. That way, we change the weights of this network to improve over time by playing this game.

Going back to how you do this in code: basically, we load this environment, in this case from the Atari suite, which is also available through OpenAI Gym. We say, OK, load the Breakout game, and now we are ready to play. We have some observations-- we define what kind of observations we get, in this case frames of 84 by 84 pixels-- and we also have multiple actions we can take. In this game, we can only go left and right, but other games in this suite can have different actions: maybe jumping, firing, and other things that different games have.

So now, we want to do what we said before. We define this Q network-- remember, it's a neural net that is going to represent these Q values-- with some parameters that define how many layers and so on we want to have. And then we have the DQN agent, which takes the network and an optimizer, which is what lets us improve this network over time, given some experience. For this experience, we assume we have collected some data by playing the game-- maybe not very well at the beginning, because we are taking random actions, for example. So we're not playing very well, but we can get some experience, and then we can improve over time. Every time we improve our estimates, we play a little better, and then we collect more data.
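Here is a rough sketch of that Breakout setup-- loading the game from the Atari suite, defining a Q network, and building the DQN agent. It follows the TF-Agents API as described in the talk; the Gym id, the convolutional layer sizes, the frame rescaling, and the optimizer settings are illustrative assumptions, not values from the slides.

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_atari, tf_py_environment
from tf_agents.networks import q_network

env_name = 'BreakoutNoFrameskip-v4'   # assumed Gym id for Breakout
train_env = tf_py_environment.TFPyEnvironment(suite_atari.load(env_name))

print(train_env.observation_spec())   # 84x84 pixel frames
print(train_env.action_spec())        # the discrete actions this game allows

# Q network: convolutional layers over the frames, then a fully connected layer.
q_net = q_network.QNetwork(
    train_env.observation_spec(), train_env.action_spec(),
    # Rescale the raw uint8 pixels before the conv stack (an assumed preprocessing step).
    preprocessing_layers=tf.keras.layers.Lambda(
        lambda x: tf.cast(x, tf.float32) / 255.0),
    conv_layer_params=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
    fc_layer_params=(512,))

# The agent pairs the Q network with an optimizer that improves it from experience.
agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(), train_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=2.5e-4))
agent.initialize()
```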
And then the idea is that this agent has a train method that goes through this experience and improves over time. In general, for cases where games or environments are too slow, we don't want to play one game at a time-- these computers can play multiple games in parallel. So we have this notion of parallel environments: you can play multiple copies of the same game at the same time, which makes learning a lot faster. In this case, we are playing four games in parallel with the policy we just defined, so the agent plays four games at the same time. That way, we get a lot more experience and can learn a lot faster.

As we mentioned before, once we have collected all this data by playing the game, in this case we don't want to throw the data away; we can keep learning from it. So we're going to have this replay buffer, which keeps all the data we're collecting. Different games go into different positions, so we don't mix the games up, but we throw all the data into this replay buffer. And in code, it's simple: we have the environment, we create the replay buffer we've already defined, and then, using the driver-- and this is the important part-- we add to the replay buffer. Every time you take an action in this game, you add it to the replay buffer, so later the agent has all that experience.

And because DQN is an off-policy method-- which is different from the previous method, which was on-policy-- in this case, we can actually use all the data. We can keep all the data around and keep training on old data too; we don't need to throw it away. That's very important, because it makes things more efficient. What we do, once we have collected the data in this replay buffer, is sample from it. We sample different games and different parts of the games, and we say, OK, let's replay this situation and maybe take a different action this time. What action would you take if you were in the same situation? Maybe you moved to the left and the ball dropped, so maybe now you want to move to the right. That's how the model is going to learn: by sampling games you played before, improving your Q function, and changing the way you behave.

So now, let's try to put these things back together. Let's go slowly, because there are a lot of pieces. We have our Q network, which we use to define the DQN agent in this case. We have the replay buffer, where we put all the data we're collecting as we play. We have this driver, which basically drives the agent in the game: it makes the agent play and adds the experience to the replay buffer. And then, once we have enough data, we can iterate over that data: we get batches of experience, different samples, and that's what we use to train the agent. So we alternate: collect more data, and train the agent. Every time we collect and train, the agent gets a little better, then we collect more data, and we keep alternating.
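Continuing the DQN sketch from above, here is roughly how those pieces-- parallel environments, a replay buffer fed by a driver, and off-policy training from sampled batches-- might be wired together. Names follow the TF-Agents tutorials; buffer sizes, batch sizes, and iteration counts are illustrative assumptions.

```python
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import (parallel_py_environment, suite_atari,
                                    tf_py_environment)
from tf_agents.replay_buffers import tf_uniform_replay_buffer

# Play four copies of the game in parallel to collect experience faster.
parallel_env = parallel_py_environment.ParallelPyEnvironment(
    [lambda: suite_atari.load('BreakoutNoFrameskip-v4')] * 4)
train_env = tf_py_environment.TFPyEnvironment(parallel_env)

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,    # `agent` is the DqnAgent from the sketch above
    batch_size=train_env.batch_size,      # one slot per parallel game
    max_length=100000)

collect_driver = dynamic_step_driver.DynamicStepDriver(
    train_env, agent.collect_policy,
    observers=[replay_buffer.add_batch],  # every step gets added to the buffer
    num_steps=4)

# DQN is off-policy, so we keep old data around and sample mini-batches from it.
dataset = replay_buffer.as_dataset(sample_batch_size=32, num_steps=2).prefetch(3)
iterator = iter(dataset)

num_iterations = 10000                     # pick something appropriate for your setup
for _ in range(num_iterations):
    collect_driver.run()                   # collect a bit more experience
    experience, _ = next(iterator)         # sample a batch of transitions
    loss_info = agent.train(experience)    # improve the Q network
```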
At the end, what we want to do is evaluate this agent. So we have methods to compute metrics: for example, how long are you playing the game, how many points are you getting-- all the things we want to compare as metrics. We have methods that say, OK, take all these metrics, take this environment, take the agent's policy, and evaluate it over multiple games, multiple episodes.

What this actually looks like is something like this. For example, in the Breakout game, the curves look like this. At the beginning, we don't score any points: we don't know how to move the paddle, the ball just keeps dropping, and we lose the game over and over. Eventually, the model figures out that by moving the paddle in different directions, the ball bounces back and starts hitting the bricks. After about 4 or 5 million frames, the model gradually learns how to actually play this game: you can see that around 4 or 5 million frames, it gets very good scores, around 100 points-- it's hitting bricks, collecting points, and clearing all the bricks. We also have graphs for different games, like Pong, which is basically two paddles trying to bounce the ball between them, Enduro, Qbert-- there are another 50 or 60 games in this suite, and you can basically change one line of code and play a different game. I'm not going to go through those details, but just to make clear that it's simple to play different games. Now, let me hand it back over to Eugene, who's going to talk a little more about the Minitaur. Thanks, again.

EUGENE BREVDO: OK, so our third and final example is the Minitaur robot, and it goes back to one of the first slides that Sergio showed at the beginning of the talk: learning to walk. So there is a real robot. It's called the Minitaur, and here it's failing pretty hard. We're going to see if we can fix that. The algorithm we're going to use is called Soft Actor-Critic.

OK. So again, at the bottom are some images of the robot, and you can see it looks a little fragile. We want to train it, and we want to avoid breaking it at the beginning, when our policy can't really stay up. So what we're going to do is model it in a physics simulator called PyBullet-- that's what you see at the top. And then, once we've trained it and we're confident about the policy on that version, we're going to transfer it back onto the robot and do some final fine-tuning. Here, we're going to focus on the training in simulation.

I won't go into the mathematical details of Soft Actor-Critic, but here are some fundamental aspects of the algorithm. One is that it can handle both discrete and continuous action spaces. Here, we're going to be controlling motors and actuators, so it's a continuous action space. It's data-efficient, meaning that all this hard-earned data that you collected in simulation or got from the robot, you don't have to throw it away while you're training; you can keep it around for retraining. Also, the training is stable: compared to some other algorithms, this one is less likely to diverge during training. And finally, one of the fundamental aspects is that, as the name says, it combines an actor neural network and a critic neural network to accelerate training and keep it stable.

Again, for Minitaur, you can basically do a pip install of PyBullet, and you get Minitaur for free. You can load it using the PyBullet suite in TF-Agents.
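A minimal sketch of that loading step, assuming the 'MinitaurBulletEnv-v0' Gym id registered by pybullet_envs (the exact environment name is an assumption):

```python
from tf_agents.environments import suite_pybullet

env = suite_pybullet.load('MinitaurBulletEnv-v0')
print(env.observation_spec())   # roughly 28 floats: motor angles, velocities, torques, orientation
print(env.action_spec())        # 8 actuators, each taking a force in [-1, 1]
```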
And if you were to look at this environment, you'd see that there are about 28 sensors on the robot that return floating point values: different aspects of the configuration you're in, forces, velocities, things like that. And for the action, there are eight actuators on the robot, and it can apply a force-- positive or negative, minus 1 to 1-- to each of those eight actuators.

Now, here's the whole setup brought together. You can load four of these simulations, have them running in parallel, and try to maximize the number of cores you're using when you're collecting data. To do that, we provide the parallel Py environment, which Sergio spoke about, wrapped in a TF Py environment. And now we get down to the business of setting up the neural network architecture for the problem. First, we create the actor network. What the actor network does is take these sensor observations, this 28-vector, and emit samples of actuator values. Those samples are random draws from, in this case, a Gaussian or normal distribution. So this actor distribution network takes something called a projection network-- we provide a number of standard projection networks, and this one emits samples from a Gaussian distribution-- and the neural network that feeds into it sets up the parameters of that distribution.

Now, the critic network, which is in the top right, takes a combination of the current sensor observations and the action sample that the actor network emitted, and it estimates the expected return: how much longer, given this action, is my robot going to stay up? How well is it going to gallop? That is trained from the trajectories and the rewards that you're collecting, and that, in turn, helps train the actor. So you pass these networks and these specs to the Soft Actor-Critic agent, and you can look at its collect policy. That's the thing that you're going to pass to the driver to start collecting data and interacting with the environment. I won't go into the details of actually doing that, because it's literally identical to the deep Q-learning example before: you need the replay buffer, you use the driver, and you go through the same motions.
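Putting the pieces Eugene just listed into code, here is a sketch of the actor network, critic network, and SAC agent, loosely following the TF-Agents SAC tutorial. The layer sizes and optimizer settings are illustrative assumptions, and in practice you would typically also configure the actor's projection network so the sampled actions are squashed into the [-1, 1] actuator range.

```python
import tensorflow as tf
from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent
from tf_agents.environments import (parallel_py_environment, suite_pybullet,
                                    tf_py_environment)
from tf_agents.networks import actor_distribution_network

# Four simulations running in parallel, wrapped for the TensorFlow runtime.
train_env = tf_py_environment.TFPyEnvironment(
    parallel_py_environment.ParallelPyEnvironment(
        [lambda: suite_pybullet.load('MinitaurBulletEnv-v0')] * 4))

observation_spec = train_env.observation_spec()
action_spec = train_env.action_spec()

# Actor: maps the sensor readings to a distribution over the 8 actuator values.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    observation_spec, action_spec, fc_layer_params=(256, 256))

# Critic: takes (observation, action) and estimates the expected return.
critic_net = critic_network.CriticNetwork(
    (observation_spec, action_spec),
    joint_fc_layer_params=(256, 256))

agent = sac_agent.SacAgent(
    train_env.time_step_spec(), action_spec,
    actor_network=actor_net,
    critic_network=critic_net,
    actor_optimizer=tf.keras.optimizers.Adam(3e-4),
    critic_optimizer=tf.keras.optimizers.Adam(3e-4),
    alpha_optimizer=tf.keras.optimizers.Adam(3e-4))
agent.initialize()

# agent.collect_policy is what you hand to a driver, exactly as in the DQN example.
```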
What I'm going to show is what you should expect to see in TensorBoard while you're training in simulation. On the top, you see the average episode length and the average return as a function of the number of environment steps you've taken-- the number of time steps. On the bottom, you see the same thing, but on the x-axis you see the number of episodes you've gone through. And what you can see is that after about 13,000 or 14,000 simulated episodes, we're starting to really learn how to walk and gallop. The episode lengths get longer, because it takes longer to fall down, and the average return also goes up, because it's a function of how long we stay up and how well we can gallop.

So again, this is a PyBullet simulation, a rendering of the Minitaur. At the very beginning, when the policy just emits random values-- the neural network is randomly initialized-- it can barely stay up; it basically falls over. About halfway through training, it's starting to be able to get up, maybe take a few steps, then falls over; if you apply some external forces, it'll just fall over. By about 16,000 iterations of this, it's a pretty robust policy. It can stand, it can gallop. If there's an external force pushing it over, it'll be able to get back up and keep going.

And once you have that trained policy, you can transfer it-- export it as a SavedModel, put it on the actual robot, and then start the fine-tuning process. Once you've fine-tuned it, you have a pretty neat robot. In my head, when I look at this video, I think of the "Chariots of Fire" theme song. I don't know if you've ever seen it, but it's pretty cool. So now, I'm going to return it back to Sergio to provide some final words.

SERGIO GUADARRAMA: Thank you, again. So, pretty cool, no? You can start from the beginning, learn how to walk, initially in simulation, and then transfer it to a real robot and make it work on the real robot. That's part of the goal of TF-Agents: we want to make RL very easy. You can download the code-- you can scan the code over there and go to the GitHub-- and start playing with it. We already have a lot of different environments, more than we talked about today; we just covered three examples, but there are many other environments available. We are hoping that Unity ML-Agents support comes soon, so you can also interact with Unity environments. And maybe some of you are interested in contributing your own environments, your own problems-- we are very happy to take pull requests and contributions for everything.

For those of you who say, OK, those games are really good, the games look nice, but I have my own problem-- what do I do? Let's go back to the beginning, when we talked about how you can define your own environment, your own task. This is the main piece that you need: the API you need to implement to bring your task, your problem, to TF-Agents. You define the specifications of your observations: what things can I see? Can I see images, can I see numbers, and what do they mean? What actions are available to me? Do I have two options, three options, 10 different options-- what are the possibilities? Then the reset method, because, as we said, while we're learning, we need to keep trying, so we need to reset and start again. And then the step function: if I give you an action, what happens? How does the environment, the task, evolve? How does the state change? And you need to tell me the reward: am I doing well? Am I going in the right direction or the wrong direction, so I can learn from it? So this is the main piece of code that you will need to implement to solve your own problem.

Additionally, we only talked about three algorithms, but we have many more in the code base, and there are many more coming. There's a lot of variety-- different algorithms with different strengths that you can apply to different problems-- so you can try different combinations and see which one actually works for your problem. And we are also taking contributions from people who say, oh, I have this algorithm, I want to implement this one, I have this new idea-- and maybe you can solve other problems with your algorithm.
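Going back to that custom-environment API for a moment, a minimal subclass might look roughly like this. The structure-- specs, _reset, _step-- follows the TF-Agents PyEnvironment interface described above, while the toy counting task itself is made up purely for illustration.

```python
import numpy as np
from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts


class CountToTenEnv(py_environment.PyEnvironment):
    """Toy example: the agent is rewarded for incrementing a counter to 10."""

    def __init__(self):
        # What the agent can do: 0 = do nothing, 1 = increment the counter.
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=1, name='action')
        # What the agent can see: the current counter value.
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(1,), dtype=np.int32, minimum=0, name='observation')
        self._counter = 0

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _reset(self):
        # Start a new episode from scratch.
        self._counter = 0
        return ts.restart(np.array([self._counter], dtype=np.int32))

    def _step(self, action):
        if self.current_time_step().is_last():
            return self.reset()
        if action == 1:
            self._counter += 1
        if self._counter >= 10:
            # Tell the agent it did well and end the episode.
            return ts.termination(
                np.array([self._counter], dtype=np.int32), reward=10.0)
        # Small penalty per step encourages finishing quickly.
        return ts.transition(
            np.array([self._counter], dtype=np.int32), reward=-0.1, discount=1.0)
```

You can sanity-check an implementation like this with tf_agents.environments.utils.validate_py_environment before handing it to an agent.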
Furthermore, we apply this not only to these games-- we apply it at Google, for example, in robotics, to a really complex problem where we have multiple robots trying to learn how to grasp objects and move them to different places. In this case, we have all these robots just trying to grasp, and failing at the beginning. And eventually, they learn: where is the object? How do I move the hand? How do I close the gripper in the right place? And now, how do I grasp it? This is a very complex task you can solve with reinforcement learning.

You can also address many other problems, for example recommender systems like YouTube recommendations, Google Play, navigation, news. For many of these problems, you can basically say: I want to optimize for my objective, my long-term value-- not only the short term, but the long-term value. And reinforcement learning is really good when you want to optimize for the long-term value, not only the short term.

Finally, we have put a lot of effort into making this code available and usable for a lot of people. At Google, we have also defined these AI principles, and when we develop all this code and make it available, we follow those principles. We want to make sure it is used for things that benefit society; that it doesn't reinforce unfair bias; that it doesn't discriminate; that it is built and tested for safety and privacy; that it is accountable. We keep very high standards, and we also want to make sure that everybody who uses this code embraces those principles and tries to make things better. There are many applications we want to pursue, and we don't want this to be used for harm and the kinds of damaging things we know can happen.

Finally, I want to thank the whole team. It's not just Eugene and me who made this happen-- there are other people behind it. This is the TF-Agents team over here, and there are a lot of contributors who have contributed to the code. We are very proud of all the work they have done to make this happen, to make it possible for this to be open source and available for everyone.

So as we said before, we want all of you to join us on GitHub. Go to the web page, download it, and start playing with it. A really good place to start is the Colabs, the notebooks: say, OK, I want to try the REINFORCE example, I want to try DQN or Soft Actor-Critic. We have notebooks you can play with, and Colab will run all these examples for you. And if you have issues or pull requests, they are welcome-- we want you to be part of the community and contribute to make this a lot better. Furthermore, we are also looking for new applications: what all of you can do with these new tools. There are a lot of new problems you can apply this to, and we are looking forward to it. So thank you very much, and we hope to see you around.

[APPLAUSE]

[MUSIC PLAYING]