

  • [MUSIC PLAYING]

  • SERGIO GUADARRAMA: Today, we are going

  • to talk about reinforcement learning and how you can apply it

  • to many different problems.

  • So hopefully, by the end of the talk,

  • you will know how to use reinforcement learning

  • for your problem, for your applications, and what

  • other things we are doing at Google with all

  • this new technology.

  • So let me go back a little bit--

  • do you remember when you tried to do something difficult,

  • something hard that you needed to try many times?

  • For example, when you learned how to walk, do you remember?

  • I don't remember.

  • But it's pretty hard because nobody tells you

  • exactly how to do it.

  • You just keep trying.

  • And eventually, you're able to stand up, keep the balance,

  • wobble around, and start walking.

  • So what if we want to teach this cute little robot how to walk?

  • Imagine-- how would you do that?

  • How would you tell this robot how to walk?

  • So what we are going to do today is

  • learn how we can do that with machine learning.

  • And the reason for that is because if we

  • want to do this by coding a set of rules,

  • it would be really hard.

  • What kind of rules would we put in code that can actually

  • make this robot walk?

  • We have to do coordination, balance.

  • It's really difficult. And then it probably

  • would just fall over,

  • and we wouldn't know what to change in the code.

  • Instead of that, we're going to use machine

  • learning to learn from it.

  • So the agenda for today is going to be this.

  • We are going to cover very quickly what

  • supervised learning and reinforcement learning are, and what

  • TF-Agents is, these things we just talked about.

  • And we will go through multiple examples.

  • So you can see we can build up different pieces to actually

  • go and solve this problem, teach this robot how to walk.

  • And finally, we will have some take-home messages that you

  • can take with you today.

  • So how many of you know what supervised learning is?

  • OK.

  • That's pretty good.

  • For those of you who don't know, let's go

  • to a very simple example.

  • So we're going to have some inputs, in this case,

  • like an image.

  • And we're going to pass it through our model,

  • and it's going to produce some outputs.

  • In this case, it's going to be a cat or a dog.

  • And then, we're going to tell you what the right answer is.

  • So that's the key aspect.

  • In supervised learning, we tell you the label--

  • what the right answer is--

  • so you can modify your model and learn from these mistakes.

  • In this case, you might use a neural net.

  • We have a lot of ways that you can learn.

  • And you can modify those connections

  • to basically learn over time what is the right answer.

  • The thing that supervised learning needs

  • is a lot of labels.

  • So many of you probably heard about ImageNet.

  • It's a data set collected by Stanford.

  • It took like over two years and $1 million

  • to gather all this data.

  • And they could annotate millions of images with labels.

  • Say, in this image, there's a container ship.

  • There's a motor scooter.

  • There's a leopard.

  • And then, you label all these images

  • so your model can learn from it.

  • And that works really well when

  • you can have all these labels, and then you

  • can train your model from them.

  • The question is, how would you provide the labels

  • for this robot?

  • What are the right actions?

  • I don't know.

  • It's not that clear.

  • What will be the right answer for this case?

  • So we are going to take a different approach, which

  • is reinforcement learning.

  • Instead of trying to provide the right answer--

  • like in a classical setting, you go to class,

  • and they tell you what the right answers are.

  • You know, you study, this is the answer for this problem.

  • We already know what the right answer is.

  • In reinforcement learning, we assume

  • we don't know what is the right answer.

  • We need to figure it out ourselves.

  • It's more like a kid

  • playing around, putting these blocks together.

  • And eventually, they're able to stack them up together, and stand

  • up.

  • And that gives you like some reward.

  • It's like, oh, you feel proud of it, and then you keep doing it.

  • Which are the actions you took?

  • Not so relevant.

  • So let's formalize a little more what reinforcement learning is

  • and how you can actually make these

  • into more concrete examples.

  • Let's take a simpler example, like this little game

  • that you're trying to play.

  • You want to bounce the ball around, move the paddle

  • at the bottom left or right, and then you

  • want to hit all these bricks, play this game,

  • clear them all, and win the game.

  • So we're going to have this notion of an agent,

  • or program, that's going to get some observations.

  • In this case, our agent is going to look at the game--

  • where is the ball, where are the bricks, where is the paddle--

  • and take an action.

  • I'm going to move to the left or I'm going to move to the right.

  • And depending where you move, the ball will drop,

  • or you actually start keeping the ball bouncing back.

  • And we're going to have this notion of reward,

  • which is, when you do well, we

  • want you to get positive reward, so you reinforce that behavior.

  • And when you do poorly, you will get negative reward.

  • So we can define simple rules and simple things

  • to basically encode this behavior as a reward function.

  • Every time you hit a brick, you get 10 points.

  • Which actions do you need to do to hit the brick?

  • I don't tell you.

  • That's what you need to learn.

  • But if you do it, I'm going to give you 10 points.

  • And if you clear all the bricks, I'm

  • going to give you actually a hundred

  • points to encourage you to actually play

  • this game very well.

  • And every time the ball drops, you

  • lose 50 points, which means, probably not

  • a good idea to do that.

  • And if you let the ball drop three times, game is over,

  • you need to stop the game.
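
A toy sketch of the reward scheme just described, written as a plain Python function; this is illustrative only (the function name and inputs are made up), and in the Atari suite the game itself supplies the rewards.

```python
# Toy sketch of the reward scheme described above (illustrative only --
# in the Atari suite the game itself supplies the rewards).
def breakout_reward(brick_hit, board_cleared, ball_dropped):
    reward = 0.0
    if brick_hit:
        reward += 10.0   # every brick you hit
    if board_cleared:
        reward += 100.0  # clearing all the bricks
    if ball_dropped:
        reward -= 50.0   # letting the ball drop
    return reward
```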

  • So the good thing about reinforcement learning is that

  • you can apply it to many different problems.

  • And here are some examples where, over the last years, people

  • have been applying reinforcement learning.

  • And it goes from recommender systems in YouTube, to data

  • center cooling, to real robots.

  • You can apply it to math, chemistry,

  • the cute little robot in the middle, and things

  • as complex as the game of Go.

  • Like DeepMind applied to AlphaGo and beat

  • the best player in the world by using reinforcement learning.

  • Now, let me switch a little bit to TF-Agents and what it is.

  • So the main idea of TF-Agents: doing reinforcement learning

  • is not very easy.

  • It requires a lot of tools and a lot of things

  • that you need to build on your own.

  • So we built this library that we use at Google,

  • and we open sourced it so everybody can

  • use it, to make reinforcement learning a lot easier to use.

  • So we make it very robust.

  • It's scalable, and it's good for beginners.

  • If you are new to RL, we have a lot

  • of notebooks and example documentation

  • that you can start working with.

  • And also, for complex problems, you

  • can apply it to real, complex problems

  • and use it for realistic cases.

  • For people who want to create their own algorithm,

  • we also make it easy to add new algorithms.

  • It's well tested and easy to configure.

  • And furthermore, we built it on top of TensorFlow 2.0,

  • which you probably heard about at Google I/O before.

  • And we made it in such a way that developing and debugging

  • are a lot easier.

  • You can use TF eager mode, Keras, and tf.function

  • to make things a lot easier to build.

  • Very modular, very extensible.

  • Let me cover a little bit the main pieces of the software,

  • so then when we go through the examples,

  • you have a better sense.

  • On the left side, we have all the data collection.

  • When we play this game, we are going to collect data.

  • We are going to play the game.

  • We're collecting data so we can learn from it.

  • And on the right side, we're going

  • to have a training pipeline.

  • When we have the data-- like a data set,

  • or logs, or games we played-- we're going to train and

  • improve our model-- in this case, the neural net--

  • then deploy it, collect more data, and repeat.

  • So now, let me hand it over to Eugene,

  • who is going to go over the CartPole example.

  • EUGENE BREVDO: Thanks, Sergio.

  • Yeah, so the first example we're going to go over

  • is a problem called CartPole.

  • This is one of the classical control problems

  • where imagine that you have a pole in your hand,

  • and it wants to fall over because of gravity.

  • And you kind of have to move your hand left and right

  • to keep it upright.

  • And if it falls over, then game over.

  • If you move off the screen by accident, then game over.

  • So let's make that a little bit more concrete.

  • In this environment, the observation is not the images

  • that you see here.

  • Instead, it's a four-element vector containing

  • angles and velocities of the pole and the cart.

  • The actions are the values 0 and 1,

  • representing taking a left or a right.

  • And the reward is the value 1.0 every time step or frame

  • that the pole is up and hasn't fallen over more

  • than 15 degrees from vertical.

  • And once it has, the episode ends.

  • OK, so if you were to implement this problem or environment

  • yourself, you would subclass the TF-Agents PyEnvironment class,

  • and you would provide two properties.

  • One is called the observation spec,

  • and that defines what the observations are.

  • And you would implement the action spec property,

  • and that describes what actions the environment allows.

  • And there are two major methods.

  • One is reset, which resets the environment

  • and brings the pole back to the center and vertical.

  • And the step method, which accepts the action, updates

  • any internal state, and emits the observation and the reward

  • for that time step.
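
To make that concrete, here is a minimal sketch of what such a subclass could look like, assuming the TF-Agents PyEnvironment API described above; the class name and the simplified dynamics are illustrative placeholders, not the real CartPole physics.

```python
# Minimal sketch of a custom environment using the PyEnvironment API
# described above; the class name and the simplified "physics" are
# placeholders, not the real CartPole dynamics.
import numpy as np
from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts


class MyCartPoleEnv(py_environment.PyEnvironment):

    def __init__(self):
        # Actions: 0 = push left, 1 = push right.
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=1, name='action')
        # Observations: cart position/velocity, pole angle/angular velocity.
        self._observation_spec = array_spec.ArraySpec(
            shape=(4,), dtype=np.float32, name='observation')
        self._state = np.zeros(4, dtype=np.float32)
        self._episode_ended = False

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _reset(self):
        # Bring the pole back to the center and vertical.
        self._state = np.zeros(4, dtype=np.float32)
        self._episode_ended = False
        return ts.restart(self._state)

    def _step(self, action):
        if self._episode_ended:
            return self.reset()
        # Placeholder dynamics: the pole drifts unless pushed the other way.
        push = -1.0 if action == 0 else 1.0
        self._state[3] += 0.05 * push      # angular velocity
        self._state[2] += self._state[3]   # pole angle
        self._state[1] += 0.1 * push       # cart velocity
        self._state[0] += self._state[1]   # cart position
        # End the episode once the pole leans past ~15 degrees (0.26 rad).
        if abs(self._state[2]) > 0.26:
            self._episode_ended = True
            return ts.termination(self._state, reward=0.0)
        return ts.transition(self._state, reward=1.0, discount=1.0)
```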

  • Now, for this particular problem,

  • you don't have to do that.

  • We support OpenAI Gym, which is a very popular framework

  • for environments in Python.

  • And you can simply load CartPole from that.

  • That's the first line.

  • And now you can perform some introspection.

  • You can interrogate the environment, say what is

  • Observation Spec.

  • Here, you can see that it's a four-element vector of floating

  • point values,

  • again describing the angles and velocities of the pole.

  • And the Action Spec is a scalar integer

  • taking on values 0 and 1, representing left and right.
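
A minimal sketch of those two steps, loading CartPole from the Gym suite and inspecting its specs (the environment name string is an assumption):

```python
# Minimal sketch: load CartPole from the Gym suite and inspect its specs
# (the environment name string is an assumption).
from tf_agents.environments import suite_gym

env = suite_gym.load('CartPole-v1')
print(env.observation_spec())  # 4-element float vector: positions/velocities
print(env.action_spec())       # scalar int taking values 0 or 1: left or right
```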

  • So if you had your own policy that you had built,

  • maybe a scripted policy, you would

  • be able to interact with the environment

  • by loading it, building your policy object,

  • resetting the environment to get an initial state,

  • and then iterating over and over again,

  • passing the observation or the state to the policy,

  • getting an action from that, passing the action back

  • to the environment, maybe calculating your return, which

  • is the sum of the rewards over all steps.
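
A rough sketch of that interaction loop, using a built-in random policy as a stand-in for your own scripted policy:

```python
# Rough sketch of the interaction loop described above, with a built-in
# random policy standing in for your own scripted policy.
from tf_agents.environments import suite_gym
from tf_agents.policies import random_py_policy

env = suite_gym.load('CartPole-v1')
policy = random_py_policy.RandomPyPolicy(
    time_step_spec=env.time_step_spec(), action_spec=env.action_spec())

time_step = env.reset()          # initial state
episode_return = 0.0
while not time_step.is_last():
    action_step = policy.action(time_step)       # observation -> action
    time_step = env.step(action_step.action)     # action -> next observation
    episode_return += time_step.reward           # return = sum of rewards
print('Return:', episode_return)
```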

  • Now, the interesting part comes when

  • you want to make a trainable policy

  • and you want it to learn from its successes in the environment.

  • To do that, we put a neural network in the loop.

  • So the neural network takes in the observations.

  • In this case-- and the algorithm that we're talking about is

  • called policy gradients, also known as REINFORCE--

  • it's going to emit probabilities over the actions that

  • can be taken.

  • So in this case, it's going to emit

  • a probability of taking a left or a probability of taking

  • a right, and that's parameterized

  • by the weights of the neural network, called theta.

  • And ultimately, the goal of this algorithm

  • is going to be modifying the neural network over time

  • to maximize what's called the expected return.

  • And as I mentioned, the return is

  • the sum of the rewards over the duration of the episode.

  • And you can't really calculate it directly--

  • this expectation is difficult to calculate

  • analytically.

  • So what we're going to do is we're going to sample episodes

  • by playing, we're going to get trajectories,

  • and we're going to store those trajectories.

  • These are observation-action pairs over the episode.

  • We're going to add up the rewards.

  • And that's our Monte Carlo estimate of the return.

  • OK?

  • And we're going to use a couple of tricks

  • to convert that expectation optimization problem into a sum

  • that we can optimize using gradient descent.

  • I'm going to skip over some of the math,

  • but basically, what we use is something called the log

  • trick to convert this gradient problem

  • into the gradient over the outputs of the neural network.

  • That's that log pi theta right there.

  • That's the output of the network.

  • And we're going to multiply that by the Monte Carlo

  • estimate of the returns.

  • And we're going to average over the time steps

  • within the episode and over many batches of episodes.
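
In standard REINFORCE notation, the estimator being described looks roughly like this (a sketch only; refinements such as baselines or discounting are omitted):

```latex
\nabla_\theta J(\theta) \;\approx\;
  \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T_n}
  \nabla_\theta \log \pi_\theta\!\left(a^{(n)}_t \mid s^{(n)}_t\right) R^{(n)},
\qquad
R^{(n)} = \sum_{t} r^{(n)}_t
```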

  • Putting this into code--

  • and by the way, we implement this for you,

  • but that's kind of pseudo-code here.

  • You get this experience when you're training,

  • you extract its rewards, and you do a cumulative sum type

  • operation to calculate the returns.

  • Then, you take the observations over all the time steps,

  • and you calculate the logits, the log probabilities coming out

  • of the neural network.

  • You pass those to a distribution object--

  • this is a TensorFlow probability distribution object--

  • to get the distributions over the action.

  • And then you can calculate the full log probability

  • of the actions that were taken in your trajectories

  • and your logs, and calculate this approximation

  • of the expectation and take its gradient.
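
A rough, illustrative version of that pseudo-code; TF-Agents implements this for you inside its REINFORCE agent, so treat this as a sketch in which `logits_model` is assumed to be any model mapping observations to per-step action logits.

```python
# Rough, illustrative version of that pseudo-code; TF-Agents implements this
# for you inside its REINFORCE agent. `logits_model` is assumed to be any
# model mapping a batch of episode observations to per-step action logits.
import tensorflow as tf
import tensorflow_probability as tfp


def reinforce_loss(logits_model, observations, actions, rewards):
    # Monte Carlo returns: reversed cumulative sum of rewards over time.
    returns = tf.cumsum(rewards, axis=1, reverse=True)
    # Logits for every observation, over all time steps.
    logits = logits_model(observations)
    # Distribution over actions and log-probability of the actions taken.
    dist = tfp.distributions.Categorical(logits=logits)
    log_prob = dist.log_prob(actions)
    # Negative Monte Carlo estimate of the expected return (we minimize).
    return -tf.reduce_mean(log_prob * tf.stop_gradient(returns))
```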

  • OK, so as an end user, you don't need

  • to worry about that too much.

  • What you want to do is you load your environment,

  • you wrap it in something called a TF Py environment.

  • And that eases the interaction between the Python

  • environment

  • and the neural network, which is being executed

  • by the TensorFlow runtime.

  • Now, you can also create your neural network.

  • And here, you can write your own.

  • And basically, it's a sequence of Keras layers.

  • Those of you who are familiar with Keras,

  • that makes it very easy to describe your own architecture

  • for the network.

  • We provide a number of neural networks.

  • This one accepts a number of parameters that

  • configure the architecture.

  • So here, there are two fully connected layers

  • with sizes 32 and 64.

  • You pass this network and the specs

  • associated with the environment to the agent class.

  • And now you're ready to collect data and to train.
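
Putting the last few steps together, a minimal sketch of the setup being described (the layer sizes match the example above; the optimizer and learning rate are assumptions):

```python
# Minimal sketch of the setup just described: wrap the environment, build an
# actor network with layers of sizes 32 and 64, and create the REINFORCE
# agent. The optimizer and learning rate are assumptions.
import tensorflow as tf
from tf_agents.agents.reinforce import reinforce_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import actor_distribution_network

tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v1'))

actor_net = actor_distribution_network.ActorDistributionNetwork(
    tf_env.observation_spec(),
    tf_env.action_spec(),
    fc_layer_params=(32, 64))

agent = reinforce_agent.ReinforceAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    actor_network=actor_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))
agent.initialize()
```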

  • So to collect data, you need a place to store it.

  • And Sergio will talk about this more in the second example.

  • But basically, we use something called

  • replay buffers that are going to store these trajectories.

  • And we provide a number of utilities

  • that will collect the data for you,

  • and they're called drivers.

  • So this driver takes the environment,

  • takes the policy exposed by the agent,

  • and a number of callbacks.

  • And what it's going to do is it's

  • going to iterate collecting data, interacting

  • with the environment, sending it actions, collecting

  • observations, sending those to the policy-- it does all that for you.

  • And each time it does that, for every time step,

  • it stores that in the replay buffer.
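
A rough sketch of that wiring, reusing `tf_env` and `agent` from the previous snippet (the buffer size and number of episodes per run are assumptions):

```python
# Rough sketch of that wiring, reusing `tf_env` and `agent` from the previous
# snippet; buffer size and episodes-per-run are assumptions.
from tf_agents.drivers import dynamic_episode_driver
from tf_agents.replay_buffers import tf_uniform_replay_buffer

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=10000)

# The driver runs the collect policy in the environment and calls each
# observer (here, just the replay buffer) on every step it takes.
driver = dynamic_episode_driver.DynamicEpisodeDriver(
    tf_env,
    agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_episodes=2)
```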

  • So to train, you iterate calling driver.run, which

  • populates the replay buffer.

  • Then you pull out all of the trajectories in the replay

  • buffer with gather_all, you pass those

  • to agent.train, which updates the underlying neural networks.

  • And because policy gradients is something called an

  • on-policy algorithm, all that hard-earned

  • data that you've collected, you

  • have to throw away and collect more.

  • OK?
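
And a minimal sketch of that training loop, continuing from the driver and replay buffer above (the iteration count is an assumption):

```python
# Minimal sketch of the training loop just described, continuing from the
# driver and replay buffer above; the iteration count is an assumption.
for _ in range(400):
    driver.run()                              # collect a few episodes
    experience = replay_buffer.gather_all()   # pull out all trajectories
    agent.train(experience)                   # update the neural network
    replay_buffer.clear()                     # REINFORCE is on-policy:
                                              # discard and collect fresh data
```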

  • So that said, CartPole is a fairly straightforward

  • classical problem, as I mentioned.

  • And policy gradients is a fairly standard, somewhat simple

  • algorithm.

  • And after about 400 iterations of playing the game,

  • you can see that whereas you started

  • with a random policy, that can't keep the pole up at all.

  • After 400 iterations of playing the game,

  • you basically have a perfect policy.

  • And if you were to look at your TensorBoard

  • while you're training, you'd see a plot

  • like this, which shows that as the number of episodes

  • that are being collected increases, the total return--

  • which is the sum of the rewards over the episode--

  • goes up pretty consistently.

  • And at around 400 to 500 episodes, we have a perfect policy

  • that runs for 200 steps, at which point the environment says,

  • all right, you're good, you win.

  • And then you're done.

  • OK, so I'm going to hand it back over to Sergio

  • to talk about Atari and deep Q-learning.

  • SERGIO GUADARRAMA: Thank you, again.

  • So now we're going back to this example

  • that I talked at the beginning about how to play this game.

  • And now we're going to go through more details

  • how this actually works, and how this deep Q-learning works

  • to help us in this case.

  • So let's go back to our setting.

  • Now we have our environment where

  • we're going to be playing.

  • We're going to get some observations,

  • in this case, frames.

  • The agent's role is to produce different actions

  • like go left with the paddle or go right, and get some rewards

  • in the process, and then improve over time

  • by basically incorporating those rewards into the model.

  • Let's take a step back and say,

  • what if, while I'm playing Breakout,

  • I have seen so far what I've been doing,

  • the ball is going somewhere, I'm moving

  • in a certain direction-- and then, what should I do now?

  • Should I go to the right or should I go to the left?

  • If I knew what was going to happen, it would be very easy.

  • If I knew, oh, the ball is going to go this way,

  • things are going to be like that, that would be easy.

  • But we don't easily know that.

  • We don't know what's going to happen in the future.

  • Instead, what we're going to do, we

  • are going to try to estimate if I move to the right,

  • maybe the ball will drop.

  • It's more likely that the ball will drop

  • because I'm moving in the opposite direction

  • from where the ball is going.

  • And if I move to the left, on the contrary,

  • I'm going to hit the ball, I'm going to hit some bricks,

  • I'm getting closer to clearing all the bricks.

  • So the idea is, I want to learn

  • a model that can estimate that--

  • whether this action is going to make things go better in the future

  • or is going to make them go worse.

  • And that's something that we call expected return.

  • So this is the notion that Eugene was talking about before--

  • before, we were just computing, just summing

  • up the rewards.

  • And here, we're going to say, I want

  • to estimate, for this action, how much reward it's

  • going to give me in the future.

  • And then, I choose whatever,

  • according to my estimate, is the best action.

  • So we can formulate this using math.

  • It's basically an expectation over the sum

  • of the rewards into the future.

  • And that's what we call the Q function, or critic.

  • It's also a critic because it's going to basically tell us

  • given some state and possible actions, which

  • action is actually better?

  • It criticizes in some way: like, if you take this action,

  • my expectation of the return is very high.

  • And if you take a different action, my expectation is low.
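
In standard notation, the Q function being described is roughly this (the discount factor gamma is the usual convention; it is not stated explicitly in the talk):

```latex
Q(s, a) \;=\;
  \mathbb{E}\!\left[\; \sum_{t \ge 0} \gamma^{t} \, r_t \;\middle|\; s_0 = s,\ a_0 = a \right]
```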

  • And then what we're going to do, we're

  • going to learn these Q functions.

  • Because we don't know.

  • We don't know what's going to happen.

  • But by playing, we can learn what

  • is the expected return by comparing our expectation

  • with the actual returns.

  • So we are going to use our Q function-- and in this case,

  • a neural net--

  • to learn this model.

  • And once we have a learned model,

  • then we can just take the best action according to our model

  • and play the game.

  • So conceptually, this looks similar to what we saw before.

  • We're going to have another neural net.

  • In this case, the output is going

  • to be the Q values, this expectation

  • of our future returns.

  • And the idea is we're going to get an observation,

  • in this case, the frames.

  • We're going to maybe have some history about it.

  • And then we're going to produce some Q values, which are

  • like our current expectation if I move to the left,

  • and my current expectation if I move to the right.

  • And then I'm going to compare my expectation

  • with what will actually happen.

  • And if basically my expectation is too high,

  • I'm going to lower it down.

  • And if my expectation is too low, I'm going to increase it.

  • So that way, we're going to change

  • the weight of this network to basically improve over time

  • by playing this game.

  • Going back to how you do this in code--

  • basically, we're going to load this environment,

  • in this case from the Atari suite, which is also

  • available through OpenAI Gym.

  • We're going to say, OK, load the Breakout game.

  • And now, we are ready to play.

  • We're going to have some observations--

  • we'll define what kind of observations

  • we have; in this case, it's frames of 84

  • by 84 pixels-- and we also have multiple actions we can take.

  • In this game, we can only go left and right,

  • but there are other games in this suite that

  • can have different actions--

  • maybe jumping, firing, and doing other things

  • that different games have.
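
A rough sketch of loading Breakout from the Atari suite and inspecting its specs (the exact environment name string is an assumption):

```python
# Rough sketch of loading Breakout from the Atari suite and inspecting its
# specs; the exact environment name string is an assumption.
from tf_agents.environments import suite_atari

env = suite_atari.load('BreakoutNoFrameskip-v4')
print(env.observation_spec())  # e.g. 84x84 frames after preprocessing
print(env.action_spec())       # the discrete actions this game allows
```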

  • So now, we want to implement that notion we described before.

  • We're going to define this Q network.

  • Remember, it's a neural net that is going

  • to represent these Q values.

  • We're going to have some parameters that

  • define how many layers and how many units we want

  • to have, and all those things.

  • And then, we're going to have the DQN agent that's

  • going to take the network and an optimizer, which

  • is going to basically be able to improve this network over time,

  • given some experience.
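
A minimal sketch of that Q-network and DQN agent setup, continuing from the Atari environment above; the convolutional architecture, frame preprocessing, optimizer, and learning rate are assumptions:

```python
# Minimal sketch of the Q-network and DQN agent setup being described,
# continuing from the Atari environment above; the convolutional
# architecture, frame preprocessing, optimizer, and learning rate are
# assumptions.
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import tf_py_environment
from tf_agents.networks import q_network

tf_env = tf_py_environment.TFPyEnvironment(env)  # env from the Atari snippet

q_net = q_network.QNetwork(
    tf_env.observation_spec(),
    tf_env.action_spec(),
    # Cast the uint8 frames to floats in [0, 1] before the conv layers.
    preprocessing_layers=tf.keras.layers.Lambda(
        lambda frames: tf.cast(frames, tf.float32) / 255.0),
    conv_layer_params=((32, 8, 4), (64, 4, 2), (64, 3, 1)),
    fc_layer_params=(512,))

agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=2.5e-4))
agent.initialize()
```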

  • So this experience, we're going to assume

  • we have collected some data and we have played the game.

  • And maybe not very well at the beginning,

  • because we are doing random actions, for example.

  • So we're not playing very well,

  • but we can get some experience, and then we

  • can improve over time basically.

  • We try to improve our estimates.

  • Every time we improve, we play a little better.

  • And then we collect more data.

  • And then the idea is that this agent

  • is going to have a train method that

  • is going to go through this experience,

  • and is going to improve over time.

  • In general, for cases where games or environments are too slow,

  • we don't want to play one game at a time.

  • These computers can play multiple games in parallel.

  • So we have this notion of parallel environments, where

  • you can play multiple copies of the same game at the same time.

  • So we can make learning a lot faster.

  • And in this case, we are playing four games in parallel

  • with the policy that we just defined.

  • And in parallel, we can just play four games

  • at the same time.

  • So the agent in this case will try

  • to play four games at the same time.

  • And that way, we'll get a lot more experience

  • and can learn a lot faster.
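
A rough sketch of playing four copies of the game in parallel, as described (the environment name is again an assumption):

```python
# Rough sketch of playing four copies of the game in parallel, as described;
# the environment name is again an assumption.
from tf_agents.environments import (parallel_py_environment, suite_atari,
                                    tf_py_environment)

def make_env():
    return suite_atari.load('BreakoutNoFrameskip-v4')

parallel_env = parallel_py_environment.ParallelPyEnvironment([make_env] * 4)
tf_env = tf_py_environment.TFPyEnvironment(parallel_env)
```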

  • So as we mentioned before, when we have collected all this data

  • by playing this game, in this case,

  • we don't want to throw away the data.

  • We can use it to learn from.

  • So we're going to have this replay buffer, which

  • is going to keep all the data we're collecting,

  • like different games will go in different positions

  • so we don't mix the games.

  • But we're going to just throw all the data in some replay

  • buffer.

  • And putting that into code, it's simple.

  • We have this environment.

  • We create the replay buffer we have already defined.

  • And then basically, we use the driver, and--

  • more important than this-- add to the replay buffer.

  • Every time you play, every time you take an action in this game,

  • add it to the replay buffer.

  • So later, the agent can train on all that experience.

  • And because DQN is an off-policy method--

  • which is different from the previous method, which was on-policy--

  • in this case, we can actually use all the data.

  • We can keep all the data around and keep training

  • on all that data too.

  • We don't need to throw it away.

  • And that's very important because it

  • makes learning more efficient.

  • What we're going to do, when we have

  • collected the data in this replay buffer,

  • is we're going to sample from it.

  • We're going to sample a different set of games,

  • different parts of the game.

  • We're going to say, OK, let's try to replay the game

  • and maybe get a different outcome this time.

  • What action will you take if you were in the same situation?

  • Maybe you moved to the left, and the ball dropped.

  • So maybe now you want to move to the right.

  • So that's the way the model is going

  • to be learning: by basically sampling games that you played

  • before, and improving your Q function, which is going

  • to change the way you behave.
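
A minimal sketch of storing and sampling that experience; the replay buffer is built the same way as in the CartPole example, but from this agent's spec, and the buffer length and batch size are assumptions:

```python
# Minimal sketch of storing and sampling experience for DQN; the replay
# buffer is built as in the CartPole example but from this agent's spec,
# and the buffer length and batch size are assumptions.
from tf_agents.replay_buffers import tf_uniform_replay_buffer

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=100000)

# Sample mini-batches of past experience as a tf.data pipeline.
dataset = replay_buffer.as_dataset(
    sample_batch_size=32,
    num_steps=2,               # DQN trains on (s, a, r, s') transition pairs
    num_parallel_calls=3).prefetch(3)
iterator = iter(dataset)
```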

  • So now let's try to put these things back together.

  • Let's go slowly because there's a lot of pieces.

  • So we have our Q network we're going

  • to use to define the DQN agent in this case.

  • We're going to have the replay buffer where

  • we're going to put all the data we're

  • collecting as we play.

  • We have this driver, which basically

  • drives the agent in the game.

  • So it's basically going to be driving

  • the agent, making it play, and adding the experience to the replay buffer.

  • And then, once we have enough data,

  • we can basically iterate with that data.

  • We can iterate, get batches of experience, different samples.

  • And that's what we're going to do to train the agent.

  • So we are going to alternate, collect more data,

  • and train the agent.

  • So every time we collect, we train the agent,

  • the agent gets a little better, and we

  • want to collect more data, and we alternate.
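
A rough sketch of that alternation, continuing from the replay buffer and dataset above (the driver, step counts, and iteration count are assumptions):

```python
# Rough sketch of alternating collection and training, continuing from the
# replay buffer and dataset above; the driver, step counts, and iteration
# count are assumptions.
from tf_agents.drivers import dynamic_step_driver

collect_driver = dynamic_step_driver.DynamicStepDriver(
    tf_env,
    agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=4)                       # a few environment steps per iteration

for _ in range(100000):
    collect_driver.run()               # collect a little more experience
    experience, _ = next(iterator)     # sample a batch from the replay buffer
    agent.train(experience)            # improve the Q-network
```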

  • At the end, what we want to do is evaluate this agent.

  • So we have a method that says, OK, I

  • want to compute some metrics.

  • For example, how long are you playing

  • the game, how many points are you getting,

  • all those things that we want to compare--

  • metrics-- and we basically have these methods that say,

  • OK, take all these metrics and this environment,

  • take the agent's policy, and evaluate

  • for multiple games, multiple episodes, and so on.
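
A minimal sketch of that kind of evaluation using the built-in metrics utilities, reusing `tf_env` and `agent` from the snippets above (the number of evaluation episodes is an assumption):

```python
# Minimal sketch of that kind of evaluation using the built-in metrics
# utilities, reusing `tf_env` and `agent` from the snippets above; the
# number of evaluation episodes is an assumption.
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics

eval_metrics = [tf_metrics.AverageReturnMetric(),
                tf_metrics.AverageEpisodeLengthMetric()]
results = metric_utils.eager_compute(
    eval_metrics, tf_env, agent.policy, num_episodes=10)
```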

  • What this actually looks like is something like this.

  • For example, in the Breakout game,

  • the curves look like this.

  • At the beginning, we don't score any points.

  • We don't know how to move the paddle,

  • the ball just keeps dropping, and we just

  • lose the game over and over.

  • Eventually, we figure out that by moving the paddle

  • in different directions, the ball bounced back,

  • and it started hitting the bricks.

  • And after about 4 or 5 million frames, the model eventually

  • learns how to actually play this game.

  • And you can see, around 4 or 5 million

  • frames, basically, the model gets very good scores,

  • around 100 points.

  • It's breaking all these bricks, getting all these points,

  • and, you know, clearing all the bricks.

  • We also put graphs of different games

  • like Pong, which is basically two different paddles trying

  • to bounce the ball between them.

  • Enduro, Qbert, there's another like 50 or 60

  • games in this suite.

  • And you can basically just change one line of code

  • and play a different game.

  • I'm not going to go through those details,

  • but just to make clear that it's simple to play different games.

  • Now, let me hand it over back to Eugene,

  • who's going to talk a little more about the Minitaur.

  • Thanks, again.

  • EUGENE BREVDO: OK, so our third and final example

  • is the problem of the Minitaur robot

  • and kind of goes back to one of the first slides

  • that Sergio showed at the beginning of the talk,

  • learning to walk.

  • So there is a real robot.

  • It's called the Minitaur.

  • And here, it's kind of failing hard.

  • We're going to see if we can fix that.

  • The algorithm we're going to use is called Soft Actor Critic.

  • OK.

  • So again, on the bottom is some images of the robot.

  • And you can see it looks a little fragile.

  • We want to train it, and we want to avoid

  • breaking it in the beginning, when our policy can't really

  • stay up.

  • So what we're going to do is we're

  • going to model it in a physics simulator called PyBullet,

  • and that's what you see at the top.

  • And then, once we've trained it and we're

  • confident about the policy in that version,

  • we're going to transfer it back onto the robot

  • and do some final fine-tuning.

  • And here, we're going to focus on the training and simulation.

  • So I won't go into the mathematical details

  • of the Soft Actor Critic, but here's some fundamental aspects

  • of that algorithm.

  • One is that it can handle both discrete and continuous action

  • spaces.

  • Here, we're going to be controlling some motors

  • and actuators, so it's a fairly continuous action space.

  • It's data-efficient, meaning that all this hard earned data

  • that you run in simulation or you got from the robot,

  • you don't have to throw it away while you're training,

  • you can keep it around for retraining.

  • Also, the training is stable.

  • Compared to some other algorithms,

  • this one is less likely to diverge during training.

  • And finally, one of the fundamental aspects

  • of Soft Actor Critic is that it

  • combines an actor neural network and a critic neural network

  • to accelerate training and to keep it stable.

  • Again, for Minitaur, you can basically

  • do a pip install of PyBullet, and you'll

  • get Minitaur for free.

  • You can load it using the PyBullet suite in TF-Agents.

  • And if you were to look at this environment,

  • you'd see that there are about 28 sensors on the robot

  • that return floating point values,

  • different aspects of the configuration where

  • you are, forces, velocities, things like that.

  • And the action, there are eight actuators on the robot.

  • It can apply a force--

  • positive or negative, minus 1 to 1--

  • for each of those eight actuators.
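
A rough sketch of loading the Minitaur environment from the PyBullet suite and inspecting those specs (the environment name string is an assumption):

```python
# Rough sketch of loading the Minitaur environment from the PyBullet suite
# and inspecting those specs; the environment name string is an assumption.
from tf_agents.environments import suite_pybullet

env = suite_pybullet.load('MinitaurBulletEnv-v0')
print(env.observation_spec())  # ~28 sensor readings (angles, velocities, ...)
print(env.action_spec())       # 8 actuators, each taking a value in [-1, 1]
```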

  • Now, here's kind of bringing together the whole setup.

  • You can load four of these simulations,

  • have them running in parallel, and try

  • to maximize the number of cores that you're using

  • when you're collecting data.

  • And to do that, we provide the parallel Py environment, which

  • Sergio spoke about, wrapped in TF Py environment.

  • And now, we get down to the business

  • of setting up the neural network architecture for the problem.

  • First, we create the actor network.

  • And so what the actor network is going to do

  • is it's going to take these sensor observations, this 28

  • vector, and it's going to emit samples of actuator values.

  • And those samples are random draws from,

  • in this case, a Gaussian or normal distribution.

  • So as a result, this actor distribution network

  • takes something called a projection network.

  • And we provide a number of standard projection networks.

  • This one emits samples from a Gaussian distribution.

  • And the neural network that feeds into it

  • is going to be setting up the hyperparameters

  • of that distribution.

  • Now, the critic network, which is in the top right,

  • is going to take a combination of the current sensor

  • observations and the action sample

  • that the actor network emitted, and it's

  • going to estimate the expected return.

  • How much longer, given this action,

  • is my robot going to stay up?

  • How well is it going to gallop?

  • And that is going to be trained from the trajectories

  • from the rewards that you're collecting.

  • And that, in turn, is going to help train the actor.

  • So you pass these networks and these specs

  • to the Soft Actor Critic agent, and you can

  • look at its collection policy.

  • And that's the thing that you're going

  • to pass on the driver to start collecting data

  • and interacting with the environment.
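
A minimal, self-contained sketch of the actor and critic networks and the Soft Actor Critic agent being described; the layer sizes, optimizers, learning rates, and the environment name string are assumptions:

```python
# Minimal, self-contained sketch of the actor/critic networks and the Soft
# Actor Critic agent being described; layer sizes, optimizers, learning
# rates, and the environment name are assumptions.
import tensorflow as tf
from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent
from tf_agents.environments import suite_pybullet, tf_py_environment
from tf_agents.networks import actor_distribution_network

tf_env = tf_py_environment.TFPyEnvironment(
    suite_pybullet.load('MinitaurBulletEnv-v0'))
observation_spec = tf_env.observation_spec()
action_spec = tf_env.action_spec()

# Actor: maps the sensor readings to a Gaussian over the 8 actuator values.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    observation_spec, action_spec, fc_layer_params=(256, 256))

# Critic: takes (observation, action) and estimates the expected return.
critic_net = critic_network.CriticNetwork(
    (observation_spec, action_spec),
    joint_fc_layer_params=(256, 256))

agent = sac_agent.SacAgent(
    tf_env.time_step_spec(),
    action_spec,
    actor_network=actor_net,
    critic_network=critic_net,
    actor_optimizer=tf.keras.optimizers.Adam(3e-4),
    critic_optimizer=tf.keras.optimizers.Adam(3e-4),
    alpha_optimizer=tf.keras.optimizers.Adam(3e-4))
agent.initialize()

collect_policy = agent.collect_policy  # pass this to a driver, as before
```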

  • So I won't go into the details of actually doing

  • that because it's literally identical to the deep

  • Q-learning example before.

  • You need the replay buffer, and you use the driver

  • and you go through the same motion.

  • What I'm going to show is what you

  • should expect to see in the TensorBoard

  • while you're training the simulation.

  • On the top, you see the average episode length,

  • the average return as a function of the number of environment

  • steps that you've taken-- the number of time steps.

  • On the bottom, you see the same thing.

  • But on the x-axis, you see the number of episodes

  • that you've gone through.

  • And what you can see is that after about 13,000,

  • 14,000 simulated episodes, we're starting to really learn

  • how to walk and gallop.

  • The episode lengths get longer because it

  • takes longer to fall down.

  • And the average return also goes up

  • because it's also a function of how long we stay up

  • and how well we can gallop.

  • So again, this is the PyBullet simulation,

  • a rendering of the Minitaur. At the very beginning,

  • when the policy just emits random values,

  • the neural network emits random values--

  • it's randomly initialized--

  • and it can barely stay up, it basically falls over.

  • About halfway through training, it's

  • starting to be able to get up, maybe make a few steps,

  • falls over.

  • If you apply some external forces, it'll just fall over.

  • By about 16,000 iterations of this,

  • it's a pretty robust policy.

  • And it can stand, it can gallop.

  • If there's an external force pushing it over,

  • it'll be able to get back up and keep going.

  • And once you have that trained policy,

  • you can transfer it, export it as a SavedModel,

  • put it on the actual robot, and then start the fine tuning

  • process.

  • Once you've fine-tuned it, you have a pretty neat robot.

  • In my head, when I look at this video,

  • I think of the "Chariots of Fire" theme song.

  • I don't know if you've ever seen it, but it's pretty cool.

  • So now, I'm going to return it back to Sergio

  • to provide some final words.

  • SERGIO GUADARRAMA: Thank you, again.

  • So pretty cool, no?

  • You can see, from the beginning, how

  • it learns to walk, and we actually made this work in simulation.

  • But then, we can transfer it to a real robot

  • and make it work on the real robot.

  • So that's part of the goal of TF-Agents.

  • We want to make RL very easy.

  • You can download the code, you can scan over there

  • and go to the GitHub, start playing with it.

  • We have already a lot of different environments,

  • more than we talked today.

  • There's many more.

  • So we just covered three examples, but you can go there,

  • there's many other environments available.

  • We are hoping that Unity ML-Agents support

  • comes soon so you can also interact with the Unity

  • renderer.

  • Maybe there are some of you who are actually

  • interested in contributing your own environments,

  • your own problems.

  • We are also very happy to take pull requests

  • and contributions for everything.

  • For those of you who say, OK, those games are really good.

  • The games look nice, but I have my own problem.

  • What do I do?

  • So let's go back to the beginning, when

  • we talked about how you can define your own environment,

  • you can define your own task.

  • This is the main piece that you need to implement.

  • This is the API you need to follow

  • to bring your task or your problem to TF-Agents.

  • You define the specifications of your observations,

  • like what things can I see?

  • Can I see images, can I see numbers, what does that mean?

  • What actions do I have available?

  • Do I have two options, three options, 10 different options?

  • What are the possibilities I have?

  • And then the reset method because as we say,

  • while we're learning, we need to keep trying.

  • So we need to reset and start again.

  • And then the step function, which is

  • like, if I give you an action, what will happen?

  • How is the environment, how is the task going to evolve?

  • How is the state going to change?

  • And you need to tell me the reward.

  • Am I doing well?

  • Am I going in the right direction,

  • or am I going in the wrong direction?

  • So I can learn from it.

  • So this is the main piece of code

  • that you will need to implement to solve your own problem.

  • Additionally, we only talked about three algorithms,

  • but we have many more in the code base.

  • You can see here, there are many more coming.

  • So there's a lot of variety of different algorithms

  • with different strengths that you can

  • apply to different problems.

  • So you can just try different combinations

  • and see which one actually works for your problem.

  • And also, we are taking contributions from other people

  • who say, oh, I have this algorithm,

  • I want to implement this one, I have this new idea,

  • and maybe you can solve other problems with your algorithm.

  • And furthermore, we also apply this not only to these games,

  • but we apply it at Google, for example, in robotics--

  • in this really complex problem, where we

  • have multiple robots trying to learn how to grasp objects

  • and move them to different places.

  • So in this case, we have all these robots just trying

  • to grasp and fail at the beginning.

  • And eventually, they learned like, oh, where is the object?

  • How do I move the hand?

  • How do I close the gripper in the proper place?

  • And now how do I grasp it?

  • And this is a very complex task you can

  • solve with reinforcement learning.

  • Furthermore, you can also solve many other problems

  • for example, recommender systems like YouTube recommendations,

  • Google Play, Navigation, News.

  • Those are many other problems that you can basically

  • say, I want to optimize for my objective, my long term value.

  • Not only the short term, but like the long term value.

  • And RL is really good for that, when

  • you want to optimize for the long term

  • value, not only the short term.

  • Finally, we have put a lot of effort

  • to make this code available and make

  • it usable for a lot of people.

  • But at Google, we also defined these AI principles.

  • So when we developed all this code and made it available,

  • we followed these principles.

  • We want to make sure that it is used for things that benefit

  • society, that it doesn't reinforce unfair bias, that it

  • doesn't discriminate, that it is built

  • and tested for safety and privacy, and that

  • it's accountable.

  • We keep very high standards.

  • And we also want to make sure that everybody

  • who uses this code also embraces those principles

  • and tries to make it better.

  • And there's many applications we want to pursue.

  • We don't want this to be used for harm

  • or all these damaging things that we know could happen.

  • Finally, I want to thank the whole team.

  • You know, it's not just Eugene and me who made this happen.

  • There are other people behind it.

  • This is the TF-Agents team over here.

  • There's a lot of contributors that

  • have contributed to the code, and we

  • are very proud of all the work they

  • have done to make this happen, to make

  • this possible to be open source and available for everyone.

  • So as we said before, we want all of you

  • to join us in GitHub.

  • Go to the web page, download it, start playing with it.

  • A really good place to start is to go to the Colabs, the notebooks,

  • and say, OK, I want to try the REINFORCE example,

  • I want to try the DQN or the Soft Actor Critic.

  • We have notebooks you can play with; Google Cloud

  • will run all these examples for you.

  • And also, if you have issues or pull requests, we welcome them.

  • So we want you to be part of the community,

  • contribute to make this a lot better.

  • And furthermore, we are also looking

  • for new applications, what all of you

  • can do with these new tools.

  • There's a lot of new problems you can apply this to,

  • and we are looking forward to it.

  • So thank you very much, and hope to see you around.

  • [APPLAUSE]

  • [MUSIC PLAYING]

