  • OSCAR RAMIREZ: All right.

  • Well, thank you, everyone.

  • So I'm Oscar Ramirez.

  • This is Sergio.

  • And we're from the TF-Agents team.

  • And we'll talk to you guys about our project.

  • So for those of you that don't know,

  • TF-Agents is our reinforcement learning library

  • built in TensorFlow.

  • And it's hopefully a reliable, scalable,

  • and easy-to-use library.

  • We packaged it with a lot of Colabs, examples,

  • and documentation to try and make it easy for people

  • to jump into reinforcement learning.

  • And we use it internally to actually solve

  • a lot of difficult tasks with reinforcement learning.

  • In our experience, it's been pretty easy to develop new RL

  • algorithms.

  • And we have a whole bunch of tests,

  • making it easy to configure and reproduce results.

  • A lot of this wouldn't be possible without everyone's

  • contribution, so I just want to make it clear,

  • this has been a team effort.

  • There have been a lot of 20 percenters and

  • external contributors.

  • People have come and gone within the team, as well.

  • And so this is right now the biggest chunk

  • of the current team that is working on TF-Agents.

  • With that, I'll let Sergio talk a bit more about RL in general.

  • SERGIO GUADARRAMA: Thank you, Oscar.

  • Hi, everyone.

  • So we're going to focus a little more about reinforcement

  • learning and how this is different from other kinds

  • of machine learning-- unsupervised learning,

  • supervised learning, and other flavors.

  • Here are three examples.

  • One is robotics, another is a game,

  • and the other one is a recommendation system.

  • Those are clear examples where you can

  • apply reinforcement learning.

  • So the basic idea is--

  • so if you were to try to teach someone how to walk,

  • it's very difficult, because it's really difficult for me

  • to explain to you what you need to do to be able to walk--

  • coordinate your legs, in this case, of the robot-- or even

  • for a kid.

  • How you teach someone how to walk is really difficult.

  • They need to figure it out themselves.

  • How?

  • Trial and error.

  • You try a bunch of times.

  • You fall down.

  • You get up, and then you learn as you're falling.

  • And that's basically-- you can think of it like the reward

  • function.

  • You get a positive reward or a negative reward

  • every time you try.

  • So here, you can see also, even with the neural algorithms,

  • this thing is still hopping, no?

  • After a few trials of learning, this robot

  • is able to move around, wobble a little bit, and then fall.

  • But now it can control the legs a little more.

  • Not quite walk, but doing better than before.

  • Once fully trained, the robot

  • is able to walk from one place to another,

  • basically go to a specific location, and all those things.

  • So how this happens is basically summarized in this code.

  • Well, there's a lot of code, but over the presentation

  • we will go through the details.

  • Basically, we're summarizing all the pieces

  • you will need to be able to train a model like this,

  • and we will go into the details.

  • So what is reinforcement learning,

  • and how is that different, basically, from other cases?

  • The idea is we have an agent that

  • is trying to play, in this case, or interact

  • with an environment.

  • In this case, it's like Breakout.

  • So basically, the idea is you need

  • to move the paddle to the left or to the right to hit the ball

  • and break the bricks on the top.

  • So this one generates some observation

  • that the agent can observe.

  • It can basically process those observations and

  • generate a new action-- like whether to move the paddle

  • to the left or to the right.

  • And then based on that, they will get some reward.

  • In this case, it will be the score.

  • And then, using that information,

  • it will learn from this environment how to play.

  • So one thing with this that, I think, is

  • critical for people who have done

  • a lot of supervised learning is the main difference

  • between supervised learning

  • and reinforcement learning-- in supervised learning,

  • you can think of it as, for every action

  • that you take, they give you a label.

  • An expert will have labeled that case.

  • That is simple.

  • It'll give you the right answer.

  • For this specific image, this is an image of a dog.

  • This is an image of a cat.

  • So you know what is the right answer,

  • so every time you make a mistake,

  • I will tell you what is the right answer to that question.

  • In reinforcement learning, that doesn't happen.

  • Basically, you are playing this game.

  • You are interacting with the game.

  • You take a batch of actions,

  • and you don't know which one was the right action, what

  • was the correct action, and what was the wrong one.

  • You only know this reward function tells you, OK, you

  • are doing kind of OK.

  • You are not doing that well.

  • And based on that, you need to infer, basically,

  • what other possible actions you could have taken to improve

  • your reward, or maybe you're doing well now, but maybe later

  • you do worse.

  • So it's also a dynamic process going on over here.

  • AUDIENCE: How is the reward function

  • different from the label?

  • SERGIO GUADARRAMA: So I think the main difference is this.

  • The reward function is only an indicator

  • of whether you are doing well or badly, but it

  • doesn't tell you what is the precise action you

  • need to take.

  • The label is more like the precise outcome of the model.

  • You can think, in supervised learning,

  • I tell you what is the right action.

  • I tell you the right answer.

  • If I give you a mathematical problem, I'm going to say,

  • x is equal to 2.

  • That is the right answer.

  • If I tell you, you are doing well,

  • you don't know what was the actual answer.

  • You don't know if it was x equal 2 or x equal 3.

  • If I tell you it's the wrong answer,

  • you're still not going to know what the right answer was.

  • So basically that's the main difference between having

  • a reward function that only indicates--

  • it gives you some indication about whether you are doing

  • well or not, but doesn't give you the proper answer--

  • or the optimal answer, let's say.

  • AUDIENCE: Is the reward better to be very general instead

  • of very specific?

  • SERGIO GUADARRAMA: Mhm.

  • AUDIENCE: Like you are doing well,

  • instead of whether you are moving in the right direction.

  • OSCAR RAMIREZ: It depends quite a bit on the environment.

  • And there is this whole problem of credit assignment.

  • So trying to figure out what part of your actions

  • were the ones that actually led to you receiving this reward.

  • So if you think about the robot hopping,

  • you could give it a reward, that may be

  • its current forward velocity.

  • And you're trying to maximize that,

  • and so the robot should learn to run as fast as possible.

  • But maybe bending the legs down so

  • you can push yourself forward will help you move forward

  • a lot faster, but maybe that action will actually move you

  • backwards a little bit.

  • And you might even get punished instantaneously

  • for that action, but it's part of the whole set of actions

  • during an episode that will lead you to moving forward.

  • And so the credit assignment problem is like,

  • all right, there are a set of actions

  • that we might have even gotten negative reward,

  • but we need to figure out that those actions led

  • to positive reward down the line.

  • And the objective is to maximize the discounted return.

  • So a discounted sum of rewards over a number of time steps.
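
As a point of reference, the discounted return the speakers are describing is usually written as follows (a standard RL definition, not something specific to TF-Agents), where gamma in [0, 1] is the discount factor:

    G_t = \sum_{k=0}^{T} \gamma^{k} \, r_{t+k+1}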

  • SERGIO GUADARRAMA: Yeah, that's a critical part.

  • We care about long-term value.

  • It's not only the immediate reward.

  • It's not only you telling me plus 1.

  • That's not so important, because I want

  • to know not whether I'm playing the game well right now,

  • but whether I'm going to win the game at the end--

  • that's what I really care about.

  • Am I going to be able to move the robot to that position?

  • What happens in the middle-- sometimes those things are OK.

  • Some things are not bad.

  • But sometimes, I make an action.

  • Maybe I move one leg and I fall.

  • And then I could not recover.

  • But then maybe it was a movement I did 10 steps ago

  • that made my leg wobble.

  • And now, how do I connect which action made me fall?

  • Sometimes it's not very clear.

  • Because it's multiple actions-- in some cases,

  • even thousands of actions-- before you

  • get to the end of the game, basically.

  • You can think also that, in the games, now

  • that I've gotten all those things,

  • is this stone really going to make you lose?

  • Probably there's no single stone

  • that's going to make you lose, but 200 positions down

  • the line, that stone was actually very critical.

  • Because that has a ripple effect on other actions

  • that happen later.

  • And then you need to be able to estimate this credit

  • assignment for which actions I need to change to improve

  • my reward, basically, overall.

  • So I think this is to illustrate also a little farther,

  • different models of learning.

  • What we said before, supervised learning

  • is more about the classical classroom.

  • There's a teacher telling you the right

  • answer, memorize the answer, memorize that.

  • And that's what we do in supervised learning.

  • We almost memorize the answers with some generalization.

  • Mostly that's what we do.

  • And then in reinforcement learning,

  • it's not so much about memorizing the answer.

  • Because even if I do the same actions,

  • in a different setting, if I say,

  • OK, go to the kitchen in my house, and I say,

  • oh, go to the left, second door to the right.

  • And then I say, OK, now go to [? Kate's ?] house

  • and go to the kitchen.

  • If you apply the same algorithm, you

  • will get maybe into the bathroom.

  • Like, you'll go two doors to the right,

  • and then you go to the wrong place.

  • So even memorizing the answer is not good enough.

  • You know what I mean?

  • You need to adapt to the environment.

  • So that's what makes reinforcement learning

  • a little more challenging, but also more

  • applicable to many other problems in reality.

  • You need to play around.

  • You need to interact with the environment.

  • There's no such thing as, I can

  • think about what's going to be the best plan ahead of time

  • and never play with the environment.

  • We tried to write down some of these things

  • that we just mentioned, about how you

  • need to interact with the environment

  • to be able to learn.

  • This is very critical.

  • If you don't interact, if you don't try to walk,

  • if the robot doesn't try to move,

  • it cannot learn how to walk.

  • So you need to interact with the environment.

  • Also it will put you in weird positions and weird places,

  • because you may end up at the end of a corridor,

  • or in a [INAUDIBLE] position, or maybe even in unsafe cases.

  • There's another research also going on about safe RL.

  • How do I explore the world such that I don't break my robot?

  • Like, if you apply a really strong force, you may break the robot.

  • But probably you don't want to do that.

  • Because you need to keep interacting with the world.

  • And also, we collect data while we're training.

  • So as we're learning, we're collecting new data, fresh data

  • all the time.

  • So the data set is not fixed like in supervised learning.

  • We typically assume in supervised learning

  • that we have an initial data set at the beginning,

  • and then you just iterate over and over.

  • And here, as you learn, you get fresh data,

  • and then the data changes.

  • The distribution of the data changes as you get more data.

  • And you can see that also for example in a labyrinth.

  • You don't know where you're going.

  • At the beginning, you're probably lost all the time.

  • And you maybe end up always in the same places.

  • And maybe there are different parts of the labyrinth you never

  • explore.

  • So you don't even know about that.

  • So you cannot learn about it, because you have never explored

  • it.

  • So the exploration is very critical in RL.

  • It's not only you want to optimize and exploit

  • the model that you have.

  • You also need to explore.

  • Sometimes, you actually need to do

  • what you think is the wrong thing, which is basically

  • go to the left here, because you've never been there,

  • just to basically explore new areas.

  • Another thing is like what we said

  • before, nobody's going to tell you what is the right answer.

  • And actually, in many cases there's not a right answer.

  • There are multiple ways to solve the problem.

  • The reward only gives you an indication

  • of whether you are going down the right path or not.

  • But it doesn't tell you what is the right answer.

  • To train this model, we use a lot

  • of different surrogate losses, which

  • means also they are not actually correlated with performance.

  • Usually, it's very common, and you will see in a moment--

  • when the model is learning, the loss goes up.

  • When the model is not learning, the loss goes down.

  • So basically, loss going down is usually a bad sign.

  • If your losses stay at zero, you are learning nothing.

  • So you will see in a second how

  • debugging these models becomes much more tricky

  • than in supervised learning.

  • We look at loss, our losses go down.

  • Beautiful.

  • And you take [INAUDIBLE] and the loss always goes down.

  • You do something wrong, the loss goes up.

  • Otherwise, the loss keeps going down.

  • In RL, that's not the case.

  • First, we have multiple losses to train.

  • And many of them actually don't correlate with performance.

  • They will go up and down--

  • it looks like random, almost.

  • So it's very hard to debug or tune these algorithms

  • because of that.

  • You actually need to evaluate the model.

  • It's not enough-- the losses are not

  • enough to give you a good sense of whether you are doing well or not.

  • In addition to that, that means we require multiple optimizers,

  • multiple networks, multiple ways to update the variables,

  • and all those things.

  • Which means the typical supervised learning training

  • loop or model.fit doesn't work for RL, basically.

  • There are many ways we need to update the variables.

  • Some of them don't use optimizers; for some of them

  • we have multiple optimizers with different frequencies,

  • in different ways.

  • Sometimes we optimize one model against a different model

  • and things like that.

  • So basically, how we update the models

  • is very different from the typical way

  • of supervised learning, even though we use some supervised

  • learning losses, basically.

  • Some of the losses are basically supervised,

  • basically regression losses, something like that,

  • or cross-entropy.

  • So we use some of those losses in different ways, basically.

  • So probably, this graph is not very

  • surprising to most people who have used supervised learning

  • in the last years.

  • It used to be different in the past.

  • But now with neural networks, it usually always looks like this.

  • You start training your model, your classification loss

  • goes down.

  • Usually your regularization goes up,

  • because your [INAUDIBLE] is actually learning something,

  • so they are moving.

  • But your total loss-- the overall total loss--

  • still goes down.

  • Regularization loss tends to stabilize, usually,

  • or go down after learning.

  • But basically, usually you can guide yourself

  • by your cross-entropy loss or total loss

  • to be a really good guide that your model is learning.

  • And if the loss doesn't go down, then your model

  • is not learning, basically.

  • You know that.

  • I still remember when I was outside Google and trying

  • to train a neural net, the first neural net.

  • And I couldn't get the loss down.

  • The loss was stable.

  • And initialization couldn't get it back down.

  • And then I needed to ask Christian Szegedy, like,

  • what do you do?

  • How did you do it?

  • He's like, oh, you need to initialize

  • the variables this way.

  • You have to do all these extra tricks.

  • And when I did all the tricks he told me,

  • all of a sudden the losses started going down.

  • But once the losses start going down,

  • the model starts learning very quickly.

  • This is what it looks like in many cases in RL.

  • We have the actor loss that's going up.

  • In this case, it's actually good,

  • because it's learning something.

  • We have this alpha loss, which is

  • almost like noise around zero, fluctuates quite a bit.

  • And the critic loss in this case just collapsed, basically.

  • At the beginning, it was very high, and all of a sudden

  • it got very small, and then it doesn't move from there.

  • But this model is actually good.

  • This model is learning well.

  • [CHUCKLES] So you see all these--

  • and there's not like a total loss.

  • You cannot aggregate these losses,

  • because each one of these losses is optimized in a different part

  • of the model.

  • So we optimize each one of them individually.

  • But in other cases, you will see the losses,

  • and then usually, the loss will go up, especially

  • sometimes when you're learning something,

  • because you can think about it this way.

  • You are trying to go through the environment,

  • and you see a new room you've never seen.

  • It's going to be very surprising for the model.

  • So the model is going to try to fit this new data,

  • and it's basically going to be out of the distribution.

  • So the model is going to say, I don't know,

  • this looks really different to everything I've seen before.

  • So the loss goes up.

  • When it basically learns about this new data,

  • then the loss will go down again.

  • So it's very common that we have many patterns

  • that the loss goes up and down as the model starts learning

  • and discover more rooms and more spaces in the environment.

  • AUDIENCE: But how do we know the model is doing well if we

  • don't--

  • SERGIO GUADARRAMA: So we need to look basically at the reward.

  • So the other function that we said that we actually

  • compute the [INAUDIBLE] reward.

  • And then basically we take a model,

  • run it through the environment, and compute

  • how well it's performing.

  • So the loss itself doesn't tell us that.

  • AUDIENCE: You're talking about not

  • the rewards during the training, but a separate reward

  • where you--

  • SERGIO GUADARRAMA: You can do both.

  • You can do both.

  • You can compute reward during training.

  • And that already give you a very good signal.

  • AUDIENCE: But during the training,

  • it would be misleading.

  • Because if you haven't explored something,

  • then you won't see that it wasn't really good.

  • SERGIO GUADARRAMA: It's still misleading, exactly.

  • Yeah.

  • So we do usually both.

  • OSCAR RAMIREZ: And it's even more deceiving,

  • because when you have a policy that you're

  • using to collect data to train on, you, most of the time,

  • will have some form of exploration within that.

  • Every 10 steps you'll do a random action,

  • and that will lead to wildly different rewards over time.

  • AUDIENCE: But why is it not misleading even if you do it

  • separately from training?

  • Because ultimately, if your policy

  • is such that it doesn't really explore much, it will always--

  • when you throw that policy into a test environment,

  • and you no longer modify it, whatever,

  • but it might still-- if the policy is just very naive

  • and doesn't want to explore much,

  • it would look great, because it does everything fine.

  • But how would you know that it actually hasn't left--

  • OSCAR RAMIREZ: So when we're doing evaluations,

  • we want to exploit what we've learned.

  • So at that point, we're trying to use

  • this to complete the task that we're trying to accomplish

  • by training these models.

  • And so there, we don't need to explore.

  • Now we're just trying to exploit what we've learned.

  • AUDIENCE: But if it's not ready to react

  • to certain things that-- like, if it hasn't explored the space

  • so that in common situations it would still do well,

  • but it hasn't explored it enough that if it encounters

  • some issues it doesn't know what to do,

  • then that would not be really reflected by the reward.

  • OSCAR RAMIREZ: Yeah.

  • So you need to evaluate over a certain number of episodes.

  • And Sergio has a slide on like--

  • SERGIO GUADARRAMA: Like, maybe--

  • probably what you say.

  • Like, actually, evaluating once is not enough.

  • We usually evaluate it maybe 100 times,

  • from different initial conditions, all of that.

  • And then we average.

  • Because it's true.

  • It could be, you evaluate once, maybe it looks fine.

  • You never went to the wrong place.

  • You never fall off the cliff.

  • You're totally fine.

  • You evaluate 100 times, one of them will go off the cliff.

  • Because it was going to be one situation [INAUDIBLE] as well.

  • AUDIENCE: Also, do you mind clarifying

  • the three types of losses, what they correspond to?

  • SERGIO GUADARRAMA: So basically here,

  • the actor loss here corresponds to this policy,

  • that is acting in the environment.

  • Like, I need to make a decision about which action to take.

  • So we have a model which is basically saying,

  • which action am I going to take right now?

  • I'm going to move the paddle to the left or to the right?

  • So that will be your actor.

  • And we have a loss to train that model.

  • Then the critic loss is slightly different.

  • It's going to say, OK, if I'm in this situation

  • and I were to perform this action, how good will that be?

  • So I can decide should I take right, or should I take left?

  • So it's trying to give me a sense,

  • is this action good in this state?

  • And then basically, that's what we call the critic.

  • And then usually, the critic is used to train the actor.

  • So the actor will say, oh, I'm going to go to the right.

  • And the critic will say, oh, you go to the right, that's

  • really bad.

  • Because I know-- I give you a score, a negative score.

  • So you should go to the left.

  • But then the critic will learn, basically,

  • by seeing these rewards that we observe during training.

  • Then that gives us basically this [? better ?]

  • reward that the critic can learn from.

  • So the critic is basically regressing to those values.

  • So that's the loss for the critic.

  • And in this case, this alpha loss

  • is basically how much exploration, exploitation I

  • should do.

  • It's like, how much entropy do I need

  • to add to my model in the actor?

  • And usually, you want to have quite a bit

  • at the beginning of learning.

  • And then when you have a really good model,

  • you don't want to explore that much.

  • So this alpha loss is basically trying

  • to modulate how much entropy do I want to add in my model.
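
For readers who want the underlying formulas: the actor/critic/alpha naming matches the Soft Actor-Critic family, so, as an assumption about which agent is being shown, a sketch of the three losses in standard notation is:

    J_Q(\theta) = \mathbb{E}\Big[\tfrac{1}{2}\big(Q_\theta(s_t,a_t) - \big(r_t + \gamma\,[\bar{Q}(s_{t+1},a_{t+1}) - \alpha \log \pi_\phi(a_{t+1}\mid s_{t+1})]\big)\big)^2\Big]
    J_\pi(\phi) = \mathbb{E}\big[\alpha \log \pi_\phi(a_t\mid s_t) - Q_\theta(s_t,a_t)\big]
    J(\alpha)   = \mathbb{E}\big[-\alpha\,\big(\log \pi_\phi(a_t\mid s_t) + \bar{\mathcal{H}}\big)\big]

Here \bar{Q} is a slowly updated target critic and \bar{\mathcal{H}} is the target entropy; the critic loss is the regression to observed rewards described above, the actor loss follows the critic, and the alpha loss tunes how much entropy (exploration) the actor keeps.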

  • AUDIENCE: So I have often seen the entropy

  • going up during training.

  • But why is the actor loss in your example

  • also constantly going up during training?

  • SERGIO GUADARRAMA: In the actor loss?

  • AUDIENCE: The actor loss.

  • OSCAR RAMIREZ: Yeah.

  • So basically, what happened is, as I mentioned,

  • the actor loss is trained based on the critic also.

  • So basically, the actor is trying

  • to predict which actions should I make?

  • And the critic is trying to criticize, this is good,

  • this is bad.

  • So the critic is also moving.

  • So as the critic learns and gets better at scoring

  • whether this is a good action or not, the actor

  • needs to adapt to that.

  • So you can also think of this

  • as like a game going on a little bit.

  • You know, it's not exactly a game,

  • because they don't compete against each other.

  • But it's like a moving target.

  • And sometimes, the better the critic,

  • the less the actor needs to move around.

  • Usually it's stabilized.

  • The actor loss tends to stabilize way more

  • than the critic loss.

  • The critic loss I have seen in other cases--

  • this one is very stable.

  • But in many other cases, the critic loss

  • goes up and down much more substantially.

  • And going back to the question that you asked before about,

  • how do we know we're doing well?

  • Because what I told you so far is like,

  • there are all these losses that don't correlate.

  • Until we evaluate, we actually don't

  • know how well we are doing.

  • And even more profound is, if you

  • look at the graph on the left, there are actually

  • two graphs, the same algorithm trying to solve the same task--

  • the orange and the blue.

  • So higher is better-- with a higher return like this,

  • you are getting better performance--

  • and the orange one is actually statistically much better

  • than the blue one.

  • But the only difference between these two runs

  • are the random seeds.

  • Everything else is the same.

  • It's the same code, the same task.

  • Everything is the same.

  • The only thing that changed is the random seed.

  • It's basically how the model was initialized.

  • AUDIENCE: The random seed for the training,

  • or the random seed for the evaluation?

  • OSCAR RAMIREZ: The random seed for the training.

  • Yeah.

  • And then for the evaluation, we will usually run probably--

  • I don't remember-- probably 100 different random seeds

  • every time that you're evaluating here;

  • you would run 100.

  • So to tackle this, what we did is, this work with Stephanie,

  • we were like, can we actually measure

  • how reliable an algorithm is?

  • Because RL algorithms are not very reliable,

  • and it's really hard to compare one algorithm

  • to another, one task to another, and all those things.

  • So we basically did a lot of work.

  • And we have a paper and the code available to basically measure

  • these things.

  • Like, can I statistically measure, is

  • this algorithm better than this one?

  • And not only is it better, is it reliable?

  • Because if I train 10 times and I get 10 different answers,

  • maybe one of them is good.

  • But it's not very reliable.

  • I cannot apply to a real problem,

  • because every time I train, I get a very different answer.

  • So basically, the broader these curves are, the less reliable

  • it is, because every time I train I will get a different result--

  • I think this one we trained 30 different times--

  • and then you see some algorithms will have broader bands,

  • and some others will have narrow bands.

  • So the algorithms that have narrow bands are more reliable.

  • So we have ways to measure those, different metrics.

  • AUDIENCE: But don't you only care about the final point?

  • Why would you care about the intermediate points?

  • SERGIO GUADARRAMA: You care about both,

  • because let's think about it like, for example,

  • if you cannot reliably get the final point, it's not good.

  • If one algorithm, say--

  • we have some algorithms that do that.

  • They're not here, because they are so bad.

  • Like, only one run in 100 will get a really high number.

  • You train 100 times, one of them will be really good,

  • 99 will be really bad.

  • So the question is, which algorithm

  • do you want to use for your model?

  • One that 1 in 100 times you run will give you a good answer,

  • and it would be really good?

  • Or one which is maybe not as good,

  • but consistently will give me maybe 90% of the other one?

  • So basically, we provide different metrics

  • so you can measure all those different things.

  • But be mindful of what you choose.

  • The final score is not the only thing

  • that you care about, usually, when comparing algorithms.

  • If you just want a policy, like you just

  • want to solve this problem, yeah,

  • the final score is the only thing you care about.

  • But if we want to compare algorithms, I want to compare,

  • can I apply this algorithm to a new task?

  • If I need to run it 100 times every time I change the task,

  • it's not going to be a very good, very reliable algorithm.

  • OK.

  • I think we're back to Oscar.

  • OSCAR RAMIREZ: Cool.

  • So now that we saw all the problems,

  • let's see what we actually do in TF-Agents,

  • to try and address them and make it possible to play

  • with these things.

  • So to look at a bigger picture of the components

  • that we have within TF-Agents, we

  • have a very strict separation of how

  • we do our data collection versus how we do our training.

  • And this has been mostly out of necessity,

  • where we need to be able to do data

  • collection in a whole bunch of different types

  • of environments, be it in some production system,

  • or on actual real robots, or in simulations.

  • And so we need to be able to somehow deploy

  • these policies that were being trained by these agents,

  • interact with this environment, and then store all this data so

  • that we can then sample it for training.

  • And so we started looking first at what

  • do the environments actually look like.

  • And if you look at RL and a lot of the research,

  • there is OpenAI Gym and a lot of other environments

  • available through that.

  • And so for TF-Agents, we make all these available and easy

  • to use within the library.

  • This is just a sample of the environments

  • that are available.

  • And so defining the environments,

  • we have this API, where we can define the environment.

  • Let's for example think, what happens

  • if we want to define Breakout?

  • The first thing that you need to do

  • is define what your observations and actions

  • are going to look like.

  • This comes a little bit back from when

  • we started when we were still in TF 1,

  • and we really needed this information

  • for building the computation graph.

  • But it's still very useful today.

  • And so these specs, they're basically

  • nested structures of TensorFlow specs

  • that fully define the shapes and types of what

  • the observations will look like and what

  • the actions will look like.

  • And so we think, specifically for Breakout,

  • maybe the observation will be the image of the game screen.

  • And the actions will probably be moving the paddle left,

  • moving it right, and maybe firing, so

  • that you can actually launch the ball.

  • So once you've defined what your data

  • is going to look like, there are two main methods

  • in environments that you have to define as a user--

  • how the environment gets reset, and how the environment gets

  • stepped.

  • And so a reset will basically initialize

  • the state of the environment and give you

  • the initial observation.

  • And when you're stepping, you'll receive some action.

  • If the state of the environment is

  • that we reached the final state, it will automatically

  • reset the environment.

  • Otherwise, it will use that action to transition

  • from your current state to a next state.

  • And this will give you the next state's observation

  • and some reward.

  • And we encapsulate this into a time step that includes

  • that kind of information.
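
As a rough illustration of that API (a minimal sketch only; the toy state, specs, and reward below are invented, and a real Breakout environment would come from the Atari/Gym suites):

    # A minimal sketch of a custom environment in the TF-Agents style.
    import numpy as np
    from tf_agents.environments import py_environment
    from tf_agents.specs import array_spec
    from tf_agents.trajectories import time_step as ts


    class MyGameEnv(py_environment.PyEnvironment):

      def __init__(self):
        # Actions: 0 = left, 1 = right, 2 = fire (hypothetical encoding).
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=2, name='action')
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(4,), dtype=np.float32, minimum=0.0, name='observation')
        self._state = np.zeros(4, dtype=np.float32)
        self._episode_ended = False

      def action_spec(self):
        return self._action_spec

      def observation_spec(self):
        return self._observation_spec

      def _reset(self):
        # Initialize the state and return the first observation.
        self._state = np.zeros(4, dtype=np.float32)
        self._episode_ended = False
        return ts.restart(self._state)

      def _step(self, action):
        if self._episode_ended:
          # The previous step ended the episode, so start a new one.
          return self.reset()
        # Toy dynamics: nudge the state and hand out a small reward.
        self._state[action % 4] += 1.0
        reward = float(self._state.sum())
        if self._state.sum() > 10:
          self._episode_ended = True
          return ts.termination(self._state, reward)
        return ts.transition(self._state, reward=reward, discount=0.99)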

  • And so if we're wanting to play Breakout,

  • we would create an instance of this environment.

  • We'll get some policy, either scripted or from some agent

  • that we're training.

  • And then we would simply iterate to try and figure out,

  • all right, how well are we doing over an episode?

  • This is basically a simplification

  • of what the code would look like if we were trying

  • to evaluate how good a specific policy is on some environment.
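
A sketch of that simplified evaluation loop, assuming `env` is a TF-Agents-style environment (for example MyGameEnv above) and `policy` exposes the usual `action` method:

    # Average the undiscounted return of a policy over a few episodes.
    def compute_avg_return(env, policy, num_episodes=10):
      total_return = 0.0
      for _ in range(num_episodes):
        time_step = env.reset()
        episode_return = 0.0
        while not time_step.is_last():
          action_step = policy.action(time_step)
          time_step = env.step(action_step.action)
          episode_return += time_step.reward
        total_return += episode_return
      return total_return / num_episodes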

  • In order to actually scale and train this,

  • it means that we actually have to be collecting a lot of data

  • to be able to train on these environments

  • and with these methods.

  • And so we provide the tooling to be able to parallelize this.

  • And so you can create multiple instances of this environment

  • and collect data in a batch setting, where

  • we have this TensorFlow wrapper around the environment that

  • will internally use NumPy functions to interact

  • with the Python environment, and will then

  • batch all of these instances and give us batched time

  • steps whenever we do the reset.

  • And then we can use the policy to evaluate and generate

  • actions for every single instance of this environment

  • at the same time.

  • And so normally when training, we'll

  • deploy several jobs that are doing collection

  • in a bunch of environments at the same time.
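
A sketch of that batched setup; the Gym id 'Breakout-v4' and the number of parallel environments are assumptions for illustration:

    from tf_agents.environments import parallel_py_environment
    from tf_agents.environments import suite_gym
    from tf_agents.environments import tf_py_environment

    NUM_PARALLEL = 4
    constructors = [lambda: suite_gym.load('Breakout-v4')] * NUM_PARALLEL
    tf_env = tf_py_environment.TFPyEnvironment(
        parallel_py_environment.ParallelPyEnvironment(constructors))

    # reset() now returns a batched TimeStep, and a policy produces one
    # action per environment instance in the batch.
    time_step = tf_env.reset()
    print(time_step.observation.shape)  # (NUM_PARALLEL, ...observation shape)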

  • And so once we know how to interact with the environment,

  • you can think of the driver and the observer.

  • These are basically like a for loop.

  • There's an example down the line.

  • But all of that data will be collected somewhere.

  • And in order to do training, what we do

  • is we rely on the data set APIs to be

  • able to sample experience out of the data sets

  • that we're collecting.

  • And the agent will be consuming this experience

  • and will be training the model that it has.

  • In most situations, it's a neural network.

  • In some of the algorithms, it's not even a neural network,

  • in examples like bandits.

  • And so we're trying to train this learnable policy based

  • purely on the experience, that is, mostly the observations

  • that we've done in the past.

  • And what this policy needs to do is,

  • it's a function that maps from some form of an observation

  • to an action.

  • And that's what we're trying to train in order

  • to maximize our long-term rewards over some episode.

  • And so how are these policies built?

  • Well, first we'll have to define some form of network to back it

  • or to generate the model.

  • In this case, we inherit from the Keras networks

  • and add a couple of utility things,

  • especially to be able to generate

  • copies of these networks.

  • And here we'll basically define, all right, we'll

  • have a sequential model with some conv layers, some fully

  • connected layers.

  • And then if this was, for example, for DQN,

  • we would have a last layer that would give us a predicted Q

  • value, which is basically predicting how good is

  • this action at a given state, and would tell us

  • what probabilities we should be sampling the different kinds

  • of actions that we have.

  • And then within the call method, we'll

  • be taking some observation.

  • We'll iterate over our layers and generate some predictions

  • that we want to use to generate actions.
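
A hedged sketch of that kind of network, using the prebuilt QNetwork helper rather than a hand-written subclass; the layer sizes and the preprocessing cast are illustrative choices, not the exact network from the talk:

    import tensorflow as tf
    from tf_agents.networks import q_network

    # conv_layer_params are (filters, kernel_size, stride) triples; the cast
    # is useful when observations are uint8 images, as in Atari games.
    q_net = q_network.QNetwork(
        tf_env.observation_spec(),
        tf_env.action_spec(),
        preprocessing_layers=tf.keras.layers.Lambda(
            lambda x: tf.cast(x, tf.float32) / 255.0),
        conv_layer_params=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
        fc_layer_params=(512,))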

  • And then we have this concept of a policy.

  • And the policy, what it will do is, it will know,

  • given whatever algorithm we're trying

  • to train, the type of network that you're training

  • might be different.

  • And so in order to be able to generalize

  • across the different algorithms or agents

  • that we're implementing, the concept

  • of the policy will know, given some set of networks,

  • how to actually use these to take observations and generate

  • actions.

  • And normally, the way we do this is

  • that we have a distribution method that

  • will take this time step and maybe some policy state--

  • whatever you're training, some recurrent models, for example--

  • and we'll be able to apply this network

  • and then know how to use the output of the network

  • in order to generate either some form of distribution--

  • in some agents, this might be a deterministic distribution--

  • that we can then sample from.

  • And then when doing data collection,

  • we might be sampling from this distribution.

  • We might add some randomness to it.

  • When we're doing evaluations, we'd

  • be doing a greedy version of this policy, where we'll

  • take the mode of this distribution

  • in order to try to exploit the knowledge that we've gathered,

  • and try to maximize our return over the episodes

  • when evaluating.
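
In code, that distinction usually looks something like the following (assuming `agent` is a TF-Agents agent as constructed further below and `time_step` comes from the environment):

    # Collection: the collect_policy typically keeps some exploration/sampling.
    collect_step = agent.collect_policy.action(time_step)

    # Evaluation: agent.policy is usually the greedy/exploit version.
    eval_step = agent.policy.action(time_step)

    # The underlying distribution is also exposed if you need it.
    dist_step = agent.policy.distribution(time_step)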

  • And so one of the big things with 2.0

  • is that we can now rely on saved models

  • to export all these policies.

  • And this made it a lot easier to generalize

  • and be able to say, oh, hey, now it doesn't matter

  • what agent you use to train.

  • It doesn't matter how you generated your network.

  • You just have the saved model that you can call action on.

  • And you can deploy it on to your robots, production, wherever,

  • and collect data for training, for example, or for serving

  • the trained model.

  • And so within the saved model, we

  • generate all these concrete functions,

  • and save and expose an action method,

  • getting an initial state--

  • again, for the case where we have recurrent models.

  • And we also get the training step,

  • which can be used for annotating the data that we're collecting.
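
A sketch of that export path with the PolicySaver utility; 'policy_dir' is an arbitrary example path, and `agent` is assumed to be an already-built TF-Agents agent:

    import tensorflow as tf
    from tf_agents.policies import policy_saver

    saver = policy_saver.PolicySaver(agent.policy)
    saver.save('policy_dir')

    # Later, e.g. in a collection job, reload without knowing the agent type.
    saved_policy = tf.saved_model.load('policy_dir')
    policy_state = saved_policy.get_initial_state(batch_size=1)
    action_step = saved_policy.action(time_step, policy_state)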

  • And right now, the one thing that we're still working on,

  • or that we need to work on, is that we

  • rely on TensorFlow Probability for a lot of the distribution

  • stuff that we use.

  • But this is not part of core TensorFlow.

  • And so saved models can't generate distributions easily.

  • And so we need to work on that a little bit.

  • The other thing that we do is that we

  • generate different versions of the saved model.

  • Depending on whether this policy will

  • be used for data collection versus for evaluation,

  • it'll have baked in whatever exploration strategy

  • we have within the saved model.

  • And right now, I'm working on making it so that we can easily

  • load checkpoints into the saved model and update the variables.

  • Because for a lot of these methods,

  • when we're generating the saved models,

  • we have to do this very frequently.

  • But the saved model, the computation graph

  • that it needs to generate, it's the same every step.

  • And so right now we're saving a lot of extra stuff

  • that we don't need to, and so just being

  • able to update it on the fly--

  • but overall, this is much easier than what we had to do in TF 1,

  • where we were stashing placeholders in collections,

  • and then being able to rewire how we were feeding data

  • into the saved models.

  • AUDIENCE: So one question about--

  • you talked about the distribution part in the saved model.

  • So if the function you fit into the saved model

  • is already a distribution function, then

  • it should be able to support--

  • like, you can dump--

  • OSCAR RAMIREZ: So we can have the distributions within it.

  • But we can't easily look at those distributions

  • and modify them when we deploy it.

  • Like, the return of a saved model function cannot be

  • a distribution object.

  • It can only be the output of it.

  • SERGIO GUADARRAMA: It can only be a tensor, basically.

  • The only inputs and outputs that the concrete functions

  • take are tensors.

  • It cannot be an actual distribution, not yet.

  • Because the other thing, sometimes we

  • need to do sampling logic.

  • We need to call functions that belong to the distribution

  • object.

  • AUDIENCE: I see.

  • SERGIO GUADARRAMA: So we do some tricks in the replay buffer

  • and everything, basically, where we store the information

  • that we need to reconstruct the distribution back.

  • I know this object is going to be a categorical distribution,

  • and because I know that then I can basically

  • get the parameters of the categorical distribution,

  • rebuild the object again with these parameters.

  • And now I can sample, I can do all these other things

  • from the distribution.

  • Through the saved model, it's still tricky.

  • I mean, we can still save that information.

  • But it's not very clear how much information

  • should be part of the saved models,

  • or it's part of us basically monkey patching the thing

  • to basically get what we need.

  • OSCAR RAMIREZ: And the other problem with it

  • is that, as we export all these different saved models to do

  • data collection or evaluation, we

  • want to be able to be general to what agent trained this,

  • what kind of policy it really is, and what kind of network

  • is backing it.

  • And so then trying to stash all that information in there

  • can be tricky as well to generalize over.

  • And so if we come back full circle now, we have all these saved models,

  • and all these are basically being used for data collection.

  • And so collecting experience, basically, we'll

  • have, again, some environment.

  • Now we have an instance of this replay buffer,

  • where we'll be putting all this data that we're collecting on.

  • And we have this concept of a driver that will basically

  • utilize some policy.

  • This could be either directly from the agent,

  • or it could be a saved model that's

  • been loaded when we're doing it on a distributed fashion.

  • And we define this concept of an observer, which will--

  • as the driver is evaluating this policy with the environment,

  • every observer that's passed to the driver

  • will be able to take a look at the trajectory that

  • was generated at that time step and use it to do whatever.

  • And so in this case, we're adding it to the replay buffer.

  • If we're doing evaluation, we would be computing some metrics

  • based on the trajectories that we're observing, for example.

  • And so once you have that, you can actually

  • just run the driver and do the data collection.
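
A sketch of that collection setup, assuming `tf_env` is a (batched) TF environment and `agent` is the agent constructed a bit further below; the buffer size and step count are illustrative:

    from tf_agents.drivers import dynamic_step_driver
    from tf_agents.replay_buffers import tf_uniform_replay_buffer

    replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
        data_spec=agent.collect_data_spec,
        batch_size=tf_env.batch_size,
        max_length=100000)

    collect_driver = dynamic_step_driver.DynamicStepDriver(
        tf_env,
        agent.collect_policy,
        observers=[replay_buffer.add_batch],  # every trajectory goes to the buffer
        num_steps=1)

    # Run the driver to collect one (batched) environment step per call.
    time_step, policy_state = collect_driver.run()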

  • And so if we look at the agents, we

  • have a whole bunch of agents that

  • are readily available in the open-source setup.

  • All of these have a whole bunch of tests, both quality

  • and speed regression tests, as well.

  • And we've been fairly selective to make sure that we pick

  • state-of-the-art agents or methods within RL that have

  • proven to be relevant over longer periods of time.

  • Because maintaining these agents is a lot of effort,

  • and so we have limited manpower to actually maintain these.

  • So we try to be conservative on what we expose publicly.

  • And so looking at how agents are defined in their API,

  • the main things that we want to do with an agent

  • is be able to access different kinds of policies

  • that we'll be using, and then being

  • able to train given some experience.

  • And so we have a collection policy

  • that you would use to gather all the experience that you

  • want to train on.

  • We have a train method that you feed in experience,

  • and you actually get some losses out,

  • and that will do the updates to the model.

  • And then you have the actual policy

  • that you want to use to actually exploit things.

  • In most agents, this ends up being a greedy policy,

  • like I mentioned, where in the distribution method

  • we would just take the mode to actually get the best

  • action that we can.

  • And so putting it together with a network,

  • we instantiate some form of network that the agent expects.

  • We give that and some optimizer.

  • And there's a whole bunch of other parameters for the agent.

  • And then from the replay buffer, we can generate a data set.

  • In this case, for DQN, we need to train with transitions.

  • So we need like a time step, an action, and then

  • time step that happened afterwards.

  • And so we have this num_steps parameter equal to 2.

  • And then we simply sample the data set and do some training.
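
A sketch of that agent/dataset wiring for DQN, reusing the hypothetical `q_net`, `tf_env`, and `replay_buffer` from the earlier sketches; the optimizer and batch size are illustrative:

    import tensorflow as tf
    from tf_agents.agents.dqn import dqn_agent

    agent = dqn_agent.DqnAgent(
        tf_env.time_step_spec(),
        tf_env.action_spec(),
        q_network=q_net,
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))
    agent.initialize()

    # DQN trains on transitions, hence num_steps=2 (time step, action, next step).
    dataset = replay_buffer.as_dataset(
        num_parallel_calls=3, sample_batch_size=64, num_steps=2).prefetch(3)
    iterator = iter(dataset)

    experience, _ = next(iterator)
    loss_info = agent.train(experience)
    print(loss_info.loss)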

  • And yeah.

  • And so normally, if you want to do this sequentially, where

  • you're actually doing some collection and some training,

  • the way that it would look is that you

  • have the same components, but now we

  • alternate between collecting some data with the driver

  • and the environment, and training on sampling

  • the data that we've collected.

  • So this can sometimes have a lot of different challenges

  • where this driver is actually executing a policy

  • and interacting with a Python environment outside

  • of the TensorFlow context.

  • And so a lot of the eager utilities

  • have come in really, really handy for doing

  • a lot of these things.

  • And so mapping a lot of these APIs back into the overview,

  • if we start with the replay buffer and go clockwise,

  • we'll have some replay buffer that we

  • can sample through data sets.

  • We'll have the concept of an agent,

  • for example DqnAgent, that we can train based on this data.

  • This is training some form of network that were defined.

  • And the network is being used by the policies

  • that the agents can create.

  • We can then deploy these, either through saved models

  • or in the same job, and utilize the drivers to interact

  • with the environment, and collect experience

  • through these observers back into the replay buffer.

  • And then we can iterate between doing data collection

  • and training.
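
Put together, the alternation the speakers describe is roughly a loop like this (reusing the earlier sketches; the iteration count is arbitrary):

    num_iterations = 10000
    for _ in range(num_iterations):
      collect_driver.run()            # collect a little data
      experience, _ = next(iterator)  # sample from the replay buffer
      train_loss = agent.train(experience).loss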

  • And then recently, we had a lot of help

  • with getting things to work with TPUs, and accelerators,

  • and distribution strategies.

  • And so the biggest thing here is that, in order

  • to keep all these accelerators actually busy,

  • we really need to scale up the data collection rate.

  • And so depending on the environments--

  • for example, in some cases in the robotics use cases,

  • you might be able to get one or two time steps a second

  • of data collection.

  • And so then you need a couple of thousand jobs just

  • to do enough data collection to be able to do the training.

  • In some other scenarios, you might be collecting data

  • based on user interactions, and then you

  • might only get one sample per user per day.

  • And so then you have to be able to scale that up.

  • And then on the distributed side,

  • all the data that's being collected

  • will be captured into some replay buffer.

  • And then we can just use distribution strategies

  • to be able to sample that and pull it in, and then

  • distribute it across the GPUs or TPUs to do all the training.
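
The general TF 2 pattern here, as a hedged sketch for a multi-GPU machine, is to build the networks and the agent inside a distribution strategy scope so their variables are placed and mirrored by the strategy; `build_q_network` and `build_dqn_agent` are hypothetical helpers standing in for the earlier snippets:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
      q_net = build_q_network()       # hypothetical helper, e.g. the QNetwork above
      agent = build_dqn_agent(q_net)  # hypothetical helper, e.g. the DqnAgent above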

  • And then I'll give it to Sergio for a quick intro into bandits.

  • SERGIO GUADARRAMA: So as we have been discussing,

  • RL can be challenging in many cases.

  • So we're going to cover this subset of RL,

  • what is called multi-armed bandits;

  • we will go over it a little bit.

  • But this simplifies some of the assumptions,

  • and it can be applied to a [INAUDIBLE] set of problems.

  • But they are much easier to train, much,

  • much easier to understand.

  • So I want to cover this, because for many people who

  • are new to RL, I recommend them to start with bandits first.

  • And then if they don't work still for your problem,

  • then you go and look into a full RL algorithm.

  • And basically, the main difference

  • between multi-armed bandits and RL is basically,

  • here you make a decision every time,

  • but it's like every time you make a decision,

  • the game starts again.

  • So one action doesn't influence the others.

  • So basically, there's no such thing

  • as long-term consequences.

  • So you can make a decision every single time, and that will not

  • influence the state of the environment in the future,

  • which means a lot of things you can assume

  • are simplified in your models.

  • And this one, basically, you don't

  • need to worry about what actions did I take in the past,

  • how do I do credit assignment, because now it's very clear.

  • If I make this action and I get some reward,

  • it's because of this action, because there's

  • no more sequential [? patterns ?] anymore.

  • And also, here you don't need to plan ahead.

  • So basically, I don't need to think

  • about what's going to happen after I make

  • this action because it's going to have

  • some consequences later.

  • In the bandits case, we assume all the things are independent,

  • basically.

  • We assume every time you make an action,

  • you can start again playing the game from scratch

  • every single time.

  • This used to be done more commonly with A/B testing,

  • for people who know what A/B testing does.

  • It's like, imagine you have four different flavors of your, I

  • don't know, site, or problem, or four different options

  • you can offer to the user.

  • Which one is the best?

  • You offer all of them to different users,

  • and then you compute which one is the best.

  • And then after you figure out which one is the best,

  • then you serve that option to everyone.

  • So basically, what happens during the time

  • that you're offering these four options to everyone,

  • some people are getting not the optimal option, basically.

  • During the time you are exploring, figuring out

  • which is the best option, during that time some of the people

  • are not getting the best possible answer.

  • So that is called regret--

  • how much better I could have done

  • that I didn't, because I didn't give you the best

  • answer from the beginning.

  • So with multi-armed bandits, what it tries to do

  • is, as you go, adapt how much exploration I need to do

  • based on how confident I am that my model is good.

  • So basically, it will start the same thing as A/B testing.

  • At the beginning, it will give a random answer to every user.

  • But as soon as some users say, oh, this is better, I like it,

  • it will start shifting and say, OK, I

  • should probably go to that option everybody seems

  • to be liking.

  • So as soon as you start to figure it out--

  • you are very confident your model is getting better,

  • then you basically start shifting and maybe serving

  • everyone the same answer.

  • So basically, the amount of regret,

  • how much time you have given the wrong answer, decreases faster.

  • So basically, the multi-armed bandit,

  • it tries to estimate how confident I am about my model.

  • When I'm not very confident, I explore.

  • When I become very confident, then I don't explore anymore,

  • I start exploiting.
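
To make the exploration/exploitation trade-off concrete, here is a tiny epsilon-greedy bandit in plain Python (just the bare idea, not the TF-Agents bandits API; the arm probabilities are made up):

    import random

    def epsilon_greedy_bandit(true_probs, num_rounds=1000, epsilon=0.1):
      """Toy epsilon-greedy bandit: true_probs are hidden per-arm reward rates."""
      counts = [0] * len(true_probs)
      values = [0.0] * len(true_probs)  # running average reward per arm
      total_reward = 0.0
      for _ in range(num_rounds):
        if random.random() < epsilon:
          arm = random.randrange(len(true_probs))  # explore
        else:
          arm = max(range(len(true_probs)), key=lambda a: values[a])  # exploit
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
      return total_reward

    # Example: three "movies" with different (hidden) like probabilities.
    print(epsilon_greedy_bandit([0.2, 0.5, 0.7]))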

  • One example that is typically used

  • for understanding multi-armed bandits is recommending movies.

  • You have a bunch of movies I could recommend you.

  • There's some probability that you may like this movie or not.

  • And then I have to figure out which movie to recommend you.

  • And then to make it even more personalized,

  • you can use context.

  • You can use user information.

  • You can use previous things as your context.

  • But the main thing is, if I

  • make a recommendation today,

  • that doesn't influence the recommendation

  • I make tomorrow.

  • And so basically, if I knew this was the probability that you

  • like "Star Wars," I probably should

  • recommend you "Star Wars."

  • What happens is, before I start recommend you things,

  • I don't know what do you like.

  • Only when I start recommending you things and you

  • like some things and don't like other things,

  • then I learn about your taste, and then I

  • can update my model based on that.

  • So here, there are different algorithms in this experiment.

  • Some of them-- here, lower is better.

  • This is regret.

  • It's like, how much did I fail to offer you the optimal solution?

  • Some of them, they're basically very random,

  • and it takes forever, doesn't learn much.

  • Some of them just do this epsilon-greedy thing, really--

  • basically randomly give you something sometimes,

  • and otherwise the best.

  • And then there's other methods that use more fancy algorithms,

  • like Thompson sampling or dropout Thompson sampling,

  • which are more advanced algorithms that basically give you

  • a better trade-off between exploration and exploitation.

  • So for all those things, we have tutorials,

  • we have a page on everything, so you can actually

  • play with all these algorithms and learn.

  • And I usually recommend, try to apply a bandit algorithm

  • to your problem first.

  • Because it makes more assumptions, but if it works,

  • it's better.

  • It's easier to train and easier to use.

  • If it doesn't work, then go back to the RL algorithms.

  • And these are some of them that are available currently

  • within TF-Agents.

  • Some of them I already mentioned.

  • Some of them use neural networks.

  • Some of them are more like linear models.

  • Some of them use upper confidence bounds.

  • So they try to estimate how confident I

  • am about my model and all those things

  • to basically get this exploration/exploitation

  • trade-off right.

  • As I mentioned, you can apply it to many of the recommender

  • systems.

  • You can imagine, I want to make a recommendation,

  • I never know what you like.

  • I try different things, and then based on that,

  • I improve my model.

  • And then this model gets very complicated

  • when you start giving personalized recommendations.

  • And finally, I want to talk a couple of things.

  • Some of them are about roadmaps, like where

  • is TF-Agents going forward.

  • Some of the things we already hit, but for example,

  • adding new algorithms and new agents.

  • We are working on that, for example, bootstrapped

  • DQN, I think, is almost ready to be open-sourced.

  • Before we open-source any of these algorithms, what we do

  • is we verify them.

  • We make sure they are correct, we get the right numbers.

  • And we also add to the continuous testing,

  • so they stay correct over time.

  • Because in the past, it would happen to us

  • also like, oh, we are good, it's ready, we put it out.

  • One week later, it doesn't work anymore.

  • Something changed somewhere in the--

  • who knows-- in our code base, in TensorFlow code base,

  • in TensorFlow Probability.

  • Somewhere, something changed somewhere,

  • and now the performance is not the same.

  • So now we have this continuous testing

  • to make sure they stay working.

  • So we plan to have this leaderboard and pre-trained

  • model releases, and add more distributed support,

  • especially for replay buffers and distributed collection,

  • distributed training.

  • Oscar was mentioning at the beginning,

  • maybe we are thinking in the future of adding other new environments,

  • like Unity or other environments that people are interested in.

  • This is a graph that I think is relevant for people

  • who are like, OK, how much time do you actually

  • spend doing the core algorithm?

  • You can think of this as the blue box.

  • Basically, that's the algorithm itself, not the agent.

  • And I would say probably 25% of total time

  • is devoted to the actual algorithm and all those things.

  • All the other time is spent in other things within the team.

  • The replay buffer is quite a bit time-consuming.

  • TF 2-- when we did the migration from TF 1 to TF 2,

  • it took a really good chunk of our time

  • to make that migration.

  • Right now, our library you can run in both TF 1 and TF 2.

  • So we spent quite a bit of time to make sure that is possible.

  • All the core of the library you can run.

  • Only the binary is different, but the core of the library

  • can run in both TF 1 and TF 2.

  • And usability also, we spent quite a bit of time,

  • like how to refine the APIs.

  • Do I need to change this, how easy is it to use,

  • all those things.

  • And we still have a lot of work to do.

  • So we are not done with that.

  • And tooling.

  • All this testing, all this benchmarking,

  • all the continuous evaluation, all those things, this tooling,

  • we have to build around it to basically make

  • it be successful.

  • And finally, I think, for those of you who

  • didn't get the link at the beginning,

  • you can go to GitHub, tensorflow/agents.

  • You can get the package by pip install.

  • You can start learning about using our Colabs

  • or tutorials with DQN-Cartpole.

  • The Minitaur that we saw at the beginning,

  • you can go and train yourself.

  • And the Colab was really good.

  • And to solve important problems.

  • That's the other part we really care

  • about: making sure we are production quality.

  • The code base, the tests, everything we do--

  • we can deploy these models and everything

  • so you can actually use them to solve important problems.

  • Not only-- we usually use games as an example,

  • because they're easy to understand and easy

  • to play around in.

  • But in many other cases, we really apply it to more real problems.

  • And actually, it's designed with that in mind.

  • We welcome contributions and pull requests.

  • And we try to review new environments, new algorithms,

  • and other contributions to the library as best as we can.

  • [MUSIC PLAYING]
