[MUSIC PLAYING]
JEFF DEAN: I'm excited to be here today to tell you
about how I see deep learning and how
it can be used to solve some of the really challenging problems
that the world is facing.
And I should point out that I'm presenting
the work of many, many different people at Google.
So this is a broad perspective of a lot of the research
that we're doing.
It's not purely my work.
So first, as I'm sure you've all noticed,
machine learning is growing in importance.
There's a lot more emphasis on machine learning research.
There's a lot more uses of machine learning.
This is a graph showing how many Arxiv papers--
Arxiv is a preprint hosting service
for all kinds of different research.
And these are the subcategories of it
that are related to machine learning.
And what you see is that, since 2009, we've actually
been growing the number of papers posted at a really
fast exponential rate, actually faster than the Moore's Law
growth rate of computational power that we got so nicely
used to for 40 years but that has now slowed down.
So we've replaced the nice growth in computing performance
with growth in people generating ideas, which is nice.
And deep learning is this particular form
of machine learning.
It's actually a rebranding in some sense
of a very old set of ideas around creating
artificial neural networks.
These are these collections of simple trainable mathematical
units organized in layers where the higher layers typically
build higher levels of abstraction
based on things that the lower layers are learning.
And you can train these things end to end.
And the algorithms that underlie a lot of the work
that we're doing today actually were
developed 35, 40 years ago.
In fact, my colleague Geoff Hinton
just won the Turing Award this year along with Yann LeCun
and Yoshua Bengio for a lot of the work
that they did over the past 30 or 40 years.
And really the ideas are not new.
But what's changed is that 30 or 40 years ago
we got promising results on toy-ish problems but didn't
have the computational resources to make these approaches work
on real, large-scale problems.
But starting about eight or nine years ago,
we started to have enough computation to really make
these approaches work well.
So think of a neural net as something
that can learn really complicated functions that
map from input to output.
Now that sounds kind of abstract.
You think of functions as like y equals x squared or something.
But really these functions can be very complicated
and can learn from very raw forms of data.
So you can take the pixels of an image
and train a neural net to predict
what is in the image as a categorical label like that's
a leopard.
That's one of my vacation photos.
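To make that concrete, here is a minimal sketch of that kind of learned function in code, pixels in and a category label out. The architecture and sizes here are illustrative assumptions, not the actual production model.

```python
# A neural net as a learned function from raw pixels to a category label.
# Assumes 224x224 RGB inputs and 1,000 possible categories; the layers here
# are illustrative, not the model used in the talk.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1000, activation="softmax"),  # P(label | pixels)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(images, labels)  # images: [N, 224, 224, 3], labels: [N] integer ids
```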
From audio wave forms, you can learn
to predict a transcript of what is being said.
How cold is it outside?
You can learn to take input in one language-- hello,
how are you--
and predict the output being that sentence translated
into another language.
[SPEAKING FRENCH]
You can even do more complicated things
like take the pixels of an image and create a caption that
describes the image.
It's not just category.
It's like a simple sentence.
A cheetah lying on top of a car, which is kind of unusual
anyway.
Your prior for that should be pretty low.
And in the field of computer vision,
we've made great strides thanks to neural nets.
In 2011, in the Stanford ImageNet contest,
which is a contest held every year,
the winning entry did not use neural nets.
That was the last year
the winning entry did not use neural nets.
It got 26% error.
And that won the contest.
And this is not a trivial task.
Humans themselves have about 5% error,
because you have to distinguish
among 1,000 different categories of things,
including, for a picture of a dog, saying which
of 40 breeds of dog it is.
So it's not a completely trivial thing.
And in 2016, for example, the winning entry got 3% error.
So this is just a huge fundamental leap
in computer vision.
You know, computers went from basically not
being able to see in 2011 to now seeing pretty darn well.
And that has huge ramifications for all kinds of things
in the world not just computer science
but like the application of machine learning and computing
to perceiving the world around us.
OK.
So I'm going to frame the rest of this talk
around this: in 2008, the US National
Academy of Engineering published this list of 14
grand engineering challenges for the 21st century.
And they got together a bunch of experts
across lots of different domains.
And they all collectively came up
with this list of 14 things, which
I think you can agree these are actually
pretty challenging problems.
And if we made progress on all of them,
the world would be a healthier place.
We'd have a safer place.
We'd have more scientific discovery.
All these things are important problems.
And so given the limited time, what I'm going to do
is talk about the ones in boldface.
And we have projects in Google Research that are focused
on all the ones listed in red.
But I'm not going to talk about the other ones.
And so that's kind of the tour of the rest of the talk.
We're just going to dive in and off we go.
I think we start with restoring and improving
urban infrastructure.
Right.
The basic structure of cities
was designed quite some time ago.
But there are some changes that we're
on the cusp of that are going to really dramatically change how
we might want to design cities.
And, in particular, autonomous vehicles
are on the verge of commercial practicality.
This is from our Waymo colleagues, part of Alphabet.
They've been doing work in this space for almost a decade.
And the basic problem of an autonomous vehicle
is you have to perceive the world around you
from raw sensory inputs, things like LIDAR,
and cameras, and radar, and other kinds of things.
And you want to build a model of the world and the objects
around you and understand what those objects are.
Is that a pedestrian or a light pole?
Is it a car that's moving?
What is it?
And then you also have to be able to predict what's
going to happen a short time from now,
like where that car is going to be in one second,
and then make a set of decisions about what actions
to take to accomplish your goal: getting from A to B without any trouble.
And it's really thanks to deep learning vision
based algorithms and fusing of all the sensor data
that we can actually build maps of the world
like this that are understandings
of the environment around us and actually
have these things operate in the real world.
This is not some distant far off dream.
Waymo is actually operating about 100 cars
with passengers in the back seat and no safety
drivers in the front seat in the Phoenix, Arizona area.
And so this is a pretty strong sign
that this is pretty close to reality.
Now Arizona is one of the easier self-driving car environments.
It's like it never rains.
It's too hot so there aren't that many pedestrians.
The streets are very wide.
The other drivers are very slow.
Downtown San Francisco is harder,
but this is a sign that it's not that far off.
Obviously, once vision works, it's easier
to build robots that can do things in the world.
If you can't see, it's really hard to do things.
But if you can start to see, you can actually
have practical robotics things that
use computer vision to then make decisions about how
they should act in the world.
So this is a video of a bunch of robots practicing
picking things up, and then dropping them and picking
more things up, and essentially trying to grasp things.
And it turns out that one nice thing about robots
is you can actually collect the sensor data
and pool the experience of many robots,
and then collectively train on their collective experience,
and then get a better model of how
to actually grasp things, and then push that out
to the robots.
And then the next day they can all
practice with a slightly better grasping model,
because unlike a human baby that you plop
on the carpet in your living room,
these robots do get to pool their experience.
OK.
So in 2015, the success rate on a particular grasping task
of grasping objects that a robot has never seen before
was about 65%.
When we use this kind of arm farm--
that's what that thing is called.
I wanted to call it the armpit, but I was overruled.
Basically, by collecting a lot of experience,
we were actually able to get a pretty significant boost
in grasp success rate, up to 78%.
And then with further work on algorithms and more refinement
of the approach, we're now able to get a 96% grasp success
rate.
So this is pretty good progress in three years.
We've gone from failing to pick something up
a third of the time, which makes it very hard to string together
a whole sequence of actions and have robots actually
do things in the real world, to grasping working quite
reliably.
So that's exciting.
We've also been doing a lot of work
on how do we get robots to do things more easily.
Rather than having them practice themselves,
maybe we can demonstrate things to them.
So this is one of our AI residents at work.
They do fantastic machine learning research,
but they also film demonstration videos for these robots.
And what you see here is a simulated robot
trying to emulate from the raw pixels of the video what
it's seeing.
And on the right, you see a few demonstrations of pouring
and the robot using those video clips,
five or 10 seconds of someone pouring something,
and some reinforcement learning based trials to attempt
to learn to pour on its own.
After 15 trials and about 15 minutes of training,
it's able to pour that well, I would
say at the level of a four-year-old,
not an eight-year-old.
But with just 15 minutes of effort, it's able to get
to that level of success, which is a pretty big deal.
OK.
One of the other areas that was in the grand challenges
was advanced health informatics.
I think you saw in the keynote yesterday
the work on lung cancer.
We've also been doing a lot of work
on an eye disease called diabetic retinopathy, which
is the fastest growing cause of blindness in the world.
There's 115 million people in the world with diabetes.
And each of them ideally would be screened every year
to see if they have diabetic retinopathy, which
is a degenerative eye disease that if you catch in time
it's very treatable.
But if you don't catch it in time,
you can suffer full or partial vision loss.
And so it's really important that we
be able to screen everyone that is at risk for this.
And that means regular screening.
This is the image that you get
to see as an ophthalmologist.
And in India, for example, there's
a shortage of more than 100,000 eye doctors
to do the necessary amount of screening of this disease.
And so 45% of patients suffer vision loss
before they're diagnosed, which is tragic,
because it's a completely preventable thing if you
catch it in time.
And basically, the way an ophthalmologist looks at this
is they look at these images and they grade it
on a five point scale, one, two, three, four, or five,
looking for things like these little hemorrhages
that you see on the right hand side.
And it's a little subjective.
So if you ask two ophthalmologists
to grade the same image, they agree on the score, one, two,
three, four, or five, 60% of the time.
And if you ask the same ophthalmologist
to grade the same image a few hours later,
they agree with themselves 65% of the time.
And this is why second opinions are useful in medicine,
because some of these things are actually quite subjective.
And it's actually a big deal, because the difference
between a two and a three is the difference between
"go away and come back in a year" and "we'd better get you
into the clinic next week."
Nonetheless, this is actually a computer vision problem.
And so instead of having a classification of a thousand
general categories of dogs and leopards,
you can actually just have five categories of the five
levels of diabetic retinopathy and train
the model on eye images and an assessment
of what the score should be.
And if you do that, you can actually
get the images labeled by several ophthalmologists, six
or seven, so that you reduce the variance that you already
see between ophthalmologists assessing the same image.
If five of them say it's a two
and two of them say it's a three, it's probably more like a two
than a three.
And if you do that, then you can essentially
get a model that is on par with or slightly better
than the average board-certified ophthalmologist
at doing this task, which is great.
This is work published at the end of 2016
by my colleagues in "JAMA," which is a top medical journal.
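To connect this back to the earlier image classification picture, here is a hedged sketch of that setup: the same recipe, but with five output classes and soft labels that average several graders' opinions. The backbone choice and sizes are assumptions for illustration, not the published model.

```python
# Five-class diabetic retinopathy grading with soft labels averaged over
# several ophthalmologists' grades. Sizes and backbone are illustrative.
import tensorflow as tf

backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights=None, input_shape=(299, 299, 3), pooling="avg")
grader = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(5, activation="softmax"),  # DR grades 1-5
])
# Soft label example: 5 graders say grade 2, 2 say grade 3
# -> [0.0, 5/7, 2/7, 0.0, 0.0]
grader.compile(optimizer="adam", loss="categorical_crossentropy")
# grader.fit(fundus_images, averaged_grade_distributions)
```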
We wanted to do even better though.
So it turns out you can get the images
labeled by retinal specialists, who
have more training in retinal eye disease.
And instead of getting independent assessments,
you get three retinal specialists
in a room for each image.
And you essentially say, OK, you all
have to come up with an adjudicated number.
What number do you agree on for each image?
And if you do that, then you can train
on the output of this consensus of three retinal specialists.
And you actually now have a model
that is on par with retinal specialists, which
is the gold standard of care in this area,
rather than the not as good model
trained on an ophthalmologist's opinion.
And so this is something that we've
seen borne out: where you have really good, high-quality
training data, you can actually
train a model on that and get
the expertise of retinal specialists into the model.
But the other neat thing is you can actually have
completely new discoveries.
So someone new joined the ophthalmology research team,
and as a warm-up exercise to understand
how our tools worked,
Lily Peng, who was on the stage yesterday,
said, oh, why don't you go see if you
can predict age and gender from the retinal image,
just to see if you can get that machine learning
pipeline going?
And ophthalmologists can't predict gender
from an eye image.
They don't know how to do that.
And so Lily thought the AUC that you'd see on this
should be no better than flipping a coin,
which is 0.5.
And the person went away and they
said, OK, I've got it done.
My AUC is 0.7.
And Lily is like, hmm, that's weird.
Go check everything and come back.
And so they came back and they said,
OK, I've made a few improvements.
It's now 0.8.
That got people excited because all of a sudden
we realized you can actually predict
a whole bunch of interesting things from a retinal image.
In particular, you can actually detect
someone's self-reported sex.
And you can predict a whole bunch of other things
like their age, things about their systolic and diastolic
blood pressure, their hemoglobin level.
And it turns out that if you combine those things together,
you can get a prediction of someone's cardiovascular risk
at the same level of accuracy as a much more
invasive blood test, where you have to draw blood, send it off
to the lab, wait 24 hours, and get the lab test back.
Now you can just do that with a retinal image.
So there's real hope that this could be a new thing
that if you go to the doctor you'll get
a picture of your eye taken.
And we'll have a longitudinal history of your eye
and be able to learn new things from it.
So we're pretty excited about that.
A lot of the grand challenges were around understanding
molecules and chemistry better.
One is engineer better medicines.
But this work that I'm going to show you
might apply to some of these other things.
So one of the things quantum chemists want to be able to do
is predict properties of molecules.
You know, will this thing bind to this other thing?
Is it toxic?
What are its quantum properties?
And the normal way they do this is they
have a really computationally expensive simulator.
And you plug in this molecule configuration.
You wait about an hour.
And at the end of that you get the output, which says, OK,
here are the things the simulator told you.
But it's a slow process.
You can't consider as many different molecules
as you might like.
It turns out you can use the simulator
as a teacher for a neural net.
So you can do that.
And then all of a sudden you have a neural net
that can basically learn to do what the simulator can
do but way faster.
And so now you have something that
is about 300,000 times faster.
And you can't distinguish the accuracy
of the output of the neural net versus the simulator.
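A rough sketch of that "simulator as teacher" recipe follows, under the assumption that you can featurize a molecule into a fixed-length vector; the real work uses more sophisticated molecular representations, and slow_simulator and featurize are hypothetical placeholders.

```python
# Train a fast neural-net surrogate on (input, output) pairs produced by the
# slow simulator, then use the surrogate for screening.
import numpy as np
import tensorflow as tf

def build_training_set(molecules, slow_simulator, featurize):
    X = np.stack([featurize(m) for m in molecules])       # fixed-length features
    y = np.array([slow_simulator(m) for m in molecules])  # ~an hour per molecule
    return X, y

def fit_surrogate(X, y):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(X.shape[1],)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(1),                          # predicted property
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=100, verbose=0)
    return model   # milliseconds per molecule instead of an hour
```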
And so that's a completely game changing thing
if you're a quantum chemist.
All of a sudden your tool is sped up by 300,000 times.
And all of a sudden that means you
can do a very different kind of science.
You can say, oh, while I'm going to lunch
I should probably screen 100 million molecules.
And when I come back, I'll have 1,000
that might be interesting.
So that's a pretty interesting trend.
And I think it's one that will play out
in lots and lots of different scientific fields
or engineering fields where you have this really
expensive simulator but you can actually
learn to approximate it with a much cheaper neural net
or machine learning based model and get
a simulator that's much faster.
OK.
Engineer the tools of scientific discovery.
I have a feeling this 14th one was just
kind of a vague catch all thing that the panel of experts that
was convened decided should do.
But it's pretty clear that if machine learning is going
to be a big part of scientific discovery and engineering,
we want good tools to express machine learning algorithms.
And so that's the motivation for why
we created TensorFlow: we wanted to have tools
that we could use to express our own machine learning ideas
and share them with the rest of the world,
have other researchers exchange machine learning ideas,
and put machine learning models into practice in products
and other environments.
And so we released this at the end of 2015
with this Apache 2.0 license.
And basically it has this graph based computational model
that you can then optimize with a bunch of traditional compiler
optimizations and it then can be mapped
onto a variety of different devices.
So you can run the same computation
on CPUs or GPUs or our TPUs that I'll tell you about in
a minute.
Eager mode, which becomes the default in TensorFlow 2.0,
makes this graph implicit rather than explicit.
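As a small illustration of the graph idea (a sketch, not the internals): in eager mode ordinary Python runs op by op, and wrapping a function in tf.function traces it into an explicit graph that can be optimized and placed on CPUs, GPUs, or TPUs.

```python
import tensorflow as tf

@tf.function          # traced into a graph that the runtime can optimize
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([8, 4])
w = tf.random.normal([4, 16])
b = tf.zeros([16])
y = dense_relu(x, w, b)   # the same code runs eagerly if the decorator is removed
```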
And the community seems to have adopted
TensorFlow reasonably well.
And we've been excited by all the different things
that we've seen other people do, both in terms
of contributing to the core TensorFlow system
but also making use of it to do interesting things.
And so it's got some pretty good engagement stats.
50 million downloads for a fairly obscure programming
package is a fair number that seems
like a good mark of traction.
And we've seen people do things.
So I mentioned this in the keynote yesterday.
I like this one.
It's basically a company building fitness trackers
for cows so you can tell which of your 100 dairy cows
is behaving a little strangely today.
There is a research team at Penn State and the International
Institute of Tropical Agriculture in Tanzania
that is building a machine learning model that
can run on device on a phone in the middle of a cassava field
without any network connection to actually detect
does this cassava plant have disease
and how should I treat it.
I think this is a good example of how
we want machine learning to run in lots
and lots of environments.
In lots of places in the world, sometimes
you have connectivity and sometimes you don't.
In a lot of cases you want it to run on device.
And that's really going to be the future.
You're going to have machine learning models running
on tiny microcontrollers, all kinds of things like this.
OK.
I'm going to use the remaining time to take you on a tour
through some researchy projects and then sketch how they might
fit together in the future.
So I believe we want bigger machine learning
models than we have today.
But in order to make that practical,
we want models that are sparsely activated.
So think of a giant model, maybe with 1,000 different pieces.
But you activate 20 or 30 of those pieces for any given
example, rather than the entire set of 1,000 pieces.
We know real organisms have this property
in their neural systems: most of their neural capacity
is not active at any given moment.
That's partly how they're so power efficient.
Right.
So some work we did a couple of years ago at this point
is what we call a sparsely gated mixture of experts layer.
And the essential idea is these pink rectangles here
are normal neural net layers.
But between a couple of neural net layers,
we're going to insert another collection
of tiny little neural nets that we call experts.
And we're going to have a gating network that's
going to learn to activate just a few of those.
It's going to learn which of those experts
is most effective for a particular kind of example.
And the expert might have a lot of parameters.
It might be pretty large matrix of parameters.
And we're going to have a lot of them.
So we have in total eight billion-ish parameters.
But we're going to activate just a couple of the experts
on any given example.
And you can see that when you learn to route things,
you try to learn to use the expert that
is most effective at this particular example.
And when you send it to multiple experts,
that gives you a signal to train the routing network,
the gating network so that it can learn that this expert is
really good when you're talking about language that
is about innovation and researchy things
like you see on the left hand side.
And this center expert is really good at talking
about playing a leading role and central role.
And the one on the right is really good at kind
of "quickly"-style adverb-y things.
And so they actually do develop very different kinds
of expertise.
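Here is a hedged sketch of a sparsely gated mixture-of-experts layer, with toy sizes (the real layers used thousands of experts and extra load-balancing terms not shown). For clarity it runs every expert and masks the outputs; real implementations dispatch only the selected examples to each expert.

```python
import tensorflow as tf

class SparseMoE(tf.keras.layers.Layer):
    """Route each example to its top-k experts and mix their outputs."""

    def __init__(self, num_experts=16, d_model=128, k=2):
        super().__init__()
        self.k = k
        self.experts = [tf.keras.layers.Dense(d_model, activation="relu")
                        for _ in range(num_experts)]
        self.gate = tf.keras.layers.Dense(num_experts)   # gating network

    def call(self, x):                                    # x: [batch, d_model]
        weights = tf.nn.softmax(self.gate(x))             # [batch, num_experts]
        top_w, top_idx = tf.math.top_k(weights, self.k)   # keep only k experts
        out = tf.zeros_like(x)
        for i, expert in enumerate(self.experts):
            # Gate weight for expert i, or 0 if it wasn't selected.
            gate_i = tf.reduce_sum(
                tf.cast(tf.equal(top_idx, i), x.dtype) * top_w,
                axis=-1, keepdims=True)
            out += gate_i * expert(x)
        return out
```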
And the nice thing about this is if you
compare this on a translation task with the bottom row,
you can essentially get a significant improvement
in translation accuracy.
That's the BLEU score there.
A one-BLEU-point improvement is a pretty significant thing.
We really like one-BLEU-point improvements.
And because it has all this extra capacity,
we can actually make the sizes of the pink layers
smaller than they were in the original model.
And so we can actually shrink the amount
of computation used per word by about a factor of two,
so 50% cheaper inference.
And the training time goes way down because we just
have all this extra capacity.
And it's easier to train a model with a lot of parameters.
And so we have about 1/10 the training cost
in terms of GPU days.
OK.
We've also been doing a lot of work
on AutoML, which is this idea behind automating some
of the machine learning tasks that a machine learning
researcher or engineer does.
And the idea behind AutoML is currently
you think about solving a machine learning problem
where you have some data.
You have some computation.
And you have an ML expert sit down.
And they do a bunch of experiments.
And they kind of stir it all together
and run lots of GPU days worth of effort.
And you hopefully get a solution.
So what if we could use
more computation to replace some of the experimentation
that someone with a lot of machine learning experience
would actually do?
And one of the decisions that a machine learning expert makes
is what architecture, what neural network structure
makes sense for this problem.
You know, should I use a 13 layer model or a nine layer
model?
Should it have three by three or five by five filters?
Should it have skip connections or not?
And so if you're willing to say let's try to take this
up a level and do some meta learning,
then we can basically have a model that generates models
and then try those models on the problem we actually care about.
So the basic iteration of meta learning here
is we're going to have a model generating model.
We're going to generate 10 models.
We're going to train each of those models.
And we're going to see how well they each work
on the problem we care about.
And we're going to use the loss or the accuracy of those models
as a reinforcement learning signal for the model generating
model so that we can steer away from models that didn't seem
to work very well and towards models
that seem to work better.
And then we just repeat a lot.
And when we repeat a lot, we essentially
get more and more accurate models over time.
And it works.
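To make that loop concrete, here is a pseudocode-style sketch of the meta-learning iteration just described; all of the helper names (build_model, train_briefly, validation_accuracy, and the controller methods) are hypothetical placeholders, not a real API.

```python
def architecture_search(controller, num_rounds=100, samples_per_round=10):
    """Model-generating model: sample architectures, score them, reinforce."""
    for _ in range(num_rounds):
        archs = [controller.sample_architecture() for _ in range(samples_per_round)]
        rewards = []
        for arch in archs:
            model = build_model(arch)          # layer count, filter sizes, skips...
            train_briefly(model)               # a cheap proxy for full training
            rewards.append(validation_accuracy(model))
        # REINFORCE-style update: move toward architectures that scored well,
        # away from ones that scored poorly.
        controller.update(archs, rewards)
    return controller.best_architecture()
```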
And it produces models that are a little strange looking.
They're a little more unstructured
than a model that a human might have designed.
So here we have all these crazy skip connections.
But they're analogous to some of the ideas
that machine learning researchers themselves
have come up with.
For example, the ResNet architecture
has a more structured style of skip connection.
But the basic idea is you want information
to be able to flow more directly from the input to the output
without going through as many intermediate computational
layers.
And the system seems to have developed
that intuition itself.
And the nice thing is these models actually
work pretty well.
So if you look at this graph, accuracy
is on the y-axis for the ImageNet problem.
And computational cost of the models,
which are represented by dots here, is on the x-axis.
So generally, you see this trend where
if you have a more computationally expensive
model, you generally get higher accuracy.
And each of these black dots here
represents a significant amount of effort
by a bunch of top computer vision or machine
learning researchers, which they then
published, advancing the state of the art at the time.
And so if you apply AutoML to this problem, what you see
is that you actually exceed the frontier of the hand
generated models that the community has come up with.
And you do this both at the high end,
where you care most about accuracy
and don't care as much about computational costs
so you can get a model that's slightly more accurate
with less computational cost.
And at the low end, you can get a model
that's significantly more accurate for a very small
amount of computational cost.
And that, I think, is a pretty interesting result.
It says that we should really let computers and machine
learning researchers work together
to develop the best models for these kinds of problems.
And we've turned this into a product.
So we have Cloud AutoML as a Cloud product.
And you can try that on your own problem.
So if you were maybe a company that
doesn't have a lot of machine learning researchers,
or machine learning engineers yourselves,
you can actually just bring in a bunch of images
and the categories of things you want to predict-- maybe you
have pictures from your assembly line
and you want to predict which part is in each image--
and you can actually get a high quality model for that.
And we've extended this to things more than just vision.
So you can do videos, and language, and translation.
And more recently we've introduced something
that allows you to predict relational data
from other relational data.
You want to predict, say, whether this customer will buy
something given their past orders.
We've also obviously continued research in the AutoML field.
So we've got some work looking at the use of evolution
rather than reinforcement learning for the search,
learning the optimization update rule,
and learning the nonlinearity function rather than just
assuming we should use ReLU
or some other standard activation function.
We've actually got some work on incorporating
both inference latency and the accuracy.
Let's say you want a really good model that has
to run in seven milliseconds.
We can find the most accurate model
that will run in your time budget allowed by using a more
complicated reward function.
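One way to write down such a reward, as a sketch: scale the accuracy by how far the measured latency is from the target. The exact functional form here is an assumption, similar in spirit to published multi-objective architecture-search rewards, not necessarily the one used internally.

```python
def reward(accuracy, latency_ms, target_ms=7.0, exponent=-0.07):
    # At the target latency the reward is just the accuracy; slower models are
    # penalized smoothly, and faster ones get a small bonus.
    return accuracy * (latency_ms / target_ms) ** exponent
```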
We can learn how to augment data so that you can stretch
the amount of labeled data you have in interesting ways,
more effectively than hand-designed data augmentation.
And we can explore lots of architectures
to make this whole search process a bit more efficient.
OK.
But it's clear if we're going to try these approaches,
we're going to need more computational power.
And I think one of the truisms of machine learning
over the last decade or so is more computational
power tends to get better results
when you have enough data.
And so it's really nice that deep learning
is this really broadly useful tool
across so many different problem domains,
because that means you can start to think about specializing
hardware for deep learning but have
it apply to many, many things.
And so there are two properties that deep learning algorithms
tend to have.
One is they're very tolerant of reduced precision.
So if you do calculations to one decimal digit of precision,
that's perfectly fine with most of these algorithms.
You don't need six or seven digits of precision.
And the other thing is that all
these algorithms I've shown you are made up
of a handful of specific operations, things like matrix
multiplies and vector dot products, essentially
dense linear algebra.
So if you can build machines, computers,
that are really good at reduced precision dense linear algebra,
then you can accelerate lots of these machine learning
algorithms quite a lot compared to more general purpose
computers that have general purpose CPUs that
can run all kinds of things or even
GPUs which tend to be somewhat good at this but tend to have,
for example, higher precision than you might want.
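A tiny illustration of the reduced-precision point: the same matrix multiply in float32 and bfloat16 agrees to a couple of significant digits, which is enough for most of these workloads. (Whether the lower precision actually runs faster depends on the hardware.)

```python
import tensorflow as tf

x = tf.random.normal([128, 512])
w = tf.random.normal([512, 256])

y_fp32 = tf.matmul(x, w)
y_bf16 = tf.matmul(tf.cast(x, tf.bfloat16), tf.cast(w, tf.bfloat16))

# Maximum absolute difference between the two results; small relative to the
# scale of the outputs, and tolerable for most training and inference.
print(tf.reduce_max(tf.abs(tf.cast(y_bf16, tf.float32) - y_fp32)))
```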
So we started to think about building
specialized hardware when I did this kind of thought
exercise in 2012.
We were starting to see the initial success
of deep neural nets for speech recognition
and for image recognition and starting
to think about how would we deploy
these in some of our products.
And so there was this scary moment
where we realized that if speech started to work really well,
and at that time we couldn't run it
on device because the devices didn't
have enough computational power, what
if 100 million users started talking to their phones
for three minutes a day?
That's not implausible if speech starts to work a lot better.
And if we were running the speech models on CPUs,
we would need to double the number of computers in Google data
centers, which is slightly terrifying just to launch
one feature in one product.
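As a back-of-the-envelope version of that thought exercise: the 100 million users and three minutes a day come from the talk, while the cost of one CPU core-second per second of audio is purely an assumed number for illustration.

```python
users = 100_000_000
seconds_per_user_per_day = 3 * 60
audio_seconds_per_day = users * seconds_per_user_per_day      # 1.8e10 s of audio/day

core_seconds_per_audio_second = 1.0   # assumed recognition cost on one CPU core
cores_needed = audio_seconds_per_day * core_seconds_per_audio_second / 86_400
print(f"~{cores_needed:,.0f} CPU cores running around the clock")   # ~208,333
```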
And so we started to think about building these specialized
processors for the deep learning algorithms we wanted to run,
and TPU V1, which has been in production use since 2015,
was really the outcome of that thought exercise.
It's used in production on every search query you do,
on every translation you do, for speech processing and image
processing, and AlphaGo used a collection of these.
These are the actual racks of machines that
competed in the AlphaGo match.
You can see the little Go board on the side
commemorating it.
And then we started to tackle the bigger problem of not just
inference, which is we already have a trained model
and you just want to apply it, but how do you actually do
training in an accelerated way.
And so the second version of TPUs
are for training and inference.
And that's one of the TPU devices,
which has four chips on it.
This is TPU V3, which also has four chips on it.
It's got water cooling.
So it's slightly scary to have water in your computers,
but we do.
And then we designed these systems
to be configured together into larger configurations we
call pods.
So this is a TPU V2 pod.
This is a bigger TPU V3 pod with water cooling.
You can actually see one of the racks of this in the machine
learning dome.
And really these things do provide
a lot of computational power.
Individual devices with the four chips
deliver up to 420 teraflops and have a fair amount of memory.
And then the actual pods themselves are
up to 100 petaflops of compute.
This is a pretty substantial amount of compute
and really lets you very quickly try machine
learning research experiments, train very large production
models on large data sets, and these are also
now available through our cloud products.
As of yesterday, I think we announced them to be in beta.
One of the keys to performance here
is that the network interconnect between the chips in the pod
is a super-high-speed 2D mesh with wraparound links.
That's what makes it toroidal.
And that means you can essentially program this thing
as if it's a single computer.
And the software underneath the covers
takes care of distributing the computation appropriately
and can do very fast all-reduce and broadcast operations.
And so, for example, you can use a full TPU V2 pod
to train an ImageNet model in 7.9 minutes versus the same problem
on eight GPUs.
You get 27 times faster training at lower cost.
The V3 pod is actually even substantially larger.
You can train an ImageNet model from scratch
in less than two minutes, training on more than a million images
per second, which is essentially the entire ImageNet
data set every second.
And you can train very large BERT language models,
for example, as I was discussing on stage
in the keynote yesterday, in about 76 minutes
on a fairly large corpus of data, which normally would take days.
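From the TensorFlow side, programming a pod "as if it's a single computer" looks roughly like the sketch below: a distribution strategy wraps model construction, and the runtime shards the batches and handles the all-reduces. Details vary by TensorFlow version (older releases spell the strategy tf.distribute.experimental.TPUStrategy), and the cluster-resolver arguments depend on your environment.

```python
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")  # or a TPU name
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():                      # variables are created per replica
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

# model.fit(imagenet_dataset, epochs=90)    # each batch is split across the cores
```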
And so that really helps make our researchers
and ML production systems more productive
by being able to experiment more quickly.
If you can run an experiment in two minutes, that's
a very different kind of science and engineering
you do than if that same experiment would
take you a day and a half.
Right.
You just think about running more experiments,
trying more things.
And we have lots of models already available.
OK.
So let's take some of the ideas we talked about
and think about how they might fit together.
So I said we want these really large models
but have them be sparsely activated.
I think one of the things we're doing wrong in machine learning
is we tend to train a machine learning model
to do a single thing.
And then we have a different problem.
We tend to train a different model to do that other thing.
And I think really we should be thinking about how can we
train models that do many, many things
and leverage the expertise that they have
in doing many things to then be able to take on a new task
and learn to do that new task more quickly and with less
data.
This is, essentially, multi task learning.
But often multi task learning in practice today
means three or four or five tasks, not
thousands or millions.
I think we really want to be thinking bigger and bolder
about really doing in the limit one model for all of the things
we care about.
And obviously, we're going to try
to train this large model using fancy ML hardware.
OK.
So how might this look?
So I imagine we've trained a model
on a bunch of different tasks.
And it's learned these different components,
which can be sometimes shared across different tasks,
sometimes independent, specialized
for a particular task.
And now a new task comes along.
So with AutoML-style reinforcement learning,
we should be able to use an RL algorithm to find pathways
through this model that actually get us
into a pretty good state for that new task,
because it hopefully has some commonalities with other things
we've already learned.
And then we might have some way to add capacity to the system
so that for a task where we really care about accuracy,
we can add a bit of capacity and start to use that for this task
and have that pathway be more specialized for that task
and therefore hopefully more accurate.
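A very rough, conceptual sketch of what such a sparsely activated multi-task model might look like; everything here is a toy illustration of the idea (shared components plus a per-task route through a few of them), not an existing system.

```python
import tensorflow as tf

class PathwayModel(tf.keras.Model):
    """A pool of shared components; each task activates only a few of them."""

    def __init__(self, num_components=100, d_model=128, active_per_task=5):
        super().__init__()
        self.components = [tf.keras.layers.Dense(d_model, activation="relu")
                           for _ in range(num_components)]
        self.active_per_task = active_per_task
        self.task_routes = {}                       # task id -> component indices

    def add_task(self, task_id, component_indices):
        # In the talk, this choice would come from an RL-style search over
        # pathways; here the route is simply supplied by the caller.
        self.task_routes[task_id] = list(component_indices)[: self.active_per_task]

    def call(self, x, task_id):
        for idx in self.task_routes[task_id]:       # only a few components run
            x = self.components[idx](x)
        return x
```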
And I think that's an interesting direction to go in.
How can we think more about building a system like that
than the current kind of models we have today where
we tend to fully activate the entire model for every example
and tend to have them just for a single task?
OK.
I want to close on how we should be thinking about using machine
learning and all the different places
that we might consider using it.
And I think one of the things that I'm
really proud of as a company is that last year we published
a set of principles by which we think
about how we're going to use machine learning
for different things.
And when we look at using machine learning
in any of our products or settings,
we think carefully about how we're actually
fulfilling these seven principles by using
machine learning in that way.
And I think there's more on the actual principles website
that you can go find, but I think this is really, really
important.
And I'll point out that some of these things
are evolving research areas as well as principles
that we want to apply.
So for example, number two, avoid creating or reinforcing
unfair bias.
And bias in machine learning models
is a very real problem that you get from a variety of sources.
It could be that you have biased training data.
It could be that you're training on real-world data
and the world itself is biased
in ways that we don't want.
And so there is research that we can apply and extend
on how to reduce or eliminate bias
in machine learning models.
And so this is an example of some of the work
we've been doing on bias and fairness.
But what we try to do in our use of ML models
is apply the best known practices
for our actual production use but also
advance the state of the art in understanding bias and fairness
and making it better.
And so with that, in conclusion, deep neural nets and machine
learning are really tackling some of the world's
great challenges I think.
I think we're really making progress in a number of areas.
There's a lot of interesting problems
to tackle and to still work on.
And they're going to affect not just computer science.
Right.
We're affecting many, many aspects of human endeavor
like medicine, science, other kinds of things.
And so I think it's a great responsibility
that we have to make sure that we do these things right
and to continue to push for the state of the art
and apply it to great things.
So thank you very much.
[MUSIC PLAYING]