Alright. Hello everybody.
Hopefully you can hear me well.
Yes?
Yes.
Great!
So, welcome to Course 6.S094.
Deep Learning for Self-Driving Cars.
We will introduce to you the methods of deep learning,
of deep neural networks using the guiding case study of building self-driving cars.
My name is Lex Fridman.
You get to listen to me for a majority of these lectures
and I am part of an amazing team with some brilliant TAs.
Would you say brilliant?
(CHUCKLES)
Dan Brown.
You guys want to stand up?
They're in the front row.
Spencer, William Angell.
Spencer Dodd and all the way in the back.
The smartest and the tallest person I know, Benedict Jenik.
What you see there on the left of the slide is a visualization of one of the two projects, one of the two simulations, games, that we'll get to go through.
We use it as a way to teach you about deep reinforcement learning but also as a way to excite you.
by challenging you to compete against others,
if you wish, for a special prize yet to be announced.
Super secret prize.
So you can reach me and the TAs at deepcars@MIT.edu if you have any questions about the tutorials, about the lecture, about anything at all.
The website cars.mit.edu has the lecture content.
Code tutorials, and, as with today, the lecture slides for today are already up in PDF form.
The slides themselves, if you want to see them, just e-mail me, but they are over a gigabyte in size because they're very heavy in videos, so I'm just posting the PDFs.
And lecture videos will be available a few days after the lectures are given.
So speaking of which there is a camera in the back.
This is being videotaped and recorded but for the most part the camera is just on the speaker.
So you shouldn't have to worry.
If that kind of thing worries you then you could sit on the periphery of the classroom
or maybe I suggest sunglasses and a moustache, fake mustache, would be a good idea.
There is a competition for the game that you see on the left.
I'll describe exactly what's involved
in order to get credit for the course you have to
design a neural network that drives the car just above the speed limit, sixty-five miles an hour.
But if you want to win, you need to go a little faster than that.
So who is this class for?
You may be new to programming,
new to machine learning,
new to robotics,
or you're an expert in those fields but want to go back to the basics.
So what you will learn is an overview of deep reinforcement learning,
of convolutional neural networks,
recurrent neural networks,
and how these methods can help improve each of the components of autonomous driving -
perception, visual perception, localization, mapping, control, planning, and the detection of driver state.
Okay, two projects.
Code named "DeepTraffic" is the first one.
In this particular formulation of it,
there are seven lanes.
It's a top view.
It looks like a game but I assure you it's very serious.
It is the agent in red,
the car in red is being controlled by a neural network and we'll explain
how you can control and design the various aspects, the various parameters of this neural network
and it learns in the browser.
So we're using ConvNetJS,
which is a library written in JavaScript by Andrej Karpathy.
So amazingly we live in a world where you can train in a matter of minutes
a neural network in your browser.
And we'll talk about how to do that.
The reason we did this
is so that there are very few requirements to get you up and running with neural networks.
So in order to complete this project for the course,
you don't need any requirements except to have a Chrome browser.
And to win the competition you don't need anything except the Chrome browser.
The second project code name "DeepTesla"
or "Tesla"
is using data from a Tesla vehicle
of the forward roadway
and using end-to-end learning
taking the image and putting it into a convolutional neural network,
a regressor,
that maps directly to a steering angle.
So all it takes is a single image
and it predicts a steering angle for the car.
We have data for the car itself
and you get to build a neural network
that tries to do better,
tries to steer better or at least as good as the car.
Okay.
Let's get started with the question,
with the thing that we understand so poorly at this time
because it's so shrouded in mystery
but it fascinates many of us.
And that is the question of: "What is intelligence?"
This is from a March 1996 Time magazine.
And the question: "Can machines think?"
is answered below with, "they already do."
So what if anything is special about the human mind?
It's a good question for 1996,
a good question for 2016,
2017 now,
and the future.
And there are two ways to ask that question.
One is the special purpose version.
Can an artificial intelligence system achieve a well-defined,
formally specified, finite set of goals?
And this little diagram is
from the book that got me into artificial intelligence as a bright-eyed high school student:
Artificial Intelligence: A Modern Approach.
This is a beautifully simple diagram of a system.
It exists in an environment.
It has a set of sensors that do the perception.
It takes those sensors in.
It does something magical.
There's a question mark there.
And with a set of effectors it acts in the world, manipulates objects in that world.
And so, special purpose:
under this formulation,
as long as the environment is formally defined,
well defined;
as long as the set of goals is well defined;
as long as the set of actions,
the sensors,
and the way the perception is carried out are well defined,
we have good algorithms,
which we'll talk about,
that can optimize for those goals.
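That classic agent diagram can be sketched as a minimal perceive-decide-act loop. This is only an illustration, not code from the book; the thermostat policy and environment here are made-up toy examples:

```python
def run_agent(policy, env_step, initial_obs, steps=6):
    """Minimal agent-environment loop, in the spirit of the AIMA diagram.

    policy:   maps a percept to an action -- the '?' box in the diagram
    env_step: maps an action to the next percept -- the environment
    """
    obs = initial_obs
    history = []
    for _ in range(steps):
        action = policy(obs)        # sensors -> perception -> decision
        history.append((obs, action))
        obs = env_step(action)      # effectors act in the world
    return history

# Toy example: a thermostat agent with a well-defined goal (keep temp >= 20)
trace = run_agent(
    policy=lambda temp: "heat" if temp < 20 else "off",
    env_step=lambda action: 21 if action == "heat" else 19,
    initial_obs=15,
)
print(trace[:3])  # → [(15, 'heat'), (21, 'off'), (19, 'heat')]
```

Everything here is formally defined — percepts, actions, goal — which is exactly the "special purpose" setting the lecture describes.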
The question is,
if we inch along this path,
will we get closer to the general formulation,
to the general purpose version of what artificial intelligence is?
Can it achieve a poorly defined,
unconstrained set of goals
with an unconstrained, poorly defined set of actions
and unconstrained, poorly defined utility functions and rewards?
This is what human life is about.
This is what we do pretty well most days.
We exist in an undefined world, full of uncertainty.
So, okay.
We can separate tasks into three different categories.
First, formal tasks.
This is the easiest.
It didn't seem so at the birth of artificial intelligence,
but that's in fact true if you think about it.
The easiest is the formal tasks:
playing board games, theorem proving,
all the kinds of mathematical logic problems that can be formally defined.
Then there is the expert tasks.
So this is where a lot of the exciting breakthroughs have been happening
where machine learning methods,
data driven methods,
can help aid or improve on
the performance of our human experts.
This means medical diagnosis, hardware design,
scheduling,
and then there is the thing that we take for granted.
The trivial thing.
The thing that we do so easily every day when we wake up in the morning.
The mundane tasks of everyday speech,
of written language,
of visual perception,
of walking, which, as we'll talk about in today's lecture,
is a fascinatingly difficult task,
and of object manipulation.
So the question that we're asking here,
before we talk about deep learning,
before we talk about the specific methods,
we really want to dig in and try to see what is it about driving,
how difficult is driving.
Is it more like chess, which you see on the left there,
where we can formally define a set of lanes,
a set of actions, and formulate it as a small set of actions: you can change your lane,
you can avoid obstacles.
You can formally define an obstacle.
You can formally define the rules of the road.
Or is there something about natural language,
something similar to everyday conversation about driving
that requires a much higher degree of reasoning,
of communication,
of learning,
of existing in this under-actuated space.
Is it a lot more than just left lane,
right lane,
speed up,
slow down?
So let's look at it as a chess game.
Here's the chess pieces.
What are the sensors we get to work with on an autonomous vehicle?
And we get a lot more in-depth on this
especially with the guest speakers
who built many of these.
There are the range sensors,
radar and lidar.
They give you information about the obstacles in the environment.
They help localize the obstacles in the environment.
There's the visible light camera
and stereo vision, which give you texture information,
which helps you figure out not just where the obstacles are
but what they are,
helps to classify them,
helps to understand their subtle movements.
Then there is the information about the vehicle itself,
about the trajectory and the movement of the vehicle, that comes from the GPS
and IMU sensors.
And there is the rich state of the vehicle itself.
What is it doing?
What are all the individual systems doing
that comes from the CAN network.
And there is one that is less studied
but fascinating to us on the research side: audio.
The sounds of the road
that provide the rich context
of a wet road:
the sound a road makes when it has stopped raining
but is still wet.
The screeching tire
and honking.
These are all fascinating signals as well.
And the focus of the research in our group,
the thing that's very much
under-investigated,
is the internal-facing sensors.
The driver,
sensing the state of the driver.
Where are they looking?
Are they sleepy?
The emotional state.
Are they in the seat at all?
And the same with audio.
That comes from the visual information and the audio information,
and more than that.
Here are the tasks.
If you were to break into modules the tasks
of what it means to build a self-driving vehicle.
First, you want to know where you are.
Where am I?
Localization and mapping.
You want to map the external environment.
Figure out where all the different
obstacles are,
all the entities are,
and use that estimate of the environment
to then figure out where I am,
where the robot is.
Then there is scene understanding.
It's understanding not just the positional aspects
of the external environment and the dynamics of it
but also what those entities are.
Is it a car? Is it a pedestrian?
Is it a bird?
There is movement planning.
Once you have kind of figured out to the best of your abilities
your position and the position of other entities in this world,
it's figuring out a trajectory through that world.
And finally,
once you've figured out how to move about safely
and effectively through the world
it's figuring out what the human that's on board is doing
because, as I will talk about,
the path to a self-driving vehicle,
and hence our focus on Tesla,
may go through semi-autonomous vehicles,
Where the vehicle must not only drive itself
but effectively hand over control
from the car
to the human
and back.
Ok, quick history.
Well, there's a lot of fun stuff from the eighties and nineties, but
the big breakthroughs came in the second DARPA Grand Challenge
with Stanford's Stanley,
which won the competition,
one of five cars that finished.
This was an incredible accomplishment in a desert race.
A fully autonomous vehicle was able to complete the race
in record time.
The DARPA Urban Challenge in 2007
where the task was no longer a race through the desert
but through an urban environment
and CMU's "Boss" with GM won that race,
and a lot of that work went directly into
large, major industry players accepting
and taking on the challenge of building these vehicles.
Google's, now Waymo's, self-driving car.
Tesla with its "Autopilot" system and now "Autopilot 2" system.
Uber with its testing in Pittsburgh.
And there's many other companies
including nuTonomy,
one of the speakers for this course,
that are driving the wonderful streets of Boston.
Ok. So let's take a step back.
If we think about the accomplishments in the DARPA Challenges,
and if you look at the accomplishments of the Google self-driving car,
it essentially boils the world down into a chess game.
It uses incredibly accurate sensors
to build a three dimensional map of the world,
localize itself effectively in that world
and move about that world
in a very well-defined way.
Now, what if driving...
The open question is: if driving is more like a conversation,
like in natural language conversation,
how hard is it to pass the Turing Test?
The Turing Test,
as the popular current formulation is,
can a computer be mistaken for a human being
more than thirty percent of the time?
When a human is having a conversation behind a veil
with either a computer or a human,
can they mistake the other side of that conversation
for a human when it's in fact a computer?
And the way you would build a system
that successfully passes the Turing Test is:
first, the natural language processing part,
to enable it to communicate successfully,
to generate language and interpret language;
then knowledge representation, to represent the state of the conversation
as it unfolds over time.
And the last piece and this is the hard piece,
is the automated reasoning,
is reasoning.
Can we teach machine learning methods to reason?
That is something that will propagate through our discussion
because as I will talk about the various methods,
the various deep learning methods,
neural networks are good at learning from data
but they're not yet, there is no good mechanism for reasoning.
Now, reasoning could be just something
that we tell ourselves we do to feel special,
to feel like we're better than machines.
Reasoning may be simply
something as simple as learning from data.
We just need a larger network.
Or there could be a totally different mechanism required
and we'll talk about the possibilities there.
Yes.
(Inaudible question from one of the attendees)
No, it's very difficult to find these kind of situations in the United States.
So the question was,
for this video, is it in the United States or not?
I believe it's in Tokyo.
So India, like a few European countries, is much more towards the direction
of natural language versus chess.
In the United States, generally speaking, we follow rules more concretely.
The quality of roads is better.
The marking on the roads is better.
So there are fewer requirements there.
(Inaudible question from one of the attendees)
These cars are driving on one side?
I see.
I just- Okay, you're right.
It is because, yeah-
So, but it's certainly not the United States.
I spent quite a bit of time googling,
trying to find this kind of footage in the United States, and it is difficult.
So let's talk about
the recent breakthroughs in machine learning
and what is at the core of those breakthroughs
is neural networks
that have been around for a long time
and I will talk about what has changed.
What are the cool new things
and what hasn't changed
and what are its possibilities.
But first a neuron, crudely,
is a computational building block of the brain.
I know there's a few folks here, neuroscience folks,
this is hardly a model.
It is mostly an inspiration
and so the human neuron
has inspired the artificial neuron
the computational building block of a neural network,
of an artificial neural network.
I have to give you some context.
These neurons,
for both artificial and human brains,
are interconnected.
In the human brain,
there are, I believe, about 10,000 outgoing connections from every neuron
on average, and they're interconnected with each other.
The largest current artificial neural network, as far as I'm aware,
has 10 billion of those connections,
synapses.
Our human brain, by the best estimate that I'm aware of,
has 10,000 times that:
one hundred to one thousand trillion synapses.
Now what is an artificial neuron?
That is the building block of a neural network.
It takes a set of inputs.
It puts a weight on each of those inputs, sums them together,
adds a bias value,
and applies an activation function
that takes that sum plus the bias
and squashes it
to produce a zero-to-one signal.
And this allows a single neuron
to take a few inputs and produce an output,
a classification, for example: a zero or a one.
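As a concrete sketch, that single artificial neuron — weights, bias, sigmoid squashing — is only a few lines of code. The input values here are arbitrary illustrative numbers:

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs, plus a bias, squashed by a sigmoid
    activation into a zero-to-one signal."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid activation

# With zero weights and zero bias, the sigmoid of 0 is exactly 0.5,
# i.e. the neuron is maximally undecided between its two classes
print(neuron([1.0, 0.5], weights=[0.0, 0.0], bias=0.0))  # → 0.5
```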
And then, as we'll talk about, it can
serve as a linear classifier,
so it can draw a line.
It can learn to draw a line, like the one you see here,
between the blue dots and the yellow dots.
And that's exactly what we'll do in the iPython Notebook that I'll talk about
but the basic algorithm is: you initialize the weights
on the inputs and you compute the output,
performing the operation I talked about previously, summing up,
and computing the output.
And if the output does not match the ground truth,
the expected output, the output it should produce,
the weights are adjusted accordingly,
and we'll talk through a little bit of the math of that.
And this process is repeated until the perceptron does not make any more mistakes.
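That update loop — compute the output, compare against the ground truth, adjust the weights, repeat until no mistakes — can be sketched as follows. The toy data, learning rate, and epoch cap are my own illustrative choices, not from the lecture:

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Classic perceptron learning: samples is a list of (inputs, label),
    where label is 0 or 1 and the data is assumed linearly separable."""
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred                # 0 if correct, +1 or -1 if wrong
            if err:
                mistakes += 1
                # Adjust the weights toward the ground truth
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
        if mistakes == 0:                 # repeated until no more mistakes
            break
    return w, b

# Linearly separable toy data (an AND gate): a line can split the classes
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(data)
```

After training, the learned line classifies all four points correctly — the two-color dot-separation picture from the slide, in miniature.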
Now here's the amazing thing about neural networks.
There are several and I'll talk about them.
One, on the mathematical side, is the universality of neural networks:
if you stack neurons together with just a single hidden layer,
the inputs on the left, the outputs on the right,
and in the middle a single hidden layer,
it can closely approximate any function. Any function.
So this is an incredible property:
with a single hidden layer, any function you could think of.
And you could think of driving as a function:
it takes as input
the world outside and, as output,
the controls of the vehicle.
There exists a neural network out there that can drive perfectly.
It's a fascinating mathematical fact.
So we can think of these functions then as special purpose functions,
special purpose intelligence.
You can take, say as input,
the number of bedrooms, the square feet,
the type of neighborhood.
Those are the three inputs.
It passes that value through to the hidden layer.
And then one more step.
It produces the final price estimate for the house or for the residence.
And we can teach a network to do this pretty well in a supervised way.
This is supervised learning.
You provide a lot of examples
where you know the number of bedrooms, the square feet,
the type of neighborhood
and then you also know the final price of the house or the residence.
And then you can, as I'll talk about through a process of back propagation,
teach these networks to make this prediction pretty well.
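A minimal version of that supervised setup can be written out directly. Everything here is a toy: the "housing" data is synthetic, the architecture (one hidden layer of 8 tanh units) and learning rate are arbitrary choices, and the backpropagation is done by hand rather than with a framework:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic supervised examples: [bedrooms, sqft (scaled), neighborhood type]
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = (3.0 * X[:, 0] + 5.0 * X[:, 1] + 2.0 * X[:, 2]).reshape(-1, 1)  # "price"

# One hidden layer: input -> 8 tanh units -> single price output
W1, b1 = rng.normal(0.0, 0.5, (3, 8)), np.zeros(8)
W2, b2 = rng.normal(0.0, 0.5, (8, 1)), np.zeros(1)

lr = 0.05
losses = []
for step in range(2000):
    h = np.tanh(X @ W1 + b1)                 # hidden layer
    pred = h @ W2 + b2                       # price estimate
    err = pred - y
    losses.append(float((err ** 2).mean()))  # mean squared error
    # Backpropagation: chain rule from the loss back through both layers
    g_pred = 2.0 * err / len(X)
    g_W2, g_b2 = h.T @ g_pred, g_pred.sum(axis=0)
    g_h = (g_pred @ W2.T) * (1.0 - h ** 2)   # tanh derivative
    g_W1, g_b1 = X.T @ g_h, g_h.sum(axis=0)
    W2 -= lr * g_W2; b2 -= lr * g_b2
    W1 -= lr * g_W1; b1 -= lr * g_b1

print(losses[0], losses[-1])  # the prediction error shrinks with training
```

The point is the shape of the procedure — known inputs, known outputs, gradients pushing the weights toward better predictions — not this particular network.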
Now some of the exciting breakthroughs recently
have been in the general purpose intelligence.
This is from Andrej Karpathy, who is now at OpenAI.
I would like to take a moment here to try to explain how amazing this is.
This is a game of "pong".
If you're not familiar with "pong", there are two paddles
and you're trying to bounce the ball back
and in such a way that prevents the other guy from bouncing the ball back at you.
The artificial intelligence agent is on the right in green
and up top is the score 8-1.
Now this takes about three days to train
on a regular computer, this network.
What is this network doing?
It's called the Policy Network.
The input is the raw pixels.
It's slightly processed, and you also take the difference between two frames,
but it's basically the raw pixel information.
That's the input.
There's a few hidden layers
and the output is the single probability of moving up.
That's it. That's the whole system and what it's doing is, it learns.
You don't know at any one moment,
you don't know what the right thing to do is.
Is it to move up? Is it to move down?
You only know what the right thing to do is
by the fact that eventually you win or lose the game.
So this is the amazing thing here: there's no supervised learning.
There's no universal fact about any one state being good or bad,
or any one action being good or bad in that state.
But you punish or reward every single action you took,
every single action you took, for an entire game,
based on the result. So no matter what you did, if you won the game,
the end justifies the means.
If you won the game, every action you took, every state-action pair, gets rewarded.
If you lost the game, it gets punished.
And through this process, with only two hundred thousand games,
where the system just simulates the games, it can learn to beat the computer.
This system knows nothing about "pong", nothing about games.
This is general intelligence,
except for the fact that it's just the game of "pong".
And I will talk about how this can be extended further,
why this is so promising
and why we should proceed with caution.
So again, there's a set of actions you take, up, down, up, down,
based on the output of the network.
There's a threshold: given the probability of moving up,
you move up or down based on the output of the network.
And you have a set of states
and every single state action pair is rewarded if there's a win
and it's punished if there's a loss.
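The credit-assignment step just described — every state-action pair in the episode gets the eventual win or loss — can be sketched as the standard discounted-return computation. The function name and episode are illustrative, not from Karpathy's actual code:

```python
def assign_returns(rewards, gamma=0.99):
    """Propagate the sparse end-of-game reward back to every earlier action.

    rewards: one entry per time step, almost all 0.0, with +1.0 or -1.0
    when a point is finally won or lost (as in the pong example)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # discounted running sum
        returns[t] = running
    return returns

# A 5-step episode ending in a win: with gamma=1.0, every earlier action
# is credited with the full win -- "the end justifies the means"
print(assign_returns([0, 0, 0, 0, 1.0], gamma=1.0))  # → [1.0, 1.0, 1.0, 1.0, 1.0]
```

In policy-gradient training, these per-step returns then scale the gradient of each action's log-probability, rewarding or punishing every move in proportion to the outcome.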
When you go home, think about how amazing that is,
and if you don't understand why that's amazing,
spend some time on it.
It's incredible.
(Inaudible question from one of the attendees)
Sure, sure thing.
The question was: "What is supervised learning?
What is unsupervised learning? What's the difference?"
So supervised learning is,
when people talk about machine learning they mean supervised learning most of the time.
Supervised learning is
learning from data, is learning from example.
You have a set of inputs and a set of outputs that you know are correct,
called the ground truth.
So you need those examples, a large amount of them,
to train any of the machine learning algorithms
to learn to then generalize that to future examples.
Actually, there's a third one called Reinforcement Learning where the Ground Truth is sparse.
The information about when something is good or not,
the ground truth only happens every once in a while, at the end of the game.
Not every single frame.
And unsupervised learning is when you have no information
about the outputs,
whether they are correct or incorrect.
The excitement of the deep learning community is in unsupervised learning,
but it has achieved no major breakthroughs at this point.
I'll talk about what the future of deep learning is
and a lot of the people that are working in the field are excited by it.
But right now, any interesting accomplishment has to do with supervised learning.
(Partially inaudible question from one of the attendees)
So basically, the reinforcement learning here is learning against an opponent with certain behaviors;
how can that be guaranteed to generalize to a different opponent?
So the question was this:
the green paddle learns to play this game successfully
against this specific one brown paddle operating under specific kinds of rules.
How do we know it can generalize to other games, other things? And it can't.
But the mechanism by which it learns generalizes.
So as long as you let it play,
as long as you let it play in whatever world you wanted it to succeed in long enough,
it will use the same approach to learn to succeed in that world.
The problem is this works for worlds you can simulate well.
Unfortunately, one of the big challenges of neural networks
is they're not currently efficient learners.
We need a lot of data to learn anything.
Human beings oftentimes need just one example,
and they learn very efficiently from that one example.
And again I'll talk about that as well, it's a good question.
So the drawbacks of neural networks.
So if you think about the way a human being would approach this game,
this game of "pong", it would only need a simple set of instructions.
You're in control of a paddle and you can move it up and down.
And your task is to bounce the ball past the other player controlled by AI.
Now the human being would immediately, they may not win the game
but they would immediately understand the game
and would be able to successfully play it well enough
to pretty quickly learn to beat the game.
But they would need to have a concept of control,
what it means to control a paddle; a concept of a paddle;
a concept of moving up and down,
of a ball and of bouncing.
They have to have at least a loose concept of real-world physics
that they can then project onto this two-dimensional world.
All of these concepts are concepts that you come to the table with.
That's knowledge.
And the way you transfer that knowledge from your previous experience,
from childhood to now, when you come to this game,
that is something called reasoning.
Whatever reasoning means.
And the question is whether, through this same kind of process,
you can see the entire world as a game of "pong",
and reasoning is simply the ability to simulate that game in your mind
and learn very efficiently, much more efficiently than 200,000 iterations.
The other challenge of deep neural networks, and machine learning broadly,
is that you need big data, because they are inefficient learners, as I said.
And that data also needs to be supervised data.
You need to have ground truth, and annotation is very costly.
A human being looking at a particular image, for example,
and labeling it as a cat or a dog,
whatever object is in the image,
that's very costly.
And particularly for neural networks there's a lot of parameters to tune.
There's a lot of hyper-parameters.
You need to figure out the network structure first.
How does this network look, how many layers?
How many hidden nodes?
What type of activation function for each node?
There's a lot of hyper-parameters there
and then once you've built your network,
there's parameters for how you teach that network.
There's the learning rate, the loss function, the batch size,
the number of training iterations, how the gradient updates are made,
and even selecting the optimizer with which
you solve the underlying optimization problem.
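The two families of choices — architecture hyper-parameters and training hyper-parameters — can be laid out as a simple configuration sketch. Every name and value here is an arbitrary illustration, not a recommendation:

```python
# Illustrative (and far from exhaustive) hyper-parameter choices.
# Architecture: decided before the network is built
architecture = {
    "hidden_layers": 3,        # how many layers?
    "hidden_units": 128,       # how many hidden nodes per layer?
    "activation": "relu",      # what activation function for each node?
}
# Training: decided before (and tuned during) training
training = {
    "learning_rate": 1e-3,
    "loss": "cross_entropy",   # the loss function
    "batch_size": 32,
    "epochs": 20,              # number of training iterations over the data
    "optimizer": "adam",       # which gradient-update rule to use
}
print(len(architecture) + len(training))  # → 8
```

Even this short list gives eight knobs to tune, and each interacts with the others — which is why getting a network to work well is research-paper material in its own right.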
It's a topic of many research papers, certainly it's rich enough for research papers,
but it's also really challenging.
It means you can't just drop a network in
and have it solve the problem generally.
And defining a good loss function,
or in the case of "pong" or games,
a good reward function is difficult.
So here's a game. This is a recent result from OpenAI,
teaching a network to play the game of Coast Runners.
And the goal of coast runners
is that you're in a boat, and the task is to go around the track
and successfully complete a race against the other players you're racing against.
Now this network is not an optimal one.
And what it's figured out is that actually, in the game,
it gets a lot of points for collecting certain objects along the path.
So you see it's figured out to go in a circle and collect those green turbo things.
And what it's figured out is that you don't need to complete the race to earn the reward.
And despite being on fire and hitting the wall and going through this whole process,
it's actually achieved at least the local optima
given the reward function of maximizing the number of points.
And so it's figured out a way to earn a higher reward
while ignoring the implied bigger-picture goal of finishing the race,
which we as humans understand much better.
This raises, for self-driving cars, ethical questions,
besides other open questions.
(CHUCKLING)
We could watch this for hours, and it will do that for hours, and that's the point:
it's hard to teach, it's hard to formally encode, the utility function under which
an intelligent system needs to operate.
And that's made obvious even in a simple game.
And so what is - Yup, question.
(Inaudible question from one of the attendees)
So the question was: what's an example of a local optimum for an autonomous car,
similar to the Coast Runners boat? What would be the example in the real world for an autonomous vehicle?
And it's a touchy subject.
But it would certainly have to involve
the choices we make under near-crashes and crashes,
the choices a car makes about what to avoid.
For example, if there's a crash imminent
and there's no way you can stop
to prevent the crash, do you keep the driver safe
or do you keep the other people safe.
And there has to be some, even if you don't choose to acknowledge it,
even if it's only in the data and the learning that you do,
there's an implied reward function there.
And we need to be aware of what that reward function is,
because the system may find something unexpected.
Until we actually see it, we won't know it.
Once we see it, we realize that oh that was a bad design
and that's the scary thing.
It's hard to know ahead of time what that is.
So the recent breakthroughs in deep learning came from several factors.
First is the compute, Moore's Law.
CPUs are getting faster, a hundred times faster every decade.
Then there are GPUs.
The ability to train neural networks on GPUs, and now on ASICs,
has created a lot of capability in terms of energy efficiency
and being able to train larger networks more efficiently.
Then, in the 21st century, there's digitized data.
There are larger data sets of digital data,
and now that data is becoming more organized:
not just data vaguely available out there on the internet,
but actual organized data sets like ImageNet.
Certainly for natural languages there's large data sets.
There are the algorithmic innovations:
backpropagation, convolutional neural networks, LSTMs,
All these different architectures for dealing with specific types of domains and tasks.
Then there is the huge one: infrastructure,
on both the software and the hardware side.
There's Git, the ability to share software in an open source way.
There are pieces of software that make robotics and make machine learning easier.
ROS, TensorFlow.
There is Amazon Mechanical Turk
which allows for efficient, cheap annotation of large scale data sets.
There's AWS and cloud hosting for machine learning, hosting the data and the compute.
And then there's a financial backing of large companies - Google, Facebook, Amazon.
But really, nothing has changed.
There really have not been any significant breakthroughs.
Convolutional networks have been around since the '90s;
neural networks have been around since the '60s.
There have been a few improvements,
but, in terms of methodology,
the compute has really been the workhorse.
The ability to get a hundredfold improvement every decade
holds promise, and the question is whether, for that reasoning thing I talked about,
all you need is a larger network.
That is the open question.
Some terms for deep learning.
First of all, deep learning is a PR term for neural networks.
It is a term for utilizing deep neural networks,
neural networks that have many layers.
It is a symbolic term for the newly gained capabilities that compute has brought us,
that training on GPUs has brought us.
So deep learning is a subset of machine learning.
There's many other methods that are still effective.
The terms that will come up in this class are, first of all, Multilayer Perceptron (MLP),
Deep neural networks (DNN), Recurrent neural networks (RNN),
LSTM (Long Short-Term Memory) Networks, CNN and ConvNet (Convolutional neural networks),
Deep Belief Networks.
And the operations that will come up are convolution, pooling, activation functions, and backpropagation.
Yes, you've got a question?
(Inaudible question from one of the attendees)
So the question was, what is the purpose of the different layers in neural network?
What is the need of one configuration versus another?
So in a neural network with several layers,
the only thing you have an understanding of is the inputs and the outputs.
You don't have a good understanding of what each layer does.
They are mysterious things, neural networks.
So I'll talk about how, with every layer, it forms a higher-level,
higher-order representation of the input.
So it's not like the first layer does localization,
the second layer does path planning,
the third layer does navigation - how you get from here to Florida -
or maybe it does, but we don't know.
We are beginning to visualize neural networks for simple tasks,
like classifying cats versus dogs on ImageNet.
We can tell what it is that the first layer does, the second layer, the third layer,
and we can look at that.
But for driving, where the input is just the images and the output is the steering,
it's still unclear what is learned,
partially because we don't have neural networks that drive successfully yet.
(Points to a member of the class)
(Inaudible question)
So the question was, does a neural network generate layers over time, like does it grow it?
That's one of the challenges, that a neural network is pre-defined.
The architecture, the number of nodes, the number of layers. That's all fixed.
Unlike the human brain where the neurons die and are born all the time.
A neural Network is pre-specified, that's it.
That's all you get, and if you want to change the architecture,
you have to redefine it and then retrain everything.
So it's fixed.
So what I encourage you to do is proceed with caution,
because there's this feeling when you first teach a network, with very little effort,
to do some amazing task like classifying a face versus a non-face,
or your face versus other faces, or cats versus dogs: it's an incredible feeling.
And then there's definitely this feeling that you're an expert,
but what you realize is that we don't actually understand how it works.
And getting it to perform well for more generalized task,
for larger scale data sets, for more useful applications,
requires a lot of hyper-parameter tuning.
Figuring out how to tweak little things here and there,
and still, in the end, you don't understand why it works so damn well.
So deep learning, these deep neural network architectures, is representation learning.
This is the difference from traditional machine learning methods.
For example, take a task where an image is the input.
The input to the network here is on the bottom, the output up on top,
and the input is a single image, of a person in this case.
Specifically, the input is all the pixels in that image:
RGB, the different color values of the pixels in the image.
And layer by layer, what the network does is build a hierarchical representation of this data.
The first layer learns the concept of edges, for example.
The second layer starts to learn compositions of those edges: corners, contours.
Then it starts to learn about object parts.
And finally, it actually provides a label for the entities that are in the input.
And this is the difference with traditional machine learning methods,
where concepts like edges and corners and contours
are manually pre-specified by human beings, human experts, for that particular domain.
And representation matters: figuring out a line
in the Cartesian coordinates of this particular data set,
where you want to design a machine learning system
that tells the difference between green triangles and blue circles, is difficult.
There is no line that separates them cleanly.
And if you were to ask a human being, a human expert in the field,
to try to draw that line, they would probably do a Ph.D. on it and still not succeed.
But a neural network can automatically figure out how
to remap that input into polar coordinates,
where the representation is such that the data set is easily linearly separable.
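That remapping can be sketched numerically. This is a hedged illustration of the idea, not the network's actual learned mechanism: points that no straight line separates in Cartesian coordinates become separable by a single threshold once re-expressed in polar coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner cluster ("green triangles") and outer ring ("blue circles"):
# no straight line in Cartesian (x, y) separates them.
n = 500
theta = rng.uniform(0, 2 * np.pi, n)
r_inner = rng.uniform(0.0, 1.0, n)      # radii below 1
r_outer = rng.uniform(2.0, 3.0, n)      # radii above 2
inner = np.column_stack([r_inner * np.cos(theta), r_inner * np.sin(theta)])
outer = np.column_stack([r_outer * np.cos(theta), r_outer * np.sin(theta)])

def to_polar(xy):
    """Remap (x, y) -> (r, theta)."""
    r = np.hypot(xy[:, 0], xy[:, 1])
    t = np.arctan2(xy[:, 1], xy[:, 0])
    return np.column_stack([r, t])

# In polar space, a single linear cut on the r axis separates the classes.
r_all = np.concatenate([to_polar(inner)[:, 0], to_polar(outer)[:, 0]])
labels = np.concatenate([np.zeros(n), np.ones(n)])
pred = (r_all > 1.5).astype(float)
accuracy = (pred == labels).mean()
print(accuracy)  # 1.0
```

The point of the slide is that the network discovers a remapping like `to_polar` on its own, rather than having a human design it.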
And so, deep learning is a subset of representation learning,
which is a subset of machine learning, which is a key subset of artificial intelligence.
Now, this matters because of the network's ability
to discover an arbitrary number of the features
that are at the core of the representation.
If you are trying to detect a cat in an image,
you're not specifying 215 specific features of cat ears and whiskers and so on
that a human expert would specify; you allow the network
to discover tens of thousands of such features.
Maybe for cats you are an expert,
but for a lot of objects you may never be able to sufficiently provide the features
which will successfully be used for identifying the object.
And so, this kind of representation learning,
one, is easy in the sense that all you have to provide is inputs and outputs.
All you need to provide is a data set you care about, without specifying the features.
And two, because of its ability to construct arbitrarily sized representations,
deep neural networks are hungry for data.
The more data we give them,
the more they are able to learn about this particular data set.
So let's look at some applications.
First, some cool things that deep neural networks have been able to accomplish up to this point.
Let me go through them.
First, the basic one.
ImageNet is a famous data set and a competition for classification
and localization, where the task is: given an image,
identify the five most likely things in that image,
and the single most likely one, and you have to do so correctly.
So on the right, there's an image of a leopard
and you have to correctly classify that that is in fact the leopard.
So they're able to do this pretty well given a specific image.
Determine that it's a leopard.
What's shown here on the x-axis is years,
and on the y-axis is classification error.
Starting from 2012 on the left with AlexNet, and up to today,
the error has decreased from 16%, and around 40% before then with traditional methods,
to under 4%.
So human-level performance:
if I were to give you pictures of leopards like this one,
for about 4% of those pictures you would not say it's a leopard.
That's human-level performance.
So for the first time, in 2015, convolutional neural networks outperformed human beings.
That in itself is incredible. That is something that seemed impossible.
And now, because it's been done, it's not as impressive.
But I just want to get to why this is so impressive
because computer vision is hard.
Now we as human beings have evolved visual perception over millions of years,
hundreds of millions of years.
So we take it for granted but computer vision is really hard, visual perception is really hard.
There's illumination variability.
It's the same object,
but the only visual information you get is the shading, the reflection of light from that surface.
It could be the same object looking drastically different,
in terms of pixels, and we still know it's the same object.
There is pose variability and occlusion.
Probably my favorite caption for a figure
in an academic paper is "deformable and truncated cat."
These are pictures, you know cats are famously deformable.
They can take a lot of different shapes.
(LAUGHTER)
Arbitrary poses are possible, so computer vision has
to know it's still the same object, still the same class of objects,
given all the variability in pose, and occlusion is a huge problem.
We still know it's an object.
We still know it's a cat even when parts of it are not visible.
And sometimes large parts of it are not visible.
And then there's all the intra-class variability.
All of these on the top two rows are cats.
Many of them look drastically different.
And the bottom two rows are dogs, which also look drastically different.
And yet some of the dogs look like cats,
some of the cats look like dogs: inter-class similarity.
We as human beings are pretty good at telling the difference,
and we want computer vision to do better than that.
It's hard. So how is this done? This is done with convolutional neural networks.
The input to which is a raw image.
Here's an input on the left, an image of the number three,
and, as I'll talk about,
that image is passed through convolutional layers,
which maintain spatial information.
The output, in this case, predicts
what number is shown in the image:
0, 1, 2, through 9.
And everybody's using the same kind of network to determine exactly that.
Input is an image, output is a number,
or, in the leopard's case, the probability that the image is a leopard.
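The convolutional layers just mentioned are built on one core operation: a small filter sliding over the image. A minimal hand-rolled sketch of that operation, not any particular library's implementation, shows how a single filter produces a spatially arranged feature map:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: the core op of a convolutional layer."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the kernel against each window and sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 image with a vertical edge between columns 1 and 2.
img = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

# A vertical-edge filter: responds where intensity jumps left to right.
edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

fmap = conv2d(img, edge_kernel)
print(fmap)  # strong responses (3.0) where the window straddles the edge
```

Because the same small filter is applied at every position, the output stays spatially organized, which is exactly the "maintain spatial information" property mentioned above.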
Then there is segmentation, built on top of these convolutional neural networks,
where you chop off the end and make the network fully convolutional.
You chop off the end so that the output is a heat map.
So instead of a detector for a cat, you can produce a cat heat map,
where the neurons in the output get excited
spatially, in the parts of the image that contain a tabby cat.
And this kind of process can be used to segment the image into different objects.
The original input on the left is a woman on a horse,
and the output is a fully segmented image, knowing where the woman is, where the horse is.
And this kind of process can be used for object detection
which is the task of detecting an object in an image.
Now the traditional method with convolutional neural networks
and in general computer vision is the sliding window approach.
We have a detector, like the leopard detector, which you slide across the image
to find where in that image the leopard is.
The segmentation approach,
the R-CNN approach, efficiently segments the image
in such a way that it can propose different parts of the image
that are likely to contain a leopard, or in this case a cowboy,
and that drastically reduces the computational requirements of the object detection task.
And currently one of the best networks for the ImageNet localization task
is the deep residual network. They're deep. VGG-19 is one of the famous ones;
you start to get above twenty layers in many cases,
and thirty-four layers in the residual network here.
So the lesson there is, the deeper you go the more representation power you have,
the higher accuracy but you need more data.
Other applications, colorization of images.
So this again, input is a single image and output is a single image.
So you can take a black and white video from a film, from an old film,
and recolor it. And all you need to do to train that network in the supervised way
is provide modern films and convert them to grayscale.
So now you have arbitrarily sized data sets of grayscale-to-color pairs.
And with very little effort on top of that, you're able to successfully,
well, somewhat successfully, recolor images.
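The supervised training setup just described, modern color footage converted down to grayscale, can be sketched in a few lines. The helper names here are illustrative, not from any real pipeline:

```python
import numpy as np

def to_grayscale(rgb):
    """Collapse an (H, W, 3) RGB frame in [0, 1] to (H, W) luma,
    using the standard Rec. 601 luma weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def make_colorization_pairs(color_frames):
    """Turn any stack of color frames into (input, target) training pairs:
    the grayscale version is the input, the original color frame the target."""
    return [(to_grayscale(f), f) for f in color_frames]

# Three random stand-in "frames"; in practice these come from modern films.
frames = [np.random.default_rng(i).random((4, 4, 3)) for i in range(3)]
pairs = make_colorization_pairs(frames)
gray, color = pairs[0]
print(gray.shape, color.shape)  # (4, 4) (4, 4, 3)
```

This is why the data set is "arbitrarily sized": every color video you can find becomes free supervised training data, with no human labeling at all.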
Again, Google Translate does image translation in this way, image to image.
It first perceives, here in German I believe (German speakers, correct me if I'm wrong),
"dark chocolate" written in German on a box.
So it takes this image, detects the different letters, converts them to text,
translates the text, and then, using the image-to-image mapping,
maps the translated letters back onto the box, and you can do this in real time on video.
So what we've talked about up to this point on the left are "vanilla" neural networks,
convolutional neural networks, that map a single input to a single output:
a single image to a number, or a single image to another image.
Then there are recurrent neural networks.
In the more general formulation,
they map a sequence of images,
or a sequence of words,
or a sequence of any kind, to another sequence.
And these networks are able to do incredible things with natural language,
with video, and any type of series of data.
For example, you can convert typed text to handwritten text.
Here, you type in and you can do this online, type in deep learning for self-driving cars
and it will use an arbitrary handwriting style to generate the words "deep learning for self-driving cars".
This is done using recurrent neural networks.
There are also what are called Char-RNNs, character-level recurrent neural networks,
that train on an arbitrary text data set
and learn to generate text one character at a time.
So there is no preconceived syntactical semantic structure that's provided to the network.
It learns that structure.
So for example, you can train it on Wikipedia articles like in this case.
And it's able to generate not only text that at least makes some kind of grammatical sense,
but also text that keeps perfect syntactic structure for Wikipedia, for Markdown editing,
for LaTeX editing, and so on.
This text reads, "naturalism and decision for the majority of Arab countries capitalide,"
whatever that means, "was grounded by the Irish language by John Clare," and so on.
These are sentences that, if you didn't know better, might sound correct.
And it does so one character at a time, so these aren't words being generated.
This is one character at a time: you start with the first three letters "nat",
and it generates "u" without any knowledge of the word "naturalism."
This is incredible.
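The one-character-at-a-time generation loop can be illustrated without the RNN itself. In this sketch, a toy bigram table stands in for the trained network's next-character distribution; the sampling loop, feed in the last character, sample the next, repeat, has the same shape a Char-RNN uses:

```python
import random
from collections import defaultdict

# Toy stand-in for a trained char-RNN: a table recording, for each
# character, the characters observed to follow it in the training text.
text = "deep learning for self-driving cars. deep learning for deep nets."
follows = defaultdict(list)
for a, b in zip(text, text[1:]):
    follows[a].append(b)

def generate(seed, length, rng):
    """Generate text one character at a time: condition on the last
    character, sample the next from the (stand-in) predicted distribution."""
    out = list(seed)
    for _ in range(length):
        nxt = rng.choice(follows[out[-1]] or [" "])
        out.append(nxt)
    return "".join(out)

rng = random.Random(0)
sample = generate("de", 40, rng)
print(sample)
```

A real Char-RNN replaces the bigram lookup with a learned hidden state carried across steps, which is how it keeps long-range structure like matching brackets, something this toy model cannot do.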
You can do this to start a sentence and let the neural network complete that sentence.
So for example, if you start the sentence with "life is", or "life is about" actually,
it will complete it with a lot of fun things: "the weather," "Life is about kids,"
"Life is about the true love of Mr. Mom," "is about the truth now."
And this is from [01:05:59], the last two,
if you start with "the meaning of life," it can complete that with
"the meaning of life is literary recognition" may be true for some of us here.
Publish or perish.
And "the meaning of life is the tradition of ancient human reproduction."
(LAUGHTER)
Also true for some of us here. I'm sure.
Okay, so what else can you do?
Something that has been very exciting recently is image caption recognition. No, generation, I'm sorry.
Image caption generation is important for large data sets of images,
where we want to be able to determine what's going on inside those images.
Especially for search: if you want to find a man sitting on a couch with a dog,
you type it into Google and it's able to find that.
So here shown in black text a man sitting on a couch with a dog is generated by the system.
A man sitting in a chair with a dog in his lap is generated by a human observer.
And again, these annotations are done by detecting the different objects in the scene.
So, segmenting the scene: on the right it detects "woman," "crowd," "cat,"
"camera," "holding," "purple."
All of these words are detected, then many syntactically correct sentences are generated,
and then you rank which sentence is the most likely.
And in this way you can generate very accurate labeling of the images,
captions for the images.
And you can do the same kind of process for image question answering.
You can ask how many for quantity, how many chairs are there?
You can ask about location, where are the ripe bananas?
You can ask about the type of object.
What is the object in the chair? It's a pillow.
And these are, again, using recurrent neural networks.
You can do the same thing with video caption generation,
video description generation,
looking at a sequence of images as opposed to just a single image.
What is the action going on in this situation?
This is a difficult task, and there's a lot of work in this area.
On the left are correct descriptions: "a man is doing stunts on his bike,"
or "a herd of zebra are walking in the field." On the right are incorrect ones:
"there's a small bus running into a building,"
where it's talking about relevant entities but producing an incorrect description,
or "a man is cutting a piece of a pair of paper."
So the words are close, perhaps, but the descriptions are mostly wrong.
One of the interesting things
you can do with recurrent neural networks
relates to the way we look at images. When human beings look at images,
we only have a small fovea with which we focus on a scene.
So right now your periphery is very distorted.
If you're looking at the slides, or looking at me,
that's the only thing that's in focus.
The majority of everything else is out of focus.
So we can use the same kind of concept to try to teach a neural network to steer around the image.
Both for perception and generation of those images.
This is important, first, from a general artificial intelligence standpoint,
it being just fascinating that we can selectively steer our attention,
but it's also important for things like drones.
They have to fly at high speed in an environment
where you have to make decisions at three-hundred-plus frames a second.
So you can't possibly localize yourself or perceive the world around you successfully
if you have to interpret the entire scene.
So what you can do is steer: shown here, for example, is reading a house number
by steering attention around an image.
You can do the same task for reading and for writing.
So this data set on the left is about reading numbers.
We can also selectively steer a network around an image to generate that image,
starting with a blurred image first and then getting higher and higher resolution
as the steering goes on.
Work here at MIT is able to map video to audio.
So given silent video of a drumstick hitting objects, it's able to generate the sound
that the drumstick hitting that particular object makes.
So you can get texture information from that impact.
So here is the video of a human soccer player playing soccer
and a state-of-the-art machine playing soccer.
And, well let me give it some time,
to build up.
(LAUGHTER)
Okay. So soccer: we take this for granted, but walking is hard.
Object manipulation is hard. Soccer is much harder than chess for a machine to do.
On your phone now, you can have a chess engine that beats the best players in the world.
And you have to internalize that because the question is,
this is a painful video, the question is: where does driving fall?
Is it closer to chess, or is it closer to soccer?
For those incredible, brilliant engineers that worked on the most recent DARPA challenge
this would be a very painful video to watch, I apologize.
This is a video from the DARPA Challenge
(LAUGHTER)
of robots struggling
with basic object manipulation and walking tasks.
So it's mostly a fully autonomous navigation task.
(LAUGHTER)
Maybe I'll just let this play for a few moments, to let you internalize how difficult this task is:
balancing, planning in an underactuated system.
We don't have full control of everything.
When there is a delta between your perception of what you think the world is and what reality is.
So there, a robot was trying to turn an object that wasn't there.
And this is an MIT entry that actually, I believe, successfully got points for this
because it got into that area
(LAUGHTER)
One of the things the robot had to do is get into a car, drive it, and get out of the car.
And there were a few other manipulation tasks, like walking on unsteady ground
and drilling a hole through a wall.
Of all these tasks, what a lot of teams said was the hardest part, the hardest task of all,
was getting out of the car.
Not getting into the car: it's this very task you saw now, the robot getting out of the car.
These are things we take for granted.
So in our evaluation of what is difficult about driving,
we have to remember that some of those things we may take for granted,
in the same kind of way that we take walking for granted. This is Moravec's paradox,
from Hans Moravec at CMU. Let me just quickly read that quote:
"Encoded in the large highly evolved sensory motor portions of the human brain
is billions of years of experience about the nature of the world and how to survive in it."
So this is data. This is big data: billions of years. And abstract thought, which is reasoning,
the stuff we think of as intelligence, is perhaps
less than one hundred thousand years old.
We haven't yet mastered it and so,
I'm sorry I'm asserting my own statements in the middle of a quote,
but it's been very recent that we've learned how to think.
And so we respect it perhaps more than the things we take for granted,
like walking and visual perception, but those may be strictly a matter of data,
data and training time and network size.
So walking is hard.
The question is how hard is driving?
And that's an important question because the margin of error is small.
For one, there is 1 fatality per 100 million miles.
That's the rate at which people die in car crashes:
1 fatality per 100 million miles.
That's a 0.000001% margin of error.
Through all the time you spend on the road, that is the error you get.
We're impressed with ImageNet being able to classify a leopard, a cat or a dog
at above-human-level performance, but this is the margin of error we need for driving.
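As a quick sanity check of that percentage:

```python
# 1 fatality per 100 million miles, expressed as a fraction and a percent.
miles_per_fatality = 100_000_000
fatalities_per_mile = 1 / miles_per_fatality
as_percent = fatalities_per_mile * 100   # fraction -> percent
print(f"{as_percent:.6f}%")  # 0.000001%
```

So the quoted 0.000001% figure is just 1 in 10^8 rewritten as a percentage.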
And we have to be able to deal with snow, with heavy rain, with big open parking lots,
with parking garages, with pedestrians who behave irresponsibly, as rarely as that happens,
or just unpredictably, again especially in Boston. With reflections.
And especially some things you don't think about:
the lighting variations that blind the cameras.
(Inaudible question from one of the attendees)
The question was if that number changes, if you look at just crashes, the fatalities per crash.
So one of the big things is that cars have gotten really good at crashing and not hurting anybody.
So the number of crashes is much, much larger than the number of fatalities
which is a great thing, we've built safer cars.
But still, you know even one fatality is too many.
So here is one example. The Google self-driving car team
is quite open about their performance since hitting public roads.
This is from a report that shows the number of times
the driver disengaged:
the car gives up control
and asks the driver to take control back,
or the driver takes control back by force,
meaning they're unhappy with the decision the car was making,
or it was putting the car, or pedestrians, or other cars in unsafe situations.
And so, if you look over time,
from 2014 to 2015,
there's been a total of 341 times on beautiful San Francisco roads,
and I say that seriously because the weather conditions are great there,
341 times that the driver had to take control back.
So it's a work in progress.
And let me give you something to think about here.
This, with neural networks is a big open question.
The question of robustness.
So this is an amazing paper, I encourage people to read it.
There's a couple of papers around this topic.
Deep neural networks are easily fooled.
So here are 8 images where, if given to a convolutional neural network as input,
the network says with higher than 99.6% confidence
that the image, for example the top left, is a robin.
Next to it a cheetah, then an armadillo, a panda, an electric guitar,
a baseball, a starfish, a king penguin.
All of these things are obviously not in the images.
So the networks can be fooled with noise.
More importantly, practically for the real world, adding just a little bit of distortion,
a little bit of noise distortion to the image, can force the network to produce a totally wrong prediction.
So here's an example with three columns:
the correctly classified image, the slight distortion that gets added,
and the resulting distorted image,
which the network predicts to be an ostrich, for all of the images shown.
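The papers in this area typically construct the distortion from the gradient of the loss with respect to the input (the fast-gradient-sign idea). A minimal sketch on a toy logistic-regression "network", chosen so the gradient sign is obvious, and not the actual image models from these papers, shows how a small signed perturbation flips the prediction:

```python
import numpy as np

# Toy "network": logistic regression, p = sigmoid(w . x + b).
w = np.array([1.0, -2.0, 0.5, 3.0])
b = -0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    return sigmoid(w @ x + b)

x = np.array([0.4, -0.3, 0.2, 0.1])   # clean input, classified as class 1
p_clean = predict(x)

# Fast-gradient-sign-style perturbation: nudge every input dimension by a
# small step EPS in the direction that increases the loss for the true
# label. For this model, the input gradient of the class-1 loss is
# -(1 - p) * w, so its sign is simply -sign(w).
EPS = 0.25
x_adv = x + EPS * (-np.sign(w))
p_adv = predict(x_adv)

# The perturbation is tiny per dimension, yet the prediction flips.
print(p_clean > 0.5, p_adv > 0.5)
```

In a deep image classifier the same trick yields a distortion that is nearly invisible to a human but, because it is aligned with the network's gradient everywhere at once, changes the output class entirely.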
This ability to fool networks easily brings up an important point.
And that point is that there has been a lot of excitement
about neural networks throughout their history.
There's been a lot of excitement about artificial intelligence throughout its history
and not grounding that excitement in reality,
in the real challenges around it, has resulted in crashes, in A.I. winters when funding dried up
and people lost hope in the possibilities of artificial intelligence.
So here is the 1958 New York Times article that said the Navy revealed the embryo of an electronic computer today.
This is when the first perceptron that I talked about
was implemented in hardware by Frank Rosenblatt.
It took a 400-pixel image as input and provided a single output.
Weights were encoded in hardware potentiometers,
and weights were updated with electric motors.
The New York Times wrote that the Navy revealed the embryo of an electronic computer today
that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.
Dr. Frank Rosenblatt, a research psychologist at the Cornell Aeronautical Laboratory in Buffalo,
said perceptrons might be fired to the planets as mechanical space explorers.
This might seem ridiculous but this is the general opinion of the time.
And as we know now, perceptrons cannot even separate data that isn't linearly separable.
They're just linear classifiers.
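That limitation is easy to demonstrate: the classic Rosenblatt update rule learns AND perfectly, but it can never get XOR right, because no line separates XOR's classes. A small sketch:

```python
import itertools

def train_perceptron(data, epochs=50, lr=0.1):
    """Classic Rosenblatt perceptron: linear weights plus a threshold."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in data:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out
            # Nudge weights toward the correct answer.
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def accuracy(data, w, b):
    hits = sum((1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == t
               for (x1, x2), t in data)
    return hits / len(data)

inputs = list(itertools.product([0, 1], repeat=2))
AND = [(x, int(x[0] and x[1])) for x in inputs]   # linearly separable
XOR = [(x, x[0] ^ x[1]) for x in inputs]          # not linearly separable

acc_and = accuracy(AND, *train_perceptron(AND))
acc_xor = accuracy(XOR, *train_perceptron(XOR))
print(acc_and, acc_xor)
```

No matter how long you train, a single linear unit cannot exceed 3 out of 4 on XOR; solving it requires a hidden layer, which is exactly the step from the perceptron to the multilayer networks this class covers.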
And so this led to two major A.I. winters: in the 70s, and in the late 80s and early 90s.
The Lighthill Report, commissioned in 1973 by the UK government, said that in no part of the field
have the discoveries made so far produced the major impact that was then promised.
So if the hype builds beyond the capabilities of our research,
reports like this will come and they have the possibility of creating another A.I. winter.
So I want to pair the optimism, some of the cool things we'll talk about in this class,
with the reality of the challenges ahead of us.
As for the focus of the research community, these are some of the key players in deep learning:
what are the things that are next for deep learning, the five-year vision?
We want to run on smaller, cheaper mobile devices.
We want to explore more in the space of unsupervised learning as I mentioned
and reinforcement learning.
We want to do things that explore the space of videos more
with recurrent neural networks, like being able to summarize videos or generate short videos.
One of the big efforts, especially in companies dealing with large data,
is multi-modal learning:
learning from multiple data sets, with multiple sources of data.
And lastly, making money from these technologies.
Despite the excitement,
there has been an inability, for the most part, to make serious money
from some of the more interesting parts of deep learning.
And while I got made fun of by the TAs for including this slide,
because it's shown in so many sort of business-type lectures,
it is true that we're at the peak of a hype cycle,
and given the large amount of hype and excitement there is,
we have to make sure we proceed with caution.
One example of that, let me mention, is we already talked about spoofing the cameras.
Spoofing the cameras with a little bit of noise.
So if you think about it, self-driving vehicles operate with a set of sensors,
and they rely on those sensors to accurately capture information about the world.
And what happens, not only when the world itself produces noisy visual information,
but when somebody actually tries to spoof that data?
One of the fascinating things that has recently been done is spoofing of LIDAR.
LIDAR is a range sensor that gives a 3D point cloud of the objects in the external environment.
And you're able to successfully perform a replay attack, where you have the car
see people and other cars around it when there's actually nothing around it,
in the same way that you can spoof a camera, and a neural network,
into seeing things that are not there.
So let me run through some of the libraries that we'll work with,
and others that are out there that you may work with if you proceed with deep learning.
TensorFlow, that is the most popular one these days.
It's heavily backed and developed by Google.
It has primarily a Python interface and is very good at operating on multiple GPUs.
There's Keras and also TF Learn and TF Slim which are libraries that operate on top of TensorFlow
that make it slightly easier, slightly more user friendly interfaces, to get up and running.
Torch, if you're interested in getting in at the lower level,
tweaking the different parameters of neural networks
and creating your own architectures:
Torch is excellent for that, with its own Lua interface.
Lua is a programming language, and Torch is heavily backed by Facebook.
There is the old-school Theano, which is what I started on,
what a lot of people early on in deep learning started on,
as one of the first libraries that came with GPU support.
It definitely encourages lower-level tinkering and has a Python interface.
And many of these, if not all, rely on Nvidia's library
for doing some of the low level computations involved with training these neural networks on Nvidia GPUs.
MXNet is heavily supported by Amazon, and they have recently officially announced
that AWS is going to be all in on MXNet.
Neon comes from Nervana, recently bought by Intel,
a company that started out as a manufacturer of neural network chips,
which is really exciting, and it performs exceptionally well.
I hear good things.
Caffe, started at Berkeley, was also very popular at Google before TensorFlow came out.
It's primarily designed for computer vision with ConvNet's
but has now expanded to all of the domains.
There is CNTK, which used to be known by that name and is now called the Microsoft Cognitive Toolkit,
though nobody calls it that still, as far as I'm aware.
It has multi-GPU support, and has its own BrainScript custom language
as well as other interfaces.
And what we'll get to play around with in this class is, amazingly, deep learning in the browser.
Our favorite is ConvNetJS, which we use, built by Andrej Karpathy, from Stanford and now OpenAI.
It's good for explaining the basic concept of neural networks.
It's fun to play around with. All you need is a browser and some very few requirements.
It can't leverage GPUs, unfortunately.
But for a lot of the things that we're doing, you don't need GPUs.
You're able to train a network with very little, and relatively efficiently, without them.
It has full support for CNNs, RNNs, and even deep reinforcement learning.
Keras.js, which seems incredible, we tried to use for this class.
It runs in the browser with GPU support,
through WebGL, or however it works, magically.
But we're able to accomplish a lot of the things we need without the use of GPUs.
It's incredible to live in a day and age when, as I'll show in the tutorials,
it literally takes just a few minutes to get started with building your own neural network
that classifies images, and a lot of these libraries are friendly in that way.
So all the references mentioned in this presentation
are available at this link and the slides are available there as well.
So I think in the interest of time, let me wrap up.
Thank you so much for coming in today and tomorrow I'll explain the deep reinforcement learning game
and the actual competition and how you can win.
Thanks very much guys.
