  • ROSSI LUO: Good afternoon.

  • Welcome to Brown Biostatistics Seminar.

  • And I'm Rossi Luo, faculty host for today's event.

  • And for those of you new to our departmental seminar,

  • the format is usually a presentation

  • followed by a question and answer session.

  • And because of the size of crowd today, we

  • are going to also use this red box

  • thing to capture your questions and for videotaping and also

  • make sure your questions are heard.

  • And today I'm very pleased to introduce Professor Yann LeCun.

  • Professor LeCun is a director of Facebook AI Research,

  • also known as FAIR.

  • And he is also senior professor of computer science,

  • neuroscience, and electrical and computer engineering

  • at New York University.

  • He's also the founding director of NYU Center for Data Science.

  • Before joining NYU, he headed a research department

  • in industry, including AT&T and NEC.

  • Professor LeCun has made extraordinary research

  • contributions in machine learning, computer vision,

  • mobile robotics, computational neuroscience.

  • Among this, he's a pioneer in developing

  • convolutional neural networks.

  • And he is also a founding father of convolutional nets.

  • And these works contributed to, say,

  • the creation of a new and exploding field in machine learning

  • called deep learning, which is now

  • an artificial intelligence tool for a wide range

  • of applications from image to natural text processing.

  • And his research contributions

  • has earned him many honors and awards

  • including the election to the US National

  • Academy of Engineering.

  • Today he will give a seminar titled,

  • How Can Machines Learn as Efficiently as Animals

  • and Humans.

  • I understand some of you actually

  • told me you drove from Boston or many places that are very far.

  • So without further ado, let's welcome Professor Yann LeCun

  • for his talk.

  • [APPLAUSE]

  • YANN LECUN: Thank you very much.

  • It's a pleasure to be here.

  • A game I play now occasionally when I give a talk here is I

  • count how many former colleagues from AT&T are in the room.

  • I count at least two.

  • Chris Rose here, Michael Littman.

  • Maybe that's it.

  • That's pretty good, two.

  • Right.

  • So, how can machines learn as efficiently

  • as animals and humans?

  • I have a terrible confession to make.

  • AI systems today suck.

  • [LAUGHTER]

  • Here it is in a slightly less vernacular form.

  • Recently, I gave a talk at a conference at Columbia

  • called the Cognitive Computational Neuroscience

  • Conference.

  • It was the first edition.

  • And there was a keynote.

  • And before me, Josh Tenenbaum gave

  • a keynote where he said this.

  • All of these AI systems that we see now, none of them

  • are real AI.

  • And what he means by this is that none of them

  • actually learn stuff that are as complicated as what

  • humans can learn.

  • But also learn stuff as efficiently as

  • what animals seem to learn them.

  • So we don't have robots that are nearly as

  • agile as a cat for example.

  • You know, we have machines that can play Go better

  • than any humans.

  • But that's kind of not quite the same.

  • And so that tells us there are major pieces of learning

  • that we haven't figured out.

  • That animals are able to do that, we don't do--

  • we can't do with our machines.

  • And so, I'm sort of jumping ahead here

  • and telling you the punch line in advance, which

  • is that we need a new paradigm for learning,

  • or a new way of formulating the old paradigms that

  • will allow machines to learn how the world works the way animals

  • and humans do that.

  • So the current paradigm of learning

  • is basically supervised learning.

  • So all the applications of machine learning,

  • AI, deep learning, all the stuff you see the actual real world

  • applications, most of them use supervised learning.

  • There's a tiny number of them that

  • use reinforcement learning.

  • Most of them use some form of supervised learning.

  • And you know, supervised learning, we all--

  • I'm sure most of you in the room know what it is.

  • You want to build a machine that classifies cars from airplanes.

  • You show an image of a car.

  • If a machine says car, you do nothing.

  • If it says airplane, you adjust the knobs on the machine

  • so that the output gets closer to what you want.

  • And then you show an example of an airplane.

  • And you do the same.

  • And then you keep showing images of airplanes and cars,

  • millions of them, thousands of them.

  • You adjust the knobs a little bit every time.

  • And eventually, the knobs settle on a configuration,

  • if you're lucky enough, that will distinguish every car

  • from every airplane, including the ones

  • that the machine has never seen before.

  • That's called a generalization ability.
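
The knob-adjusting procedure described above can be sketched as a perceptron-style update. This is only an illustrative sketch: the two features per example, the learning rate, and the function names are assumptions and do not come from the talk.

```python
# Toy data, invented for illustration: two made-up features per image.
# Label +1 means "car", label -1 means "airplane".
train = [
    ([0.0, 1.0], +1),
    ([0.1, 0.9], +1),
    ([1.0, 0.2], -1),
    ([0.9, 0.1], -1),
]

def predict(weights, bias, x):
    """Return +1 (car) or -1 (airplane) from a weighted sum of the features."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1

def train_perceptron(examples, epochs=20, lr=0.1):
    """Show the examples repeatedly; adjust the knobs only on a wrong answer."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in examples:
            if predict(weights, bias, x) != label:
                # Nudge the knobs so the output moves toward the desired label.
                weights = [w + lr * label * xi for w, xi in zip(weights, x)]
                bias += lr * label
    return weights, bias

weights, bias = train_perceptron(train)

# Examples the machine has never seen before (generalization).
unseen_car = [0.05, 0.95]
unseen_plane = [0.95, 0.15]
print(predict(weights, bias, unseen_car), predict(weights, bias, unseen_plane))
```

With this toy data the knobs settle after a few passes, and the two unseen examples land on the correct side.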

  • And what deep learning has brought to the table there,

  • in supervised learning, is the ability

  • to build those machines more or less

  • numerically with very little sort of human input

  • in how the machine needs to be built,

  • except in very general terms.

  • So the limitation of this is that you

  • had to have lots of data that has been labeled by people.

  • And to get a machine to distinguish cars

  • from airplanes, you need to show it

  • thousands of examples.

  • And it's not the case that babies or animals

  • need thousands of examples of each category

  • to be able to recognize.

  • Now, I should say that even with supervised learning,

  • you could do something called transfer learning, where

  • you train a machine to recognize lots of different objects.

  • And then if you want to add a new object category,

  • you can just retrain with very few samples.

  • And generally it works.

  • And so what that says, what that tells

  • you is that when you train a machine,

  • you kind of figure out a way to represent the world that

  • is independent of the task somehow, even though you train

  • it for a particular task.

  • So what did deep learning bring to the table?

  • Deep learning brought to the table

  • the ability to basically train those machines

  • without having to hand craft too many modules of it.

  • The traditional way of doing pattern recognition

  • is you take an image, and you design a feature extractor that

  • turns the image into a list of numbers that can be digested

  • by a learning algorithm, regardless of what

  • your favorite learning algorithm is,

  • linear classifiers, [INAUDIBLE] machines, kernel machines,

  • trees, whatever you want, or neural nets.

  • But you have to preprocess it in a digestible way.

  • And what deep learning has allowed us to do

  • is basically design a learning machine

  • as a cascade of parametrised modules, each of which

  • computes a nonlinear function parametrised

  • by a set of coefficients, and train the whole machine end

  • to end to do a particular task.

  • And this is kind of an old idea.

  • People even in the 60s had the idea

  • that this would be great to come up

  • with learning algorithms that would train multilayer systems

  • of this type.

  • They didn't quite have the right framework if you want,

  • neither the right computers for it.

  • And so in the 80s, something came up

  • called back propagation with neural nets

  • that allowed us to do this.

  • And I'm going to come to this in a minute.

  • So the next question you can ask of course

  • is what do you put in those boxes?

  • And the simplest thing you can imagine as a nonlinear

  • function, it has to be non-linear,

  • because otherwise there's no point in stacking boxes.

  • So the simplest thing you can imagine is take an image,

  • think of it as a vector, essentially.

  • Multiply it by a matrix.

  • The coefficients of this matrix are going to be learned.

  • And you can think of every row of this matrix being

  • used to compute a dot product with an input vector.

  • And that produces basically a weighted sum

  • of the inputs multiplied by those coefficients.

  • That gives you another vector.

  • And you pass each component of this vector

  • through a non-linearity like this one, for example.

  • Just half-wave rectification.

  • So you have two different steps.

  • Linear, nonlinear.

  • Linear, pointwise nonlinear.

  • Very simple.
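
A minimal sketch of one such layer, a matrix multiply followed by a pointwise half-wave rectification (commonly called ReLU). The matrix values and the names `linear`, `relu`, and `layer` are illustrative assumptions.

```python
def linear(matrix, vector):
    """Each row of the matrix takes a dot product with the input vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """Half-wave rectification applied to each component independently."""
    return [max(0.0, v) for v in vector]

def layer(matrix, vector):
    """One layer: linear step, then pointwise nonlinear step."""
    return relu(linear(matrix, vector))

# Coefficients chosen by hand here; in a real system they would be learned.
W = [[1.0, -1.0, 0.5],
     [-2.0, 0.0, 0.5]]
x = [1.0, 2.0, 3.0]

hidden = layer(W, x)
print(hidden)  # [0.5, 0.0]: the second weighted sum is negative, so it is clipped
```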

  • And you can show that by stacking two layers of this,

  • you can approximate any function you want, as close as you want

  • as long as you have sufficiently many of these guys

  • in the middle by tweaking the parameters of the two layers.

  • But in fact, most functions we're interested in

  • are more economically represented by many layers.

  • And so that's the new approach to deep learning, if you want,

  • that changes from the neural nets of 30 years ago,

  • which typically had only two or three layers.

  • The neural nets of today, the deep learning systems of today

  • have anywhere between 20, 50, or 100 layers.

  • OK.

  • So we have linear operators that are

  • parametrized by coefficients.

  • And in supervised learning, we're basically

  • going to train it to minimize some sort of objective function

  • that's going to measure the discrepancy

  • between the output the machine produces and the output

  • we want.

  • And so the objective function is going to be differentiable.

  • What we're going to do is compute the gradient

  • of the objective function with respect

  • to all the parameters in the machine averaged

  • over a number of training samples.

  • Or if we use stochastic gradient descent,

  • averaged over a small batch of training samples,

  • or even a single sample.

  • And then take one step [INAUDIBLE]

  • to get your gradient using the stochastic gradient update

  • rule.

  • Basically, the parameters are going

  • to kind of go down to a minimum in a stochastic fashion

  • as you train more and more.
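
As a sketch of the stochastic gradient update described above, here is minibatch gradient descent on a two-parameter model. The synthetic data, the squared-error objective, the step size, and the batch size are all assumptions made for illustration.

```python
import random

random.seed(0)

# Synthetic training set (an assumption): targets follow y = 2x + 1 exactly.
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]

def batch_gradient(w, b, batch):
    """Gradient of 0.5 * (w*x + b - y)^2, averaged over a small batch."""
    grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
    grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
    return grad_w, grad_b

def sgd(data, steps=2000, batch_size=4, lr=0.1):
    """Repeatedly take one small step against the gradient of a random batch."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        batch = random.sample(data, batch_size)
        grad_w, grad_b = batch_gradient(w, b, batch)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = sgd(data)
print(w, b)  # close to the true parameters 2 and 1
```

Each step uses a noisy estimate of the gradient, so the parameters wander toward the minimum rather than heading straight to it.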

  • So now the next step you have to do

  • is compute the gradient of the objective function

  • with respect to the parameters.

  • And the way you do this is through back propagation.

  • I'm not going to go through this.

  • The mathematical concept on which it's based

  • is incredibly sophisticated.

  • It's called the chain rule.

  • [LAUGHTER]

  • And some people learn this in high school.

  • And it basically comes down to the fact

  • that if you have-- if you arrange parametrized functions

  • in a graph of computation, which in this case

  • is a very simple one.

  • It's just a linear stack of modules.

  • But it doesn't need to be such a simple graph.

  • It could be any graph.

  • And you [? take ?] connection by propagating signals

  • backwards through this graph.

  • Basically taking the gradient of some cost

  • function you want to minimize with respect

  • to this red variable.

  • And so this gradient is represented

  • by this green variable.

  • And multiplying it by the Jacobian of this box,

  • you get the gradient with respect to the input of that box.

  • This is chain rule.

  • So it's this guy here.

  • Gradient with respect to the input

  • equals gradient with respect to the output

  • multiplied by Jacobian.

  • Very easy.

  • And so you propagate this backwards through the graph.
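
The rule "gradient with respect to the input equals gradient with respect to the output multiplied by the Jacobian" can be sketched for a small stack of boxes. The two-layer network, its weights, and the squared-error cost below are assumptions for illustration; the finite-difference check is only there to confirm the sketch.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose_matvec(m, v):
    """Gradient at the output of a linear box times its Jacobian (the matrix)."""
    return [sum(m[i][j] * v[i] for i in range(len(m))) for j in range(len(m[0]))]

def forward(w1, w2, x):
    z = matvec(w1, x)                # first linear box
    h = [max(0.0, zi) for zi in z]   # pointwise nonlinearity
    y = matvec(w2, h)                # second linear box
    return z, h, y

def loss(y, target):
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))

def backward_input(w1, w2, x, target):
    """Propagate the gradient backwards, one box at a time."""
    z, h, y = forward(w1, w2, x)
    grad_y = [yi - ti for yi, ti in zip(y, target)]               # dL/dy
    grad_h = transpose_matvec(w2, grad_y)                         # through box 2
    grad_z = [g if zi > 0 else 0.0 for g, zi in zip(grad_h, z)]   # through ReLU
    return transpose_matvec(w1, grad_z)                           # through box 1

def numeric_grad_x(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only to check the result."""
    grads = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        lp = loss(forward(w1, w2, xp)[2], target)
        lm = loss(forward(w1, w2, xm)[2], target)
        grads.append((lp - lm) / (2 * eps))
    return grads

w1 = [[0.5, -1.0], [1.5, 2.0]]
w2 = [[1.0, -0.5]]
x = [1.0, 0.25]
target = [0.0]

analytic = backward_input(w1, w2, x, target)
numeric = numeric_grad_x(w1, w2, x, target)
print(analytic, numeric)
```

The same backward pass, with a second Jacobian per box for the parameters, gives the gradients used for learning.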

  • And the cool thing about this is that you can do this

  • automatically by having a bunch of modules of this type

  • that have been predefined.

  • And you assemble them in a graph.

  • And then automatically you get a gradient back.

  • You don't have to figure out how to compute it.

  • So that's what all of those deep learning frameworks

  • can allow you to do.

  • They're very simple to use.

  • Our favorite one is called PyTorch.

  • And you know, there's several Jacobians

  • for each of those boxes.

  • One that propagates through the input,

  • others that propagate through the parameters.

  • And that allows you to compute all the gradients

  • of the objective function, or whatever

  • you want to minimize with respect to all the parameters.

  • So, OK, back prop.

  • That's an old idea.

  • The basic idea of it actually goes back

  • to Leibniz and Newton, obviously.

  • But more recently, the people in optimal control

  • actually have used things like this

  • called the adjoint state methods or adjoint system

  • methods for optimal control that was invented in the 60s.

  • That's what NASA used to compute rocket trajectories

  • and things of that type.

  • And it wasn't used for learning.

  • It was used for optimal control.

  • But it is a very similar idea.

  • So we think of those variables as being

  • kind of control variables of a rocket,

  • and this being kind of the trajectory

  • of the rocket, if you want.

  • And then people realized you could use this

  • for learning in the late 70s, early 80s,

  • but never quite actually made it work.

  • And it started being used in the late 80s essentially.

  • And that's when the first wave of neural nets--

  • or the second wave of neural nets took off.

  • It was around 1986, 1987 when people

  • realized you could train [? multi ?] neural

  • nets with this.

  • And then it died in the 90s, the mid 90s.

  • OK.

  • So the next question you can ask is those linear operators

  • are nice.

  • But you know, if my image is a long vector with millions

  • of pixels, I'm not going to multiply

  • by a matrix that's several million by several million.

  • So you have to organize those linear operators

  • in ways that make them practical for things like images

  • or high dimensional inputs.

  • That's where the idea of convolutional nets comes in.

  • It actually doesn't come from sort of theoretical hypotheses.

  • But it was actually inspired by biology.

  • So I know there are neuroscientists in the room.

  • So this is inspired by Hubel and Wiesel, 1962.

  • Very classical work in neuroscience,

  • Nobel Prize winning work.

  • There were models of--

  • computational models of these most basic ideas

  • of Hubel and Wiesel, by Fukushima, and his neocognitron

  • model, that was the inspiration

  • for convolutional nets.

  • And the basic idea is that, in the visual cortex, and this

  • is something you can derive from first principles,

  • it's probably a good idea for images to be

  • able to detect local features by basically having a template

  • that you match with the input.

  • And you get a score for how well this thing matches

  • with this one, basically a dot product, the weighted sum

  • of those pixels by those coefficients.

  • And then you swipe this over the image everywhere.

  • And the results are recorded in something

  • we call a feature map here.

  • And that operation is a discrete convolution.
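
A minimal sketch of sliding a template over an image and recording the matching scores in a feature map. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation. The tiny image, the edge-detecting kernel, and the function name `correlate2d` are illustrative assumptions.

```python
def correlate2d(image, kernel):
    """Slide the kernel over the image; each output is a dot product (a score)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            score = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(score)
        feature_map.append(row)
    return feature_map

# Dark region on the left, bright region on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Template that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = correlate2d(image, vertical_edge)
print(feature_map)  # [[0, 2, 0], [0, 2, 0]]: the score peaks where the edge is
```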

  • But it's very similar to the kind of operation

  • you see, what's called simple cells in the visual cortex

  • do on images, where a particular neuron in the visual cortex

  • is connected to a local neighborhood

  • in the visual field.

  • And sort of detects local features as well.

  • So that's what this first layer is doing.

  • So these are multiple filters.

  • These are the convolutional kernel, [INAUDIBLE]

  • filter applied to this image by use of those maps.

  • And then you do what's called a pooling operation where

  • you take the result, like a local patch

  • of those results of filtering after the non-linearity.

  • And you compute an average or a max or L2 norm,

  • or something like this.

  • And you subsample the results so that the windows

  • over which you compute this aggregation

  • are shifted by more than one pixel.

  • So here they're shifted by two pixels.

  • So you get a map that's half the resolution of this one.
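
Pooling with subsampling can be sketched as follows, here with a max over 2 by 2 windows shifted by two pixels. The window size, the choice of max rather than average or L2 norm, and the example values are assumptions.

```python
def max_pool(feature_map, size=2, stride=2):
    """Take the max over local windows, moving the window `stride` pixels each time."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled

fm = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

pooled = max_pool(fm)
print(pooled)  # [[4, 2], [2, 8]]: half the resolution in each direction
```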

  • And then you repeat the process.

  • So you get convolutions again.

  • So this guy is a result of applying convolution kernels

  • to each of those maps, adding up the result,

  • passing it through a non-linearity.

  • And then again, there is pooling and subsampling.

  • So as you go up the layers, you get

  • representations that are more global and kind of more

  • abstract and etc.

  • And this is really the idea of simple cells

  • and complex cells, complex cells being those pooling areas

  • sort of a realization of this.

  • That's the-- drawing from Fukushima's paper

  • on the neocognitron where you had

  • those kind of simple cells and complex cells.

  • So this is a convolutional net.

  • This is meant to be an animation.

  • I'm not sure why it's not animating.

  • But it's not animating.

  • And not only that, it actually crashed my computer.

  • All right.

  • I'm going to have to do something very brief

  • for just a minute.

  • OK.

  • Now it works.

  • So this is an old convolutional net

  • trained in the early 90s to recognize handwriting.

  • And what you can see here is that this is the first layer.

  • That's the input.

  • So the first layer, 6 feature maps.

  • Then pooling subsampling, second layer.

  • Pooling subsampling, third layer.

  • And by the time you get here, each unit here, each pixel

  • represents the activation of a unit.

  • It basically sees the entire input, or at least

  • a square on the input.

  • And so a slice through this represents an entire character

  • essentially in sort of abstract form.

  • And the good thing we realized pretty quickly with it

  • is that we could not just use it to recognize single objects,

  • but also multiple objects.

  • And that's very important.

  • So here we-- you basically have multiple copies

  • of the same convolutional net applied to a sliding window

  • over the input.

  • And it's actually very cheap to do this.

  • You can sort of apply the convolutional net

  • convolutionally.

  • It's convolutions all the way.

  • People sometimes call this a fully convolutional net

  • now.

  • And at the output, you get a score

  • for every window and every category.

  • And here I'm just showing the winning score

  • with kind of a gray scale to indicate

  • the score of the category.

  • And then a very simple post-processing

  • pulls out the correct interpretation.

  • So here, the cool thing is that the system

  • can recognize objects without prior segmentation.

  • You don't have to separate the digits before being

  • able to recognize them.

  • And that's really important if you

  • want to be able to apply those things to natural images

  • where objects appear in the background.

  • And you can't afford to--

  • and you can't actually figure out

  • how to separate them from the background.

  • So that was kind of an important thing.

  • And then going forward a number of years,

  • about almost 10 years to 2003, someone at DARPA came up to us

  • and said, can you use machine learning, neural nets,

  • let's say, to drive robots?

  • And so we built this little truck robot here.

  • It's just a radio controlled truck

  • with two cameras, analog cameras.

  • And we had this truck being driven

  • by someone for about 20 minutes, or a total of maybe two hours.

  • And that person would be instructed

  • to drive straight and sort of veer off

  • whenever there was an obstacle.

  • And you know, he would--

  • after some training, you feed the network

  • with two images from the two cameras.

  • And then you would just train the network

  • to emulate the steering angle of the human driver.

  • And you let the robot loose.

  • And it gets through all this kind of horrible busy Jersey

  • backyard here, driving itself through these obstacles.

  • So we showed this to DARPA.

  • And they said, oh, that's great.

  • We're going to start a program called LAGR

  • and have six different teams compete.

  • That would be nice if this slide actually showed.

  • Here we go.

  • Six different teams compete.

  • They will all get the same robot.

  • And you'll train this robot to--

  • using machine learning, to figure out

  • whether it can drive over a particular area or not.

  • And so we used this convolutional net

  • that would look at bands in the image

  • and then label every pixel as to whether it's

  • traversable or not.

  • So something like this.

  • And the cool thing is that you can actually

  • get ground truth, more or less, ground truth through stereo vision.

  • So using a stereo vision system, because this robot has

  • multiple cameras, you can figure out

  • if something sticks out of the ground.

  • But that only works up to about 10 meters.

  • Beyond that it doesn't work.

  • So you trained a neural net with the labels collected

  • from stereo.

  • And then you run the neural net on the whole image.

  • And it does this.

  • It figures out where a path is essentially.

  • And it figures out here in the back

  • there is this row of obstacles in the little passage

  • way in between.

  • And so this thing kind of worked pretty well.

  • There were again, six different teams competing on this.

  • We were the only ones to use convolutional nets.

  • But again, this was 200--

  • project started in 2005 and ended 2008.

  • And so there's a fast vision system that

  • uses a stereo, a slow system that uses stereo, and then

  • a slow vision system as well that uses this neural net.

  • And then you put the result. You combine

  • all the results in a map.

  • And you can do some planning to figure out how

  • to get to a particular goal.

  • The map here is centered on the robot.

  • So it's relatively easy to plan.

  • And then the system actually trains itself as it goes.

  • It adapts, collecting labels from the stereo vision.

  • It learns how to navigate new environments it's never seen

  • before, even the pesky grad students who

  • try to annoy this poor robot.

  • [LAUGHTER]

  • The robot weighs about 100 kilos.

  • It can probably break their legs.

  • But they're pretty sure it's not going to do that, because they

  • actually wrote the code.

  • This is-- and they trained it.

  • This was Raia Hadsell, who at that time

  • was a PhD student with me, who now leads the Robotics Research

  • Group at DeepMind.

  • And Pierre Sermanet, who is at Google Brain,

  • also working on robotics.

  • So a couple of years later, we realized

  • we could use the same kind of technology

  • for not just labeling pixels in an image as to whether it's

  • traversable or not, but also labeling them with categories.

  • And some datasets started to appear that allowed to train,

  • you know, maybe with a couple thousand

  • images, that allowed to train the convolutional net to do

  • this.

  • So again, this is a convolutional net

  • applied to the whole image.

  • Each output of the convolutional net is influenced by a window

  • on the input, which is something like 40 by 40 pixels

  • at high resolution and 90 by 90 pixels

  • at half, and 180 by 180 pixels at quarter resolution.

  • So it sees a big context to make a decision for a single pixel.

  • But it kind of makes a decision for every pixel.

  • And the cool thing about this is that we

  • can run this in real time.

  • So this was implemented on what's

  • called an FPGA, which is sort of a programmable hardware.

  • And it could run at about 20 frames per second classifying

  • to 33 categories.

  • And it wasn't-- far from perfect.

  • You know, it classified those areas here as sand or desert.

  • And this is the middle of Manhattan.

  • So there's no sand I'm aware of.

  • And it worked pretty well.

  • So we submitted a paper to CVPR in 2011.

  • And it was soundly rejected.

  • And the reviewer comments were either what the hell

  • is a convolutional net?

  • Or how is it possible that you get so good results

  • with a technique we've never heard of?

  • So it's kind of funny.

  • So we afterwards submitted it to ICML where it was accepted.

  • And so the funny thing is back in 2011,

  • you couldn't get a paper accepted at a computer vision

  • conference if you use neural nets.

  • Now you cannot get a paper accepted at CVPR unless you

  • actually use convolutional nets.

  • So there was a complete revolution over the next few years.

  • So that gave some ideas to a few people

  • working on self-driving cars around that time, around 2013-14,

  • where they realized they could use

  • those kind of convolutional net based semantic segmentation

  • techniques to label every pixel in an image

  • as to whether it's traversable or not, or as to whether it's

  • a pedestrian or a road or something like this.

  • So this is some work at Nvidia.

  • This is work at Mobileye.

  • Which now belongs to Intel.

  • And this is a system that--

  • Mobileye produces systems that were used in the Tesla cars

  • for autonomous driving until mid 2016.

  • Then the two companies divorced.

  • They weren't agreeing with each other somehow.

  • So now Tesla is developing its own system.

  • Nvidia has a big project on this, which I may come back to.

  • And then around 2012, the big revolution occurred.

  • And what that was is the use of very large convolutional nets

  • implemented on GPUs to run really efficiently

  • and train on large datasets like the ImageNet dataset

  • that has a million training samples, 1,000 categories.

  • And it turns out those things work really,

  • really well when you have lots of categories

  • and lots of training samples.

  • And when you make them big.

  • And so the first to really make an efficient implementation

  • of those networks on GPUs were Geoff Hinton

  • and his students, Alex Krizhevsky and Ilya Sutskever.

  • And they had presented the result at an ImageNet workshop

  • at ECCV in Fall 2012.

  • And then had a paper at NIPS in Winter 2012.

  • And that basically made the computer vision

  • field completely change, and basically jump

  • started the deep learning revolution.

  • That revolution had started in speech recognition

  • a couple of years earlier.

  • And the interesting thing about this

  • is that we ended up seeing an inflation

  • in the number of layers that are used

  • by those convolutional nets.

  • So this is the VGG network, which

  • was one of the top performing in 2013.

  • GoogLeNet-- no, this was 2013.

  • Then GoogLeNet in 2014, which had even more layers.

  • And then ResNet.

  • [INAUDIBLE] He and his collaborators from Microsoft

  • Research Asia had this idea of having skip connections

  • that basically solved for the problem that sometimes,

  • when you train a very deep neural net, some of the layers

  • die.

  • The weights don't go anywhere.

  • That kills the entire thing.

  • So they use those skip connections

  • to prevent the catastrophic bad things happening

  • if some layers died.
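
A minimal sketch of why a skip connection helps when a layer "dies". The block adds its input back to the layer's output, so a layer with all-zero weights no longer wipes out the signal. The function names and the all-zero weight matrix are assumptions for illustration, not the actual ResNet architecture.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def relu(v):
    return [max(0.0, vi) for vi in v]

def plain_block(weights, x):
    """An ordinary layer: whatever the layer outputs is all that gets through."""
    return relu(matvec(weights, x))

def residual_block(weights, x):
    """The same layer with a skip connection: output = input + layer(input)."""
    fx = relu(matvec(weights, x))
    return [xi + fi for xi, fi in zip(x, fx)]

# A "dead" layer whose weights contribute nothing.
dead = [[0.0, 0.0],
        [0.0, 0.0]]
x = [1.0, -2.0]

plain_out = plain_block(dead, x)        # the signal is lost
residual_out = residual_block(dead, x)  # the input passes through unchanged
print(plain_out, residual_out)
```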

  • And that turned out to be a very, very good idea that

  • seems incredibly efficient.

  • But in fact, it works really, really well.

  • And so you can train neural nets with 50 layers,

  • 100 layers, 150 layers.

  • And they work really well.

  • There's sort of a more modern version of this.

  • One version called DenseNet, which

  • is a collaboration between people at FAIR

  • and people at Cornell, which is sort of a version of this

  • that is designed to run efficiently, et cetera.

  • And so one question you might ask

  • is, why do we need all those layers?

  • Right? Theoretically, you can approximate any function

  • with only two layers.

  • Why do you need many layers?

  • And you know, one possibility is the fact

  • that the world is compositional.

  • Images are basically composed of pixels.

  • And pixels form together, arranged together

  • to form things like edges and colored blobs,

  • and stuff like that.

  • And then by detecting combinations of those,

  • you can detect things like circles and corners

  • and gratings.

  • And then combinations of those form parts of objects.

  • And combinations of those form objects, et cetera.

  • So there is this kind of hierarchical nature

  • of the perceptual world which is sort of captured

  • by those layered architectures.

  • So we used to take weeks to train those networks.

  • And now we can train one of those networks

  • with basically state of the art performance in about an hour.

  • On a very large machine with 250 GPU cards in it.

  • It's actually multiple machines.

  • Each machine has 8 GPUs.

  • And you stack them up.

  • So you can do these kind of things

  • if you are at Facebook or at Google.

  • A little more difficult in university environment.

  • But here are some more recent results on computer vision.

  • So this is a bit of a snapshot of the state of the art.

  • This is a model called Mask R-CNN, which

  • is a system that does not just semantic segmentation,

  • but instance segmentation.

  • So I'm not going to bore you with all the details.

  • I'm just going to tell you that it beats

  • all the records on some standard datasets like COCO.

  • And here's an example of a result you can do.

  • So again, it's essentially conceptually very simple,

  • a convolutional net with some sort of system

  • that sort of detects regions of interest and then

  • applies a slightly more complex convolutional net

  • on those regions of interest.

  • And the output of the network is not just a category,

  • but it's a category, the coordinates of a bounding box,

  • and an image of a mask of the object at the same resolution

  • as the input.

  • And so you get for every object, you get the category,

  • you get the mask of the person or the object,

  • and you get a bounding box.

  • And it detects baseball, the dog, the individual people,

  • even though they all overlap.

  • So this is instance segmentation, not just

  • semantic segmentation.

  • Semantic segmentation it would have just one big blob here

  • labeled people.

  • You can detect wine glasses and wine bottles, very important

  • for French people, computers, you know, et cetera.

  • Backpacks, umbrellas, sheeps, you can count sheeps.

  • You know, overlapping cars, things like that.

  • It works amazingly well.

  • It's also trained to detect key points on human bodies.

  • So you can infer the body pose

  • of people in photos and videos.

  • There's actually-- there's more of this which I can't show you.

  • But it actually runs at 5 frames per second on a smartphone.

  • So it's a scaled down version of this.

  • And then there were kind of new applications

  • of this for convolutional net for 3D data.

  • So this is a recent competition called ShapeNet

  • where the dataset consists of 3D objects represented by point

  • clouds from a depth sensor.

  • And it's been manually segmented into regions or parts.

  • And the goal here is to essentially label every region

  • with the correct label.

  • And what turned out to win this recent competition

  • was a 3D convolutional net produced by Ben Graham

  • and Laurens van der Maaten.

  • So this is the original paper that

  • describes the idea of a sparse 3D convolutional net.

  • And there's some other contributors to the system.

  • It's a library you can download.

  • It's basically the idea of sort of only doing convolutions

  • in areas where you have populated voxels,

  • because in a 3-D environment, most of the voxels are empty.

  • So you don't want to be computing convolutions

  • everywhere where there is nothing.

  • So you just follow the areas where there is something.
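
One simple reading of this idea can be sketched by storing only the populated voxels in a dictionary and computing outputs only at those sites. This is not the API of the library mentioned in the talk; `sparse_conv3d`, the dictionary layout, and the all-ones kernel are hypothetical choices for illustration.

```python
from itertools import product

def sparse_conv3d(active, kernel):
    """Convolve only where there is something.

    active: dict mapping (x, y, z) -> value, for populated voxels only.
    kernel: dict mapping offset (dx, dy, dz) -> weight.
    Outputs are computed at populated sites; empty neighbours are skipped.
    """
    out = {}
    for (x, y, z) in active:
        total = 0.0
        for (dx, dy, dz), weight in kernel.items():
            neighbour = (x + dx, y + dy, z + dz)
            if neighbour in active:
                total += weight * active[neighbour]
        out[(x, y, z)] = total
    return out

# A 3x3x3 kernel with all weights equal to one (it sums the neighbourhood).
kernel = {offset: 1.0 for offset in product((-1, 0, 1), repeat=3)}

# Three populated voxels in a grid that could be arbitrarily large.
active = {(0, 0, 0): 1.0, (1, 0, 0): 2.0, (50, 50, 50): 4.0}

result = sparse_conv3d(active, kernel)
print(result)
```

The work done is proportional to the number of populated voxels, not to the volume of the grid.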

  • And it turns out to be much faster and easier to train.

  • And they actually won the competition with this technique.
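
The idea of convolving only where voxels are populated can be sketched in a few lines of plain Python. This is a minimal illustration, not the API of the library mentioned in the talk: the function name `sparse_conv3d`, the dictionary-based storage, and the choice to produce outputs only at occupied sites are all assumptions made for the example.

```python
from itertools import product


def sparse_conv3d(active, kernel):
    """Convolve only at populated voxels.

    active: dict mapping (x, y, z) -> feature value for non-empty voxels.
    kernel: dict mapping (dx, dy, dz) offset -> weight.
    Returns a dict with one output per occupied site.
    """
    out = {}
    for (x, y, z) in active:
        total = 0.0
        for (dx, dy, dz), weight in kernel.items():
            neighbor = (x + dx, y + dy, z + dz)
            if neighbor in active:  # empty voxels contribute nothing
                total += weight * active[neighbor]
        out[(x, y, z)] = total
    return out


# A 3x3x3 kernel with all weights equal to 1 (sums the neighborhood).
kernel = {offset: 1.0 for offset in product((-1, 0, 1), repeat=3)}

# Three occupied voxels in what could be a very large, mostly empty grid.
voxels = {(10, 10, 10): 1.0, (10, 10, 11): 2.0, (50, 50, 50): 4.0}

result = sparse_conv3d(voxels, kernel)
```

Here the work is proportional to the number of occupied voxels (3 sites times 27 offsets), regardless of how large the surrounding empty volume is.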

  • Another application of convolutional nets

  • that's more research is a system that's actually

  • deployed at Facebook that uses convolutional nets

  • for translation, language translation.

  • So you feed a sentence in English.

  • And it goes through a bunch of convolutions.

  • And it's actually a gated convolutional network.

  • So those are gated linear units, which I'm not going

  • to go into the details of.

  • There is pointwise multiplication going on here.
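
For readers who want the detail that is skipped here, a gated linear unit in its usual formulation multiplies one linear path pointwise by a sigmoid of a second path. The sketch below assumes that standard formulation; in a gated convolutional network both inputs would come from convolutions over the sentence, whereas here they are plain lists chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_linear_unit(values, gates):
    """Pointwise product of a linear path with a sigmoid gate."""
    if len(values) != len(gates):
        raise ValueError("values and gates must have the same length")
    return [v * sigmoid(g) for v, g in zip(values, gates)]


# A gate near 0 lets half the value through, a large positive gate lets
# almost all of it through, and a large negative gate blocks it.
out = gated_linear_unit([2.0, -1.0, 0.5], [0.0, 10.0, -10.0])
```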

  • And then it goes into this kind of a weird alignment

  • system that basically produces sort of German words,

  • word by word, and then kind of lines them up

  • in an appropriate way.

  • And so, it's very fast.

  • It's very efficient.

  • It works really well.

  • And this is what is used for some--

  • for translating from some pairs of languages on Facebook.

  • Facebook can translate 2000 pairs of languages.

  • A number of them are translated using old style phrase based

  • statistical methods.

  • A number of them are translated using recurrent neural nets.

  • And then a small number of them are

  • translated using this system, which

  • is now being trained on more and more language pairs.

  • So a lot of the research that we do at FAIR-- in fact

  • all of it is open.

  • We publish everything we do, generally

  • very quickly on arXiv.

  • And we also publish most of our code as open source and so forth.

  • So these are a few examples of some of stuff we've deployed.

  • We've distributed open source.

  • I would single out PyTorch.

  • This is a deep learning framework with a Python front

  • end.

  • It is very simple to use.

  • It's very good for research.

  • It's more transparent than TensorFlow.

  • OK.

  • And there's of course a lot of applications

  • of those things to medical imaging,

  • of course, and things like that, which

  • I'm not personally working on.

  • But a lot of my colleagues are.

  • But what's missing about this is two things.

  • One is, how do we learn reasoning and memory and things

  • like this?

  • And the second one is, how do we learn general things

  • that animals and humans can learn

  • without being told the name of everything,

  • without being given labeled data.

  • So this is a work by a bunch of people from Facebook AI

  • research in Menlo Park in California.

  • Justin Johnson was an intern at Facebook from Stanford.

  • And Fei-Fei Li, his advisor.

  • And the idea here is can we use deep learning to do things

  • like visual reasoning?

  • So could we answer questions like this one.

  • Is there a matte cube that has the same size

  • as the red metal object.

  • So you have to read this a few times and sort of figure

  • out really what operation you have to do here.

  • And so the idea they come up with is very cool.

  • You take the question.

  • Are there more cubes than yellow things?

  • You feed this through a recurrent neural net

  • that represents this as essentially

  • a single vector of fixed size.

  • And then you run this through another recurrent net

  • that spits out a kind of a representation of a computation

  • graph.

  • Think of it as a visual program, which

  • basically gets instantiated in this graph that has one block.

  • Those are actually trainable blocks.

  • OK.

  • They're all the same architecture.

  • So one block that is supposed to figure out--

  • filter all the objects that are yellow.

  • And another one that filters out the cubes.

  • One block that counts how many yellow things there are.

  • This one counts how many cubes there are.

  • And then it compares the two.

  • And then figures out the answer.

  • Right.

  • And so you don't predefine what those blocks should do.

  • You initialize it a little bit by heavy supervision,

  • by specifying what the program here should be,

  • and which blocks should be assembled,

  • even though the blocks are not trained initially.

  • And then you backpropagate the gradients

  • to get the right answer through this whole thing, including

  • the convolutional net.

  • And eventually this thing figures out

  • what those blocks should do.

  • Of course, we'll need to reach all those keywords.

  • And learn how to do reasoning.

  • But the interesting thing about it

  • is that it's completely dynamical.

  • You change the question, it's going to change the graph.

  • So the graph that you propagate gradient through changes

  • every time.

  • And that's why the dynamic graphs are so important in deep

  • learning nowadays.

  • People are so excited about it for things

  • like natural language understanding.

  • So dynamic graphs is the situation

  • where the computational graph that you

  • use to compute your answer changes when the data changes.
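
A toy version of a graph that changes with the question might look like the following. It is only a sketch: the blocks here are hand-written Python functions standing in for the trainable modules described above, and `build_program` is a hypothetical lookup standing in for the recurrent net that emits the program.

```python
SCENE = [
    {"shape": "cube", "color": "red"},
    {"shape": "cube", "color": "yellow"},
    {"shape": "sphere", "color": "yellow"},
    {"shape": "cube", "color": "blue"},
]


# Hand-written stand-ins for the trainable blocks.
def filter_by(attribute, value):
    return lambda objects: [o for o in objects if o[attribute] == value]


def count(objects):
    return len(objects)


def run(chain, scene):
    """Feed the scene through a chain of blocks, one after another."""
    value = scene
    for block in chain:
        value = block(value)
    return value


def build_program(question):
    """Toy program generator: the graph it returns depends on the question."""
    if question == "Are there more cubes than yellow things?":
        cubes = [filter_by("shape", "cube"), count]
        yellow = [filter_by("color", "yellow"), count]
        return lambda scene: run(cubes, scene) > run(yellow, scene)
    if question == "How many yellow things are there?":
        yellow = [filter_by("color", "yellow"), count]
        return lambda scene: run(yellow, scene)
    raise ValueError("unsupported question")


answer_1 = build_program("Are there more cubes than yellow things?")(SCENE)
answer_2 = build_program("How many yellow things are there?")(SCENE)
```

The first question assembles two filter-and-count branches and a comparison; the second assembles a single branch. In a trained system, gradients would flow through whichever graph was assembled for that input.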

  • There's actually more recent work

  • along those lines by Aaron Courville at University

  • of Montreal, where they don't actually

  • have to specify a program like this.

  • You just stack multiple blocks.

  • And it just works.

  • It's pretty cool.

  • OK.

  • So for those statisticians in the room,

  • since I've been invited by bio-statisticians,

  • deep learning breaks all the basic rules of statistics.

  • I mean, not all of them, but some of them, right.

  • So the models are enormous, often

  • with many, many more parameters than there are training samples.

  • I mean, so take one of those convolutional nets

  • for ImageNet.

  • There is 1 million training samples.

  • Some of those models have 100 million parameters.

  • And they still work quite well.

  • They can often nail the training set perfectly.

  • And often there is no explicit regularization.

  • But it still works.

  • How is that possible?

  • The loss function is very highly non-convex.

  • It's got a ridiculously large combinatorial number

  • of saddle points.

  • But still, you pretty much get the same result every time you

  • train.

  • What it tells you is that maybe there are local minima,

  • but they're all pretty much equivalent.

  • And in fact, there are experiments

  • that seem to suggest they're all connected.

  • There is only one local minimum basically.

  • I mean, not one.

  • But essentially one.

  • Little attention is paid to managing uncertainty

  • beyond using very simple things like softmax

  • on the output when you do classification.

  • But there's a lot of effort spent on computational issues.

  • Like efficiently implementing all those things, and all

  • that stuff.

  • So it's sort of very much unusual.

  • It breaks the rules you see in textbooks,

  • in statistical textbooks.

  • And that might be a reason why some people who are more

  • theoretically oriented had initially a lot of skepticism

  • towards neural nets.

  • OK.

  • But let me switch to kind of the point I really

  • want to make about with this talk, which

  • is, where do we go from there?

  • OK.

  • So deep learning works very well.

  • There's a lot of applications we can use it for.

  • Even if we don't do any research anymore,

  • just with the technique that we've developed so far,

  • there's probably a lot of different industries

  • that are going to be affected by it that we can apply this to.

  • In fact, there's something that Andrew Ng said recently.

  • Stop doing research.

  • Just apply the stuff that we already know.

  • I don't think it's a good idea.

  • But I don't think he believes it completely either.

  • But it was interesting of him to say this.

  • So what are the obstacles really to making significant progress?

  • Because as I said before, all the stuff you see,

  • that's not real AI.

  • And our machines do not learn with the same kind

  • of efficiency that we observe animals and humans learning

  • with.

  • So how do we get machines to learn

  • how the world works, learn common sense

  • or something like this?

  • So that would ask the question going back to the inspiration

  • from biology, does the brain use a learning algorithm?

  • Or does it use 50 learning algorithms?

  • Or maybe 200?

  • Or maybe it's complete [INAUDIBLE],

  • the result of evolution.

  • There's no underlying principle behind it.

  • It's just a result of millions of years of evolution.

  • How much prior structure does animal or human learning

  • require for intelligence to emerge

  • in a reasonable amount of time?

  • All the learning algorithms that people in machine learning

  • have come up with in statistics minimize

  • some sort of objective function, or optimize

  • some sort of objective function, I should say.

  • Does the brain optimize an objective function?

  • What would that function be?

  • If it optimizes a function, does it

  • do it by evaluating a gradient?

  • If it evaluates a gradient, how does it do it?

  • It probably doesn't do backprop in the way

  • that we understand it today.

  • And how does it handle uncertainty

  • in prediction, which I think is a crucial issue?

  • So all kinds of questions like this that connect

  • AI and machine learning with neuroscience really.

  • And one big missing ingredient in AI, or maybe a holy grail,

  • is common sense.

  • There's a subarea of AI called commonsense reasoning.

  • It's not actually a solution to a problem.

  • It's more of a problem.

  • And it's a question of how do we get machines

  • to acquire common sense.

  • So common sense is everyday--

  • the commonsense of everyday thing.

  • That supported-- unsupported objects fall.

  • That some objects are stable.

  • And some are not.

  • If I let this guy go, it's going to fall,

  • even if I put it briefly vertically.

  • If I take this object, I hide it behind my computer,

  • you still know it's here.

  • It hasn't disappeared.

  • So object permanence.

  • So those things we learn.

  • How do we learn the structure of the world?

  • And one hypothesis perhaps is that our brains

  • are prediction machines.

  • They learn to predict all the missing information

  • from whatever is available today at this time.

  • And then time passes by.

  • Or you move your head, or whatever.

  • And new information becomes available.

  • And that allows you to train your world model

  • with the new information.

  • So if I want to learn that the world is three dimensional,

  • I'm going to learn it because it's

  • the best explanation for how the world changes

  • when I move my head.

  • My view of the world changes when

  • I move my head side to side.

  • And the best explanation for how it changes

  • is the notion of depth.

  • So necessarily, if my brain is trained to predict

  • what the world is going to look like when I move my head,

  • it's going to have to somehow represent the notion of depth.

  • Same way if I want to predict--

  • if I let this go and I stop the movie right there,

  • then I ask the machine, ask my brain

  • what's going to happen next?

  • It's going to predict this guy is going to fall--

  • he's going to fall down, of course, because of gravity.

  • So it just needs to wait for time

  • to pass by to train itself to see

  • if its prediction was correct.

  • So that would be predictive learning.

  • But predicting-- learning to predict

  • is not just predicting the future

  • from the present and the past.

  • It might be also predicting what the blind spot of a retina

  • contains without even looking.

  • So if you fixate on a particular place,

  • there is a particular spot in your visual field where you're

  • essentially blind because that's where

  • your optic nerve punctures

  • through your retina.

  • You don't see anything there.

  • But you don't realize it, because your brain

  • fills it up essentially.

  • So things like filling the visual field

  • of the retinal blind spot, filling occluded images,

  • missing segments in speech, predicting

  • the state of the world from partial textual description,

  • predicting the consequences of your action,

  • predicting sequences of action leading to a result.

  • I mean, all of those are fill in the blanks, if you want.

  • And common sense, I would surmise,

  • is the ability to fill in the blanks

  • through the construction of world models.

  • Object permanence is something babies learn around

  • the age of two or three months.

  • And which is why peekaboo is so funny for little babies,

  • because you can disappear when you hide your face.

  • So here's a baby orangutan here.

  • It's being shown a magic trick.

  • The guy put an object in the cup.

  • And then he shakes the cup.

  • It takes the object out without showing the orangutan.

  • And then shows the inside cup.

  • And the cup is empty.

  • And the orangutan rolls on the floor laughing.

  • OK.

  • That obviously broke his world model, that objects--

  • there's object permanence.

  • Objects don't disappear like that.

  • And you know, one of three things

  • can happen when your world model is broken, you laugh.

  • It's really funny.

  • It's really interesting, you pay attention,

  • because your world model is wrong.

  • So you need to learn a new world model basically,

  • because of this new data that you predicted wrongly.

  • Or something really dangerous might

  • happen that you didn't predict.

  • And so you're scared.

  • So that's what happens when your world model is working.

  • So I think-- how do we do this in a machine?

  • How do we get them to learn all those things about the world?

  • Learn gravity?

  • So if you show a baby-- these are special slides

  • I borrowed from Emmanuel Dupoux, who

  • is a cognitive scientist, developmental cognitive

  • scientist in Paris at École Normale Supérieure.

  • And if you do an experiment like this,

  • you take this little car here.

  • And you put it on this support.

  • And you push it.

  • And it goes off, and it doesn't fall.

  • Of course, it's held in the back.

  • But the baby doesn't see that.

  • Before six months, the baby says, yeah, sure.

  • That's the way the world works.

  • Fine.

  • No problem.

  • After eight months, they go like this.

  • You know, they open their eyes.

  • And they fixate.

  • And they say, what's going on?

  • And they don't say, what's going on,

  • obviously because they can't talk.

  • But you know, they look like they're saying,

  • what's going on.

  • And so with this kind of technique,

  • by basically measuring how long you know babies

  • fixate and observe and open their eyes like crazy,

  • you can figure out at what stage babies learn things.

  • And again, this is from Emmanuel Dupoux.

  • So things like object permanence you learn pretty quickly.

  • Biological motion, the fact that there

  • are objects that move by themselves,

  • others that are inanimate.

  • You know, you learn that by three months.

  • Objects that are rigid or not.

  • Different types of natural categories, chairs, tables cars

  • etc.

  • Stability and support.

  • And sort of basic intuitive physics, gravity, inertia,

  • conservation of momentum.

  • That arrives around 8 months, roughly

  • between six and eight months.

  • And there's a bunch of other things like that

  • happen at various stages.

  • And this is not learned in supervised mode.

  • It's not like, babies are told the name of objects.

  • It's not like they are directed in any way for any of this.

  • They basically learn this by observation.

  • They're really not well-developed in sort

  • of motor control either.

  • So they don't get to do a huge amount of interaction

  • with the world.

  • So there's no way this can be learned through interaction,

  • by some sort of direct reinforcement learning.

  • There's another mechanism going on there

  • where you learn how the world works by observation.

  • And that's the piece we're missing in our current machine

  • learning and AI systems.

  • So in fact, I need to apologize in advance to Michael.

  • But he knows what I'm going to show, so--

  • There's three sort of paradigms of learning, right.

  • There is a reinforcement learning,

  • where basically the machine at each trial

  • is given a scalar value to tell it whether it did well enough

  • or not.

  • So this works great for games.

  • Machine does an action.

  • And it either gets a reward or not.

  • Or sometimes it has to make a whole sequence of action

  • before it gets a reward.

  • And it works great when it's combined with deep learning.

  • The problem is that it requires a huge amount

  • of training samples, an enormous amount of training samples.

  • It's because the amount of information

  • you give to the machine is extremely small at every trial.

  • It's very weak.

  • It's a small amount of information.

  • Therefore, you need to do this many, many times for it

  • to learn anything complicated.

  • Supervised learning, you need a little less samples,

  • because you give more information every time.

  • You give it the correct answer.

  • And so if there are a dozen categories,

  • that's more than just a single scalar value.

  • So you need fewer samples to learn similarly complex tasks.

  • And then the predictive learning or unsupervised learning,

  • you ask the machine to predict basically every variable

  • from every other variable: every future variable from every present variable

  • or past variable, or every unseen variable from every seen

  • variable.

  • And so there is a lot more information

  • you ask the machine to predict.

  • And that's why probably you can learn

  • a lot more about the structure of the world this way.
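
The comparison of how much feedback each paradigm provides per trial can be put in rough numbers. The figures below are illustrative upper bounds under stated assumptions (a binary reward, twelve equally likely categories, and a hypothetical 64x64 frame with 256 gray levels per pixel); real signals carry less information because their values are correlated.

```python
import math


def bits_reinforcement(reward_levels=2):
    """Upper bound on the information in one scalar reward."""
    return math.log2(reward_levels)


def bits_supervised(num_categories):
    """Upper bound on the information in one class label."""
    return math.log2(num_categories)


def bits_predictive(num_values, levels_per_value):
    """Upper bound on the information in one predicted observation."""
    return num_values * math.log2(levels_per_value)


rl_bits = bits_reinforcement()                  # success / failure
supervised_bits = bits_supervised(12)           # "a dozen categories"
predictive_bits = bits_predictive(64 * 64, 256)  # one small grayscale frame
```

Under these assumptions the three signals carry at most 1 bit, about 3.6 bits, and 32,768 bits per trial respectively, which is the gap the argument above relies on.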

  • So that led me to this completely obnoxious slide,

  • which I have to show in every slide-- in every talk now.

  • The analogy between intelligence and chocolate cake,

  • where the [INAUDIBLE] of the cake

  • is basically unsupervised or predictive learning,

  • because that's where the bulk of the information goes.

  • The bulk of the information given to the machine

  • is really in that mode of learning.

  • And then the icing on the cake is supervised learning.

  • There is considerably less information

  • provided to the machine per trial in supervised mode.

  • And in reinforcement mode there is very little information

  • given to the machine.

  • So that's going to be equivalent to the cherry on the cake.

  • And I've been showing this-- the first time

  • I showed this slide was actually giving a talk at DeepMind,

  • where DeepMind is actually the temple of reinforcement

  • learning.

  • So it was sort of obnoxious on purpose, a little bit.

  • But now I kind of fell into that obsession

  • of showing it in every talk.

  • So the problem with reinforcement learning,

  • with pure reinforcement learning, and Michael

  • will correct me if I'm wrong, is that if you use it

  • in its purest form, you need so many trials to learn

  • any kind of complex behavior that if you were

  • to train a self-driving car to drive, and to learn to not

  • run off a cliff, it would have to run off a cliff

  • about 50,000 times before it figures out it's a bad idea.

  • And then another 50,000 times before it

  • figures out how not to run off a cliff.

  • And you know it's half of a joke, which is why--

  • I mean, that's the reason why it works really well for games,

  • because you can run games very quickly

  • on many computers at the same time

  • and at many thousands of frames per second.

  • But it doesn't really work in the real world,

  • because you cannot run the real world faster than real time.

  • That's a thing that sucks about the world.

  • And then anything you do in the real world can kill you,

  • like running off cliffs.

  • Maybe it's a good thing that we can't run the real world faster

  • than real time.

  • So perhaps what we need is build models of the world

  • that we can run faster than real time,

  • and that we can run without the risk of killing ourselves.

  • And that would be predictive models.

  • If we ever were to predict before we run off

  • a cliff that we're going to run off a cliff,

  • we would not run off a cliff.

  • And perhaps, that's the way we learn to drive.

  • We know not to get off the road, because we

  • know bad things will happen if that's the case.

  • Reinforcement learning works really well for games.

  • And there was a smashing demonstration

  • of how well this works for Atari games

  • and Go and Doom, and not yet StarCraft, that's

  • very much work in progress at FAIR and DeepMind

  • and various other places.

  • It's very complicated.

  • But you know, it works really well.

  • And the latest AlphaGo Zero is pretty amazing in that way.

  • But again, it's a particularly simple situation

  • where the number of actions is discrete,

  • the world is completely observable,

  • and the reward is fairly clear.

  • And you can run the environment, which is a Go board,

  • at tens of thousands of frames per second essentially.

  • It works pretty well, even for games like Doom.

  • So this is a Doom competition that was

  • won by the team from Facebook.

  • And actually teams with Facebook people won two years in a row,

  • in '16 and '17 using basically deep reinforcement

  • learning techniques.

  • So we work on reinforcement learning at Facebook.

  • It's not--

  • The cake I showed--

  • I showed the cake, but you have to notice that this is

  • a black forest chocolate cake.

  • And the cherry is not optional on this cake.

  • In fact, it's got little bits of cherries

  • all around here inside.

  • [LAUGHTER]

  • OK as I said, we also work on StarCraft.

  • So StarCraft is an extremely challenging situation,

  • because there is multiple time scales.

  • There are continuous actions.

  • It's not fully observable.

  • You can't tell what your opponent is doing unless you

  • send scouts to look at it.

  • So it's very complicated in that sense.

  • We've done a little bit of reinforcement training

  • for sort of local micro-management of tactics.

  • It's actually an open source platform called ELF or MiniRTS

  • from Facebook that is basically a StarCraft like real time

  • strategy game.

  • But here is a suggestion.

  • So I said we need our machines to be able to learn

  • predictive models of the world.

  • And this idea is very old.

  • It goes back to a very old time.

  • But in particular, to one of Rich Sutton's papers

  • where he was proposing what he called the Dyna architecture.

  • And he said the main idea of Dyna

  • is the old common sense idea that planning is trying things

  • in your head using an internal model of the world.

  • And this suggests existence of a more primitive process

  • for training things not in your head,

  • but through direct interaction with the world.

  • So he said here, reinforcement learning

  • is the name we use for this more primitive and direct kind

  • of training.

  • And Dyna is the extension of reinforcement

  • learning to include a [INAUDIBLE] world model.
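
The combination described in the quote, direct learning from real experience plus planning with an internal model, can be sketched as a tabular Dyna-Q style loop. Everything concrete below is an assumption made for illustration: the five-cell corridor environment, the purely random exploration, and the hyperparameters.

```python
import random

GOAL = 4                 # right end of a corridor with cells 0..4
ACTIONS = (-1, +1)       # step left, step right


def step(state, action):
    """The 'real world': reward 1 only when the goal cell is reached."""
    nxt = min(max(state + action, 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0)


def dyna_q(episodes=30, planning_steps=30, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    model = {}  # learned world model: (state, action) -> (reward, next state)

    def update(s, a, r, s2):
        target = r + gamma * max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s = 0
        for _ in range(100):                      # cap on episode length
            a = rng.choice(ACTIONS)               # random exploration
            s2, r = step(s, a)
            update(s, a, r, s2)                   # direct reinforcement learning
            model[(s, a)] = (r, s2)               # update the world model
            for _ in range(planning_steps):       # planning "in your head"
                ps, pa = rng.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                update(ps, pa, pr, ps2)
            s = s2
            if s == GOAL:
                break
    return q


q = dyna_q()
greedy_policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
```

The planning loop replays transitions from the learned model, so each real interaction is reused many times. That reuse is the reason a model reduces the number of real trials needed.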

  • In fact, this [? domain picture ?]

  • doesn't exist today.

  • All of this is called reinforcement learning.

  • It's just that the version that has a model

  • is called model based reinforcement learning.

  • And the other one is called model free reinforcement

  • learning.

  • But it's basically the same, the same thing.

  • And this idea that you should have a world model which

  • in optimal control is called a plant simulator,

  • but it's the same thing, or a plant model.

  • But this idea that [INAUDIBLE] predictive world model

  • to be able to reason about what to do, what action to take,

  • is a really old idea in the context of optimal control.

  • So a typical situation in optimal

  • control, and you can look at classical textbooks going back

  • to the 60s, is you have a model of the world that gives you

  • the state of the world at time t plus 1

  • as a function of the state at time t.

  • And the action you take.

  • And then the state of the world is

  • sent to an objective function that

  • measures how well the state of the world is,

  • or how good it is.

  • And so you can run this model of the world.

  • And through backprop through time and gradient descent

  • figure out a sequence of commands

  • that will optimize this objective function over time.

  • And if your world simulator is differentiable,

  • you can do this through backprop and gradient descent.

  • If it's not, you have to do things [INAUDIBLE]

  • programming or something like this.
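
As a concrete, heavily simplified instance of the differentiable case, the sketch below optimizes an action sequence by backpropagating through time through a scalar linear model. The model `s[t+1] = decay * s[t] + a[t]`, the terminal cost `(s_T - goal)^2`, and the step size are assumptions chosen so the gradients can be written by hand.

```python
def plan_actions(s0, goal, horizon=10, decay=0.9, lr=0.05, iterations=200):
    """Gradient descent on an action sequence through a toy world model."""
    actions = [0.0] * horizon
    for _ in range(iterations):
        # Forward pass: roll the model out over the horizon.
        state = s0
        for a in actions:
            state = decay * state + a

        # Backward pass: propagate d(cost)/d(state) back through time.
        grad_state = 2.0 * (state - goal)      # cost = (s_T - goal)^2
        grads = [0.0] * horizon
        for t in reversed(range(horizon)):
            grads[t] = grad_state              # d s[t+1] / d a[t] = 1
            grad_state *= decay                # d s[t+1] / d s[t] = decay

        actions = [a - lr * g for a, g in zip(actions, grads)]

    # Final rollout with the optimized actions.
    state = s0
    for a in actions:
        state = decay * state + a
    return actions, state


actions, final_state = plan_actions(s0=0.0, goal=1.0)
```

With a learned neural model in place of the linear one, the same two passes would be carried out by automatic differentiation instead of by hand.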

  • So the main problem we're going to have is,

  • how do we learn this world model?

  • How do we learn a model that will allow our machine

  • to predict what the state of the world at time t plus 1

  • is going to be as a function of the state

  • at time t and our action, and perhaps actions

  • of others in the environment.

  • That's the problem of predictive or unsupervised learning.

  • And that led me to state that--

  • oops.

  • I'm not sure how that happened.

  • Apologies.

  • Wow, it went forward by like 10 slides.

  • So that led me to this statement that the next revolution in AI

  • will not be supervised.

  • I stole the concept of this slide

  • from Alyosha Efros at Berkeley.

  • And so we have to think about what

  • would be the architecture of a real intelligent system, a sort

  • of autonomous intelligence system.

  • So it would be something like this, an agent that

  • produces actions on the world.

  • And the world responds with percepts.

  • And of course, the world might be--

  • the world might not care about your action at all.

  • Or it might care only vaguely.

  • What the agent is trying to do, the agent

  • has an internal state which is sent to an objective function.

  • And the objective function produces

  • a value that basically tells the agent whether it's happy

  • or not.

  • So the objective function is a measure

  • of unhappiness of that agent.

  • You get a small value if you're happy, a large value if you

  • are unhappy.

  • So what the agent is trying to do

  • is bring the world into a state that will bring itself

  • into a mental state that basically this red function

  • identifies as happy.

  • And there are models of how animal brains are built, are

  • basically this way, where this is your entire brain,

  • except the basal ganglia.

  • And that's the basal ganglia.

  • So basal ganglia is the thing at the bottom of your brain

  • that basically determines your level of happiness or comfort

  • or discomfort or pain or things like that.

  • So inside of this agent, if we believe

  • what I-- or the argument that I made previously, the system should

  • have some sort of world simulator

  • that allows you to predict what the state of the world is

  • going to be as a consequence of a sequence of actions.

  • And then two other modules.

  • These are sort of standard nomenclature in RL.

  • An actor that produces action proposals that can be

  • kind of simulated in the world.

  • And then a critic whose role is to predict

  • the long term expected value of this objective.

  • So this guy basically computes emotions.

  • So if this guy predicts that your objective function is

  • going to rise up, make you very unhappy or in pain,

  • that creates fear, essentially.

  • You don't want to get anywhere near that state.

  • And this guy predicts what happens.

  • So this guy predicts this.

  • This guy doesn't quite predict that.

  • But this guy actually predicts that as well.

  • And so now the problem becomes, how do we

  • train this world simulator?

  • Because the rest, we kind of know how to do it more or less.

  • We don't know how to build this.

  • But if we knew, we could do something like this.

  • Get the state of the world through your perception module,

  • initialize your world simulator,

  • propose a sequence of actions, and then

  • refine the sequence of actions so as

  • to minimize the expected cost computed by the critic.

  • And then train the actor to produce this optimal sequence

  • of actions.

  • And then take the first action.

  • And then kind of shift everything by one time step.
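
The loop just described (simulate candidate action sequences, score them with the critic, execute only the first action, then repeat one step later) can be sketched as follows. Several simplifications are assumptions of this example: the world is a hypothetical one-dimensional integer line, the simulator is exact, exhaustive search over short sequences replaces gradient-based refinement, and the step that trains the actor is left out.

```python
from itertools import product

ACTIONS = (-1, 0, 1)


def world_model(state, action):
    """Stand-in simulator: predicts the next state."""
    return state + action


def critic(state, goal):
    """Stand-in cost: distance from the goal (lower is better)."""
    return abs(state - goal)


def plan(state, goal, horizon=3):
    """Return the action sequence with the lowest predicted total cost."""
    def predicted_cost(sequence):
        total, s = 0, state
        for a in sequence:
            s = world_model(s, a)
            total += critic(s, goal)
        return total

    return min(product(ACTIONS, repeat=horizon), key=predicted_cost)


def run_agent(start, goal, steps=8):
    state, trajectory = start, [start]
    for _ in range(steps):
        first_action = plan(state, goal)[0]       # keep only the first action
        state = world_model(state, first_action)  # here the world equals the model
        trajectory.append(state)
    return trajectory


trajectory = run_agent(start=0, goal=5)
```

Replanning from the new state at every step plays the role of shifting the plan forward by one time step.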

  • So how do we learn forward models of the world?

  • This is an experiment that was done at Facebook

  • a couple of years ago by Adam Lerer, Sam Gross, and Rob Fergus

  • where they put a stack of cubes, this is in a simulator.

  • This isn't the real world.

  • And then they observe what actually occurs.

  • And then they train a convolutional net

  • to actually predict what's going to happen by kind of learning

  • the mask of the objects.

  • And what you get is a pretty accurate prediction

  • for this tower is going to fall this way.

  • But fairly fuzzy predictions for like, tall towers,

  • where it's kind of ambiguous where things are going to fall.

  • So you get those kind of fuzzy predictions here.

  • Because you can't exactly predict where

  • things are going to fall.

  • So how do we solve that problem?

  • I'm going to skip this.

  • So this is why predictive models are

  • good for question answering systems and natural language

  • processing.

  • But I'm going to skip this in the interest of time.

  • So, here's the problem we have to deal with.

  • Those towers can fall in a number of different directions

  • that we can't really predict just from the look of it

  • which direction they're going to fall into.

  • So it's kind of--

  • I don't know if we can find a pen here

  • or any kind of vertical thing.

  • I'm going to do it with a piece of paper.

  • So if I put this piece of paper here on the table,

  • and I let it go, you can be pretty sure it's going to fall.

  • But you can't really tell probably which direction

  • it's going to fall.

  • Every time I do it, it's probably

  • going to fall into a different direction.

  • So you can't really use supervised

  • learning to train something like this.

  • Because if I give the initial segment,

  • and then I ask the machine to predict, the machine predicts that.

  • If that happens, that's fine.

  • If this happens, then the machine

  • has to predict now this.

  • But now the next time over, it's going to predict that.

  • And so the best thing the machine can predict

  • is kind of an average of the outcomes, which

  • is not a good answer.

  • And so, something like this, where let's say you

  • observe two variables which have a dependency between them.

  • And this is pretty elementary for anybody who

  • works on probabilistic models.

  • But let's say these are the data points you observe.

  • Your world consists of two variables.

  • And these are your observations.

  • If I give you a particular value of Y2,

  • you can infer basically two values for Y1.

  • But if you try to learn this with an L2 least-squares criterion,

  • you're going to predict something right in the middle,

  • which is not a good answer.

  • So you have to predict, somehow be

  • able to predict one or the other,

  • but not an average of the two.

  • Or predict a distribution.

  • But how do you represent distributions

  • in high dimensional spaces?
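
The averaging effect of a least-squares criterion is easy to check numerically. In this sketch the four outcome values are made up to mimic the two-branch picture: for one fixed value of Y2, the observed Y1 values cluster around -1 and +1.

```python
def squared_error(prediction, outcomes):
    """Mean squared error of a single prediction against all outcomes."""
    return sum((prediction - y) ** 2 for y in outcomes) / len(outcomes)


# For one fixed Y2, the observed Y1 values fall into two clusters.
outcomes = [-1.0, -1.0, 1.0, 1.0]

# Search candidate predictions from -2.0 to 2.0 in steps of 0.1.
candidates = [i / 10 for i in range(-20, 21)]
best = min(candidates, key=lambda p: squared_error(p, outcomes))

error_at_best = squared_error(best, outcomes)
error_at_mode = squared_error(1.0, outcomes)
```

The minimizer is 0.0, halfway between the two clusters, and it scores better under L2 (error 1.0) than either value that actually occurs (error 2.0), even though 0.0 is never observed.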

  • So the unsupervised learning problem

  • is how do you capture the dependency between things

  • like this?

  • And one possible way is to learn a contrast function.

  • So basically, think of it as an energy function,

  • or negative log probability if you are a probabilist.

  • And these are your data points.

  • And you want those to have low energy, which

  • means high probability.

  • And you want everything else to have higher energy, or lower

  • probability.

  • So the blue points are the data that you observe.

  • The green points are not data.

  • And you want the energy of the green points

  • to be higher than the energy of the blue points.

  • So if you have a parametrised function that

  • computes this function in the space of Ys,

  • it's easy enough to tweak its parameters

  • so that when you see a blue point,

  • you make the output go down.

  • But how do you make sure that the value of your function

  • is higher outside of those points?

  • How do you generate those green points?

  • And that's basically-- there's basically

  • seven or eight different methods for doing this.

  • But I'm only going to talk about a couple.

  • And the first one is adversarial training.

  • So adversarial-- the basic idea of adversarial training

  • is basically the scenario I was talking about.

  • You have a predictor here.

  • And this predictor looks at the past,

  • let's say, if you want to do video prediction.

  • So it looks at the past.

  • And it has access to a source of random vectors

  • and is going to produce a prediction.

  • The precise prediction is going to depend

  • on the value of this vector.

  • And as the value of this vector changes,

  • this prediction goes through a set of plausible outputs,

  • let's say, represented by this red ribbon here.

  • So let's say we asked the machine.

  • We show the machine a small segment of video.

  • And we ask it, what is the world going

  • to look like half a second from now?

  • And the machine predicts this.

  • It predicts that the pen is going to fall to the back and the left.

  • And in fact, we let time pass by.

  • And what happens is this.

  • The pen falls to the back and slightly to the right.

  • So we don't want to punish the machine

  • for making the wrong decision here, because it's

  • qualitatively correct.

  • So what we'd like is we'd like an objective function that

  • tells us low cost if you are on this red ribbon, high cost

  • if you are outside.

  • And that's exactly what I was talking about earlier.

  • You want a function like this one

  • that tells you low cost if it's something

  • that looks reasonable.

  • High cost if it's not.

  • So the thing is, we don't know how to

  • characterize this function.

  • So we're going to have to learn it.

  • So adversarial training is you have two functions

  • you learn, one that predicts and one that tells the system

  • whether the predictions are good or not.

  • And basically it works like this.

  • So you have an initial segment of a video.

  • For example, if you do video prediction,

  • the data tells you here is how the video ends.

  • And you train this contrast function,

  • called the discriminator, or sometimes critic, actually,

  • to produce a low output for things that actually occur

  • in the world.

  • So those are the two blue points.

  • So we'll make the function take a low value for things

  • that actually occur.

  • And then you feed this past to the generator.

  • You have it generate a prediction,

  • which initially sucks.

  • And so you feed it to the discriminator.

  • And you tell the discriminator to produce a large output

  • here to make the output here.

  • So these are all of the green points.

  • Make that large.

  • And so next time around, the value here the discriminator

  • will produce for those predictions

  • is going to be higher.

  • But here is what you do simultaneously.

  • Simultaneously, you backpropagate gradients

  • through the discriminator to train the generator

  • to produce Ys that make the discriminator produce

  • low outputs.

  • OK.

  • So basically, the generator gets information

  • about how to change its parameters so as

  • to change its output so that the green points get closer

  • to the blue points, essentially, to a region

  • that the discriminator gives low energy to.

  • So eventually it looks like this, where the green points

  • match the blue points more or less in distribution

  • if you're lucky, because those things are kind of finicky.
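
The two simultaneous updates described above can be sketched in one dimension. This is a toy under stated assumptions, not the setup used in the talk: the discriminator is the energy E(y) = (y - w)^2 with a single parameter `w`, the generator is G(z) = theta + 0.1 z, and the hinge with `MARGIN` that stops the energy of generated points from being pushed up without bound is an added assumption. All constants are hypothetical.

```python
import random

random.seed(0)

REAL_MEAN = 3.0   # hypothetical location of the observed ("blue") points
MARGIN = 1.0      # assumed margin: stop raising the energy of fakes beyond this
LR = 0.05         # assumed learning rate

w = 0.0           # discriminator parameter, energy E(y) = (y - w)^2
theta = -2.0      # generator parameter, G(z) = theta + 0.1 * z

def energy(y, w):
    """Contrast function: low near w, high far from it."""
    return (y - w) ** 2

for step in range(2000):
    y_real = REAL_MEAN + random.gauss(0.0, 0.1)      # observed sample
    y_fake = theta + 0.1 * random.gauss(0.0, 1.0)    # generated sample

    # Discriminator step: lower the energy of the real sample and raise
    # the energy of the generated sample while it is below the margin.
    grad_w = -2.0 * (y_real - w)                     # d E(y_real) / dw
    if energy(y_fake, w) < MARGIN:
        grad_w += 2.0 * (y_fake - w)                 # d (MARGIN - E(y_fake)) / dw
    w -= LR * grad_w

    # Generator step: backpropagate through the discriminator.
    # dE/dy = 2 (y - w) and dy/dtheta = 1.
    grad_theta = 2.0 * (y_fake - w)
    theta -= LR * grad_theta

print("generator location:", round(theta, 2))
print("discriminator minimum:", round(w, 2))
```

In this toy the generated points drift toward the observed ones because the generator only ever sees the gradient of the energy. Real GANs use deep networks for both functions, which is where the finicky behavior comes from.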

  • And it works.

  • So you can train those things with past frames.

  • Or you can just train it on images to just generate images

  • from random vectors.

  • So this thing has access to a source of random vectors.

  • If you train this thing on images of bedrooms, you get--

  • those are non-existing generated bedrooms.

  • And they all look kind of reasonable,

  • except maybe for this guy.

  • It looks like an Austin Powers kind of bedroom, or whatever.

  • But you know, they all have a bed and windows and dressers

  • and lights, and stuff like that.

  • And those are basically a bunch of random numbers coming

  • into a convolutional net that has been trained

  • to produce bedroom images.

  • And they don't look like anything in a training set.

  • They're different from any training set image.

  • So there are various versions of those GANs.

  • There's a whole menagerie of different types of GANs

  • nowadays.

  • There are CycleGANs and InfoGANs and WGANs and IWGANs,

  • and an infinite number of GANs.

  • There is another family of generative models,

  • this type called variational autoencoders.

  • This is when trained on ImageNet.

  • So this is something called Energy-Based GAN trained

  • on ImageNet.

  • And it doesn't actually produce objects.

  • But it produces things that from far away kind

  • of look like objects, [INAUDIBLE] abstract.

  • This is trained on dogs.

  • It's kind of funny.

  • I mean, people do much better than this now.

  • But it's still funny.

  • OK.

  • So here is an example for video prediction.

  • So here it's a convolutional net that looks at four frames

  • and predicts two frames, two future frames.

  • And it looks at the images at multiple scales.

  • And there's all kinds-- and it's a pretty complicated

  • architecture.

  • And this is the prediction you get

  • if you train with least squares.

  • So you train this video predictor with least squares.

  • You get blurry predictions.

  • If you train it with this adversarial training

  • criterion combined with some others,

  • you get this kind of prediction, considerably sharper.

  • So the first four frames are observed.

  • The last two frames, indicated in red

  • here, are predicted.

  • And so you get--

  • the motions basically continue.

  • And they seem fairly reasonable.

  • There's a little bit of blurriness.

  • But it's not too bad.

  • This is when trained on video segments

  • from apartments in New York.

  • So the camera rotates.

  • And the system has to basically invent

  • what the room looks like as the camera rotates.

  • So here is a bookcase.

  • And this part of the bookcase-- so this is observed.

  • Now it's predicted.

  • This part of the bookcase is invented.

  • So it figures out that a bookcase has to continue.

  • It figures out that a couch has to continue.

  • So it captures some regularity of what an apartment in New

  • York is supposed to look like.

  • Something that maybe is more interesting for people

  • interested in self-driving cars.

  • This is a dataset called Cityscapes.

  • And-- oops.

  • And this is a system where you take a video sequence,

  • and you run a semantic segmentation system

  • on the video sequence.

  • So what you get is a bunch of maps

  • which give you the pixels that are

  • labeled for every category for every pixel.

  • So much like this, blue is car.

  • Sidewalk is pink.

  • And pedestrian is red.

  • And things like that.

  • And what this thing predicts is that-- so it

  • predicts in this case here half a second in the future.

  • It predicts that pedestrians keep crossing the street.

  • The car that is turning left keeps turning left.

  • The scenery keeps moving.

  • So it's useful if you want to work on self-driving cars

  • to have the ability to predict what's going to happen ahead

  • before it happens.

  • It might allow you to use this to train

  • for example, a reinforcement learning system

  • without actually crashing, but just by predicting

  • even a crash.

  • Here's a new model, a more recent one

  • just submitted, actually, called Error Encoding Network.

  • So this one-- in fact, the one that actually works

  • is slightly different from this one.

  • But this is a simpler version to explain.

  • So this one basically trains a model.

  • So it looks at the past.

  • It runs through a few layers of a neural net.

  • It produces an internal state.

  • And ignore the top for the time being.

  • Then runs through a generator essentially,

  • another part of a neural net that produces a prediction,

  • say a video, another frame in the video.

  • And you train this using least squares,

  • or something like this with what is actually observed.

  • And then you play a trick.

  • What you do is you take the difference between those two.

  • So this is a vector, the vector of the difference

  • between those two, the target and the prediction.

  • You feed this to a parametrised trainable function.

  • And then you feed the output of that function

  • to the hidden layer.

  • You add it to the hidden layer.

  • And you train this guy so that this variable

  • is going to take a value that minimizes the prediction error.

  • But this variable only depends on the prediction error.

  • And so basically, this part of the network,

  • when this value is set to zero, predicts

  • whatever is predictable.

  • And this guy basically parametrises

  • whatever is not predictable, which is a residual error,

  • and figures out how to represent the hidden latent variable that

  • will actually correct that mistake.
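
The data flow just described can be traced with scalars. This is a hypothetical sketch, not the published model: `A`, `C` and `P` stand in for the learned encoder, generator and error-encoding function, and `P` is set by hand so that the correction is exact, whereas in the talk all three are trained.

```python
# Hypothetical scalar stand-ins for the trained networks (assumptions).
A = 0.8        # encoder:        hidden = A * past
C = 2.0        # generator:      prediction = C * hidden
P = 1.0 / C    # error encoder:  latent = P * (target - deterministic prediction)

def encode(past):
    return A * past

def generate(hidden):
    return C * hidden

def predict(past, latent=0.0):
    """Prediction with the latent variable added to the hidden layer."""
    return generate(encode(past) + latent)

def infer_latent(past, target):
    """The latent variable depends only on the prediction error."""
    residual = target - predict(past, latent=0.0)
    return P * residual

past, target = 1.0, 2.5
deterministic = predict(past)              # latent = 0: the predictable part
latent = infer_latent(past, target)        # encodes what was not predictable
corrected = predict(past, latent)

print("deterministic prediction:", deterministic)
print("inferred latent:", latent)
print("corrected prediction:", corrected)
```

With the latent set to zero, the model outputs only what it can predict from the past. The latent carries the rest, which is why it can be read as an inferred action.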

  • So that might represent the--

  • for example, you observe someone playing a game

  • and moving something on the screen.

  • The physics of how things move on the screen

  • is essentially predictable.

  • That's Newtonian physics.

  • But the action that the player uses maybe isn't.

  • And so that would essentially represent the action

  • that the player played.

  • That would be very useful for things like imitation learning,

  • for example.

  • Here's an example of how this can be used.

  • And I'm probably going to end here.

  • So you have to wait a little bit.

  • So this is a dataset that was produced

  • by Sergey Levine, [INAUDIBLE] and a few other people

  • at Berkeley.

  • So there is an object.

  • There is a robot arm.

  • And the robot randomly pokes the object.

  • So the result is that after being poked,

  • the object has moved a little bit.

  • And these are predictions for how the object could

  • have been moved by the thing.

  • This is pure pixel prediction, pixel space prediction.

  • So the system has no notion of object or anything.

  • These are predictions it makes.

  • And each different prediction is generated by different sampling

  • of the Z variable, the latent variable, or the action

  • variable.

  • You can think of this as basically an encoding of what

  • the robot arm did without actually

  • having to observe what it did.

  • So it's action inference if you want.

  • OK.

  • I've spoken for long enough, so I'm

  • going to stop here and take your questions.

  • Thank you very much.

  • [APPLAUSE]

  • AUDIENCE: Hey.

  • [INAUDIBLE]

  • Real quick question.

  • So can you break-- so, let's just think about images.

  • Are you trying to-- or we use essentially biology and things

  • we know about the world to segment the image.

  • What if you took a camera and did

  • a combinatorial scramble, which is a huge potential scramble.

  • Does it break everything?

  • YANN LECUN: It scrambles the pixels?

  • AUDIENCE: It scrambles the pixels.

  • YANN LECUN: Yeah.

  • AUDIENCE: You know, it's combinatorially huge.

  • YANN LECUN: Yeah, that's right.

  • So if you do a fixed scramble and you

  • use a convolutional net, the convolutional net

  • will have a hard time figuring out the thing,

  • because it's based on the idea that neighboring pixels are

  • correlated.

  • And a local patch of pixels can be represented efficiently

  • by just those features.

  • So it probably would have a very hard time.

  • Now it turns out there's a paper by Pascal [INAUDIBLE]

  • on [INAUDIBLE] from way back where

  • they show that if you just-- if you take a collection of images

  • that you've perturbed through the fixed

  • permutation of the pixels, you can actually

  • recover the topology by figuring out

  • the local correlations between pixels.

  • So in principle, it would be possible to make this work

  • if you [? hardwired ?] this.

  • AUDIENCE: Thank you for giving a talk today.

  • I'm a big fan of you, actually.

  • [INAUDIBLE] talk to me.

  • And recently the D-Wave Systems and the quantum computer

  • is actually deployed in practice right now.

  • And how would you envision the quantum computing

  • affect the deep neural networks in general?

  • YANN LECUN: Yeah, it's--

  • if you didn't hear the question, it's

  • about whether quantum computing will affect deep learning

  • in some way.

  • It's not entirely clear to me.

  • So D-Wave is not actually deployed in practice.

  • It's experimented with by people.

  • And there are a few attempts.

  • But it's not actually used in practice

  • for commercial deployment, if that's the question.

  • So the D-Wave System is not a full quantum computer

  • in the sense that it uses quantum tunneling

  • for more efficient function optimization.

  • It's not entirely clear that you need this

  • at all for any of the tasks that I talked about.

  • So I think it's still up in the air

  • whether quantum computing will have any effect.

  • It's possible you could do nearest neighbor much

  • faster with quantum computing.

  • It's not even clear to me that you can, but it's possible.

  • So, it's unclear.

  • AUDIENCE: So I actually have two questions.

  • The first question is that [INAUDIBLE]

  • if the data point is very small, like in the area

  • of a [INAUDIBLE],, but only [INAUDIBLE] maybe X-Ray imaging

  • or even less.

  • [INAUDIBLE] So I read something about the [? zero ?] shot,

  • one shot, and [? two ?] shot [INAUDIBLE]..

  • So what do you think of [INAUDIBLE]..

  • And the second question is are any of the AI [INAUDIBLE]

  • developed by Facebook or developed [INAUDIBLE],,

  • [INAUDIBLE].

  • YANN LECUN: All right.

  • Yeah, OK.

  • Let me answer first question first.

  • So the small data regime.

  • There's basically currently two ways to handle it.

  • One is transfer learning.

  • So for example, you want to do image recognition.

  • And you want to do, I don't know,

  • medical imaging or something like this.

  • And you don't have enough data.

  • So one approach is you train your neural net

  • on a big data set that you actually

  • have, either with the same type of images,

  • or even completely different types of images, as long

  • as the statistics are similar, like ImageNet for example.

  • You know, it's not the same type of image.

  • But it's OK.

  • [INAUDIBLE]

  • And then you can do transfer learning.

  • So you take that pre-trained machine.

  • And then you retrain this machine on your data.

  • It helps if you just retrain the top two or three

  • layers, to limit the number of parameters.

  • That works really well.
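
As a rough sketch of retraining only the top of a pre-trained machine, the fixed function `features` below plays the role of the frozen lower layers and `head` holds the only trainable weights. The feature map, the five-point dataset and the learning rate are assumptions made up for illustration, not anything described in the talk.

```python
def features(x):
    """Stand-in for frozen, pre-trained lower layers (never updated)."""
    return [1.0, x, x * x]

# Small hypothetical target dataset: y = 1 + 2 x^2 observed at five points.
data = [(x, 1.0 + 2.0 * x * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

head = [0.0, 0.0, 0.0]   # the only parameters that get retrained
LR = 0.1                 # assumed learning rate

def predict(x):
    return sum(w * f for w, f in zip(head, features(x)))

def loss():
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

for _ in range(5000):
    # Gradient of the mean squared error with respect to the head only.
    grad = [0.0, 0.0, 0.0]
    for x, y in data:
        err = predict(x) - y
        for i, f in enumerate(features(x)):
            grad[i] += 2.0 * err * f / len(data)
    head = [w - LR * g for w, g in zip(head, grad)]

print("retrained head weights:", [round(w, 3) for w in head])
print("final loss:", loss())
```

Only three numbers are fitted here, so five examples are enough. The same reasoning is why freezing most of a large network makes a small dataset workable.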

  • So there is actually a service within Facebook

  • that uses this for the product division within Facebook.

  • So to give you an idea, there's 2.1 billion users on Facebook.

  • And the users upload on the order of 1.5 billion photos

  • every day.

  • So there's 1.5 billion a day.

  • Every single one of those photos goes

  • through four convolutional nets that we know about.

  • It goes through way more.

  • But these four pre-trained convolutional nets.

  • So one that basically recognizes tags

  • of various types on the image.

  • So recognizes objects.

  • It recognizes the type of images.

  • Is this a birthday or a wedding or landscape or indoor scene

  • or a [? macrophoto ?] or whatever.

  • There's a second one that--

  • and this is used for feed ranking basically,

  • to decide whether to show particular images

  • to particular people who have particular interests.

  • The second one filters objectionable content.

  • So basically, violence, pornography, things like that.

  • The third one generates captions for images,

  • for the visually impaired.

  • So that if you're blind and you're on Facebook,

  • you can get an idea of what's in the picture

  • by getting this text description.

  • And then the last one, which is turned on in the US,

  • but not in other countries, not in many other countries,

  • not turned on in Europe, does face detection.

  • So it tags your friends automatically.

  • So that was for the first question.

  • Now there's a second answer to the first question.

  • And the second answer to the first question

  • is you can use unsupervised training or pre-training.

  • So basically, you don't just train the system

  • to classify your medical images into cancer or non-cancer.

  • But you also train it to reconstruct its input.

  • And that has a regularization effect.

  • So there are situations, certain types of architectures,

  • things called ladder networks or stacked what-where autoencoders or U-Net,

  • where this type of learning actually

  • helps supervised learning and reduces

  • the need for labeled data.

  • OK.

  • So that was-- ultimately, I think

  • that unsupervised learning is going

  • to solve all of these problems.

  • Now your second question was about those bots

  • that there was a big story in the press a few months ago

  • that said that researchers at Facebook

  • had created two bots that were supposed to talk

  • to each other in English.

  • And they're supposed to cooperate to solve a task.

  • It's kind of a reinforcement learning type task.

  • And they ended up using English language

  • in ways that were not really initially predicted.

  • They would use a funny way to use words to express--

  • to communicate with each other.

  • And so some of the newspapers right after, it almost

  • said AI's going to kill us all.

  • Some tabloid published an article saying, oh my god,

  • Facebook researchers had this project where two bots invented

  • their own language.

  • And they had to like unplug the computer in panic mode,

  • because they were going to take over the world or something.

  • And it's completely insane, because there

  • was a blog post about it and a paper that was published.

  • And it's basically, these people are

  • interested in natural language understanding.

  • And they trained those systems to use English.

  • And they ended up not using English

  • in a way you would normally use it.

  • So they said, the experiment failed.

  • Let's try something else.

  • It's not like the Hollywood sci-fi movie

  • where you see these guys grabbing the electric cords,

  • and there's sparks flying and all that stuff, right.

  • Nothing like that.

  • But it's really funny how--

  • funny in a way, kind of depressing a little bit,

  • of how some of the press describes those things.

  • There were a lot of articles in more serious press afterward

  • that said that's complete bunk, which is good.

  • AUDIENCE: Thank you.

  • AUDIENCE: Hi.

  • I have a comment here.

  • I have a comment and a question.

  • First comment is that earlier you said

  • there are many systems that have a number of parameters

  • that is much more than the number of pixels

  • or whatever you're talking--

  • YANN LECUN: Samples.

  • AUDIENCE: Samples.

  • [INAUDIBLE]

  • I think from a statistics point of view,

  • it's the central limit theorem [? doing it's ?] [? job. ?]

  • That's my comment.

  • YANN LECUN: Which theory?

  • AUDIENCE: Central limit.

  • YANN LECUN: Oh, central limit theorem.

  • AUDIENCE: [INAUDIBLE] I think.

  • But, OK.

  • My second question is actually related to this.

  • Are there-- all your examples kind of work.

  • Are there any theoretical scientists,

  • computer scientists working on foundation

  • of these kinds of things.

  • What makes it converge, and what's not?

  • YANN LECUN: Yeah.

  • I mean, there's a lot of different types of people

  • working on those questions, some of them

  • are computer scientists, but many of whom

  • are either physicists or mathematicians.

  • So I've been--

  • I've been involved in an effort for many years

  • to try to get the applied math and pure math community

  • interested in those questions.

  • And I've only been successful in the last year or two.

  • Same for the physicists.

  • So basically, there are results in random matrix theory that

  • can be applied to the understanding

  • of the landscape of objective functions of those networks.

  • And it would seem to demonstrate,

  • to show that the number of saddle

  • points in those loss functions is combinatorially large.

  • But on the other hand, that there are--

  • although there might be a lot of local minima,

  • they're all pretty much of the same energy level.

  • So it doesn't matter which one you find.

  • And then there is empirical evidence to the fact

  • that the local minima are extremely degenerate.

  • So if you move in a large number of dimensions

  • around those local minima, the objective function

  • is essentially flat.

  • And there's a small number of directions where it's not flat.

  • That depends on the complexity of the problem.

  • And there's also empirical evidence

  • that [INAUDIBLE] showed in a paper, which is

  • that if you take two solutions.

  • So you start from two random initial conditions.

  • You train your neural net.

  • You get two different solutions.

  • Then you go straight line between the two.

  • And you barely go up.

  • And if you bend the path just a little bit,

  • then you can go from one minimum to the other without going up.

  • So that tends to show that there's basically

  • only one minimum.

  • It's very degenerate.

  • And it's connected everywhere.

  • The intuition that we have, the usual intuition

  • of a local minimum in one dimension is completely wrong.

  • Building a box in a hundred million dimensions

  • is very hard because you need a lot of walls.

  • So there's always going to be directions

  • where you can escape.

  • And that creates saddle points.

  • So that's one thing.

  • And then there is work on generalization ability.

  • Like why do those things generalize the way they do,

  • even though they are way overparameterized.

  • There's an interesting paper.

  • One of the co-authors is Ben Recht from Berkeley

  • recently where they showed that you can take an ImageNet-style

  • network, convolutional net.

  • You set the labels to completely random labels.

  • And those neural nets can still learn the training

  • set completely without errors.

  • One million training samples, they will just nail it,

  • 100% correct.

  • Of course, [? transition ?] error is chance.

  • But what that means is that there

  • is a huge amount of capacity in those networks

  • that they are able to recruit, if they need to.

  • But when you train them on things that make sense,

  • they don't overfit that much.

  • They do overfit, but not ridiculously.

  • AUDIENCE: Hi.

  • So it seems like it's very clear that it's

  • important to have a strong predictive model of the world

  • to achieve intelligence.

  • But it also seems like there may be other components

  • to it, things such as creativity or metacognition.

  • So do you have any thoughts on how

  • we might achieve those other parts of intelligence?

  • YANN LECUN: So metacognition probably

  • is number 562 in the list of problems

  • we have to solve that maybe has 1,000 items.

  • I'm not sure about that.

  • But creativity, I think those GANs actually exhibit

  • some level of creativity.

  • So there are people, for example,

  • at Rutgers, one of them is actually now

  • at Facebook, who used GANs to generate paintings,

  • abstract paintings in particular styles.

  • And they look really nice.

  • So that begs the question of is there,

  • what does creativity really mean?

  • We have a couple projects at Facebook

  • that I can't talk about yet, but soon, that involve also

  • creating kind of artistic artifacts using

  • those generative models.

  • And they look interesting.

  • People who actually are in the business of creating artifacts

  • are actually impressed.

  • AUDIENCE: Hi.

  • I do some particle physics here.

  • I'm an undergrad.

  • And one of the big problems that we're

  • facing in implementing technologies like this

  • is that the data we have is collected almost

  • from a third person perspective where you have access

  • to all the variable information in three dimensions.

  • And so it's very hard to take a first person camera view

  • perspective of an event and try to pick apart what's going on.

  • What are the major computational challenges--

  • what's the difference between taking like a camera

  • view of these scenes and dissecting them

  • with a convolutional neural net versus somehow finding

  • an effective way of analyzing three dimensional information?

  • YANN LECUN: OK.

  • So a number of different answers there.

  • So first of all, there is quite a lot of interest

  • for the use of convolutional nets

  • in the context of high energy physics,

  • basically for trajectory filtering essentially,

  • so filtering events that are interesting.

  • I'm sure that's the kind of stuff you were thinking of.

  • I actually gave a talk at CERN maybe a couple years ago,

  • or a year and a half ago, and met a bunch

  • of people working on this.

  • And it's really expanding.

  • There's a colleague of mine at NYU called Kyle Cranmer who

  • has been working on this kind of stuff

  • actually using those GANs.

  • He's come up with good ideas on characterizing trajectories

  • of generating models of trajectories.

  • So that said, very often, those trajectories are in 3D.

  • And you'd like to be able to basically analyze them in 3D.

  • So you could use those 3D convolutional nets

  • that I was talking about early in the middle of the talk.

  • They are sort of efficient for this,

  • because most of the voxels in high energy physics experiments

  • are empty.

  • So you would like to be able to concentrate the computation

  • where things are relevant.

  • That's one thing.

  • The second thing is that there is

  • a new set of ideas I didn't talk about called graph

  • convolutional nets, or spectral networks.

  • So it's basically the idea that an image, a normal image, you

  • can think of an image as a function on a grid graph,

  • on a regular grid.

  • The pixels form a grid.

  • You can think of it as a graph where each pixel is connected

  • to its nearest neighbors.

  • And that indicates that--

  • it's just a reflection of the fact

  • that neighboring pixels are correlated.

  • Now imagine now that you have data

  • that comes to you not in kind of a flat grid graph,

  • but in a weird graph, like a cylinder or something,

  • like the calorimeter in a high energy physics experiment,

  • or with some other set of sensors that is non-Euclidean.

  • You can actually define convolutions in those spaces.

  • And they're basically diagonal operators

  • in the graph Laplacian where the graph represents

  • the neighborhood relationships.

  • And so people have actually come up with ways

  • to apply convolutional nets to those non-Euclidean domains.
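
One simple way to build such an operator, sketched here under assumptions rather than taken from the talk, is a polynomial in the graph Laplacian L = D - A. Any polynomial in L is diagonal in the eigenbasis of L. The ring graph stands in for a cylinder-like sensor layout, and the filter weights `coeffs` are hypothetical values that would normally be learned.

```python
def ring_neighbors(n):
    """Adjacency list of a ring graph with n nodes (hypothetical sensor layout)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def laplacian_apply(neighbors, signal):
    """Apply the graph Laplacian L = D - A to a signal defined on the nodes."""
    return [
        len(neighbors[i]) * signal[i] - sum(signal[j] for j in neighbors[i])
        for i in range(len(signal))
    ]

def graph_conv(neighbors, signal, coeffs):
    """Polynomial filter: sum over k of coeffs[k] * L^k applied to the signal."""
    out = [0.0] * len(signal)
    power = list(signal)                         # L^0 applied to the signal
    for c in coeffs:
        out = [o + c * p for o, p in zip(out, power)]
        power = laplacian_apply(neighbors, power)
    return out

nbrs = ring_neighbors(6)
x = [0.0, 1.0, 0.0, 0.0, 2.0, 0.0]
coeffs = [1.0, -0.25]                            # assumed filter weights
y = graph_conv(nbrs, x, coeffs)

# Rotating the input around the ring rotates the output the same way.
x_rotated = x[-1:] + x[:-1]
y_rotated = graph_conv(nbrs, x_rotated, coeffs)

print(y)
print(y_rotated)
```

On a regular grid this reduces to an ordinary small convolution kernel. On an irregular graph the same code runs unchanged, because the only structure it uses is the neighborhood relationship.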

  • In fact, there is going to be a tutorial at NIPS

  • next week on precisely that topic in exactly one

  • week, Monday next week, which I'm a core speaker on.

  • But I'm actually going to speak.

  • There's going to be [INAUDIBLE].

  • AUDIENCE: You talked about--

  • sorry.

  • You talked about systems that both learn and reason.

  • And it seems to me like you argued that to get a strong AI,

  • you would need to do both of these things.

  • Now it seems to me like obviously humans do this.

  • But humans in a lot of ways are very dumb.

  • They make a lot of mistakes.

  • And they're very plastic.

  • And they need to learn to reason.

  • Whereas a lot of AI systems and reinforcement learning systems

  • do something very smart that takes

  • a lot of computational power.

  • And it's very much hard coded.

  • Do you think we'll see a trend towards dumber and more

  • plastic reasoning systems?

  • YANN LECUN: So I think most reinforcement--

  • Michael, correct me if I'm wrong.

  • But I think most reinforcement learning systems

  • that people are training today actually

  • are completely reactive.

  • They are very simple in terms--

  • I mean, there's very little actual reasoning.

  • Other than things like AlphaGo, AlphaGo Zero,

  • where there is tree exploration in the set of possible futures,

  • which is used for training.

  • Once it's trained, it actually just plays

  • without much tree exploration, actually.

  • So there's not a huge amount of reasoning there.

  • And that's a limitation not of reinforcement learning per se,

  • but of the architectures we use for all of our AI systems.

  • So I think what we consider I think

  • intelligent behavior involves this ability to predict.

  • In fact, I think the essence of intelligence

  • really is the ability to predict.

  • And so if you have a good model of the world that

  • is accurate for prediction, then you

  • can use it to plan a sequence of actions ahead

  • and perhaps moderate uncertainties about it.

  • And things like this.
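
Planning with a predictive model can be sketched as a search over action sequences. This is a deliberately tiny example with assumptions throughout: `world_model` is a hand-written one-dimensional forward model rather than a learned one, and the planner simply enumerates every sequence up to a short horizon.

```python
from itertools import product

def world_model(state, action):
    """Hypothetical forward model: predicts the next state from state and action."""
    return state + action

def rollout(state, actions):
    """Simulate a whole action sequence with the model, without acting."""
    for a in actions:
        state = world_model(state, a)
    return state

def plan(start, goal, horizon, actions=(-1, 0, 1)):
    """Pick the action sequence whose predicted final state is closest to the goal."""
    return min(
        product(actions, repeat=horizon),
        key=lambda seq: abs(rollout(start, seq) - goal),
    )

best_plan = plan(start=0, goal=3, horizon=4)
print("planned actions:", best_plan)
print("predicted final state:", rollout(0, best_plan))
```

Exhaustive enumeration only works for toy problems. The point is that every candidate is evaluated inside the model, so nothing has to be tried in the real world first.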

  • So this is what reasoning really is about,

  • is predicting ahead what's going to happen, not necessarily

  • in time.

  • But also sort of simulating, so manipulating models.

  • Like when you think in your head about mathematics

  • or various other things, very often,

  • you have mental models that you manipulate.

  • They are simulators in a way.

  • You give them inputs, and they change.

  • And things like that.

  • That I think is really the essence

  • of reasoning and intelligence.

  • ROSSI LUO: Looking at the clock, it's 5:30.

  • I'm going to take one last question.

  • And if you have additional questions,

  • you probably just [? briefly ?] to [? floor ?] discussions

  • afterwards.

  • AUDIENCE: What's-- I'm not that familiar with deep learning

  • neural nets.

  • But I'm curious.

  • If I wanted to learn an object up

  • to something like affine transformations,

  • can I do transfer learning to do that?

  • Can you learn a whole group of transformations,

  • and then learn an object and then

  • have the object under those transformations?

  • YANN LECUN: So yes and no.

  • So if you take a convolutional net, for example,

  • and you train it on datasets like ImageNet that

  • have lots of different instances of the same objects and various

  • and things like this, it learns the notion

  • of object relatively independently of the viewpoint,

  • but not completely.

  • So it has to recognize a dog, whether it's a profile

  • view or a frontal view.

  • But if you take the head of the dog upside down,

  • it probably won't be able to recognize it.

  • The same way we have a hard time recognizing people

  • when their faces are upside down.

  • AUDIENCE: Not exclu-- little rotations,

  • shears, things like that.

  • YANN LECUN: Right, right.

  • So small rotation, shears, and scaling,

  • that's handled by the pooling operation

  • in convolutional nets.

  • AUDIENCE: Right.

  • But there's nothing, no explicit geometric--

  • YANN LECUN: No.

  • There's no explicit 3D geometry.

  • And there is no real explicit 3D geometry,

  • except for the fact that whenever a feature is

  • detected in one location, it's also

  • detected in other locations.

  • And the fact that there is this pooling operation

  • that basically builds a little bit of resist--

  • smoothness to variations of the location

  • of particular features.
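
A minimal sketch of why pooling tolerates small shifts, assuming non-overlapping one-dimensional max pooling and made-up feature maps: a detection that moves by one position inside the same pooling window leaves the pooled output unchanged, while a larger move does not.

```python
def max_pool(feature_map, window):
    """Non-overlapping 1D max pooling."""
    return [
        max(feature_map[i:i + window])
        for i in range(0, len(feature_map), window)
    ]

# A feature detected at position 2, the same feature shifted by one position,
# and the same feature moved into the next pooling window.
detected = [0, 0, 1, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 1, 0, 0, 0, 0]
moved_far = [0, 0, 0, 0, 0, 0, 1, 0]

print(max_pool(detected, 4))
print(max_pool(shifted, 4))
print(max_pool(moved_far, 4))
```

The tolerance is local and partial, which matches the answer above: small rotations, shears and scalings are absorbed, but nothing here models geometry explicitly.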

  • So small variations of the position of elementary features

  • due to rotation, shear, and things like this,

  • will actually--

  • AUDIENCE: You're pooling them.

  • And that's why you're getting them.

  • But you're not explicitly modeling.

  • Same thing with Newtonian physics.

  • There's no built in physics yet, right?

  • YANN LECUN: Right.

  • There's-- no.

  • No built in physics.

  • AUDIENCE: Thank you.

  • ROSSI LUO: The main event I think is over.

  • And if you have additional questions,

  • you're welcome to briefly discuss with Professor LeCun

  • afterwards.

  • And thanks [INAUDIBLE].

  • And let's give Professor Yann LeCun a round of applause.

  • [APPLAUSE]
