
  • YOSHUA BENGIO: [INAUDIBLE].

  • Thank you [INAUDIBLE].

  • So I'll talk about [INAUDIBLE].

  • I'll talk about representations and learning

  • representations.

  • And the word deep here, I'll explain what it means.

  • So my goal is to contribute to building intelligent machines,

  • also known as AI.

  • And how do we get a machine to be smart--

  • to take good decisions?

  • Well, it needs knowledge.

  • [INAUDIBLE]

  • [? researchers ?]

  • from the early days--

  • '50s, '60s, '70s--

  • tried to give the knowledge to the machine--

  • the knowledge we have explicitly.

  • And it didn't work quite as well as was hoped.

  • One reason is that a lot of our knowledge is not something

  • we can communicate verbally and write down in a program.

  • So that knowledge has to come from somewhere else.

  • And basically what we have found is you can get that

  • knowledge through observing the world around us.

  • That means learning.

  • OK-- so we need learning for AI.

  • What is learning?

  • What is machine learning?

  • It's not about learning things by heart.

  • That's just a fact.

  • What it is about is generalizing from the examples

  • you've seen to new examples.

  • And what I like to tell my students is it's taking

  • probability mass-- that is, on the training examples and

  • somehow guessing where it should go-- which new

  • configurations of the things we see make

  • sense or are plausible.

  • This is what learning is about.

  • It's guesswork.

  • At first we can measure [INAUDIBLE] we can guess.

  • And I'll mention something about dimensionality and

  • geometry that comes up when we think about this [INAUDIBLE].

  • And one of the messages will be that we can maybe fight

  • this [? dimensionality ?]

  • problem by allowing the machine to discover underlying

  • causes-- the underlying factors that explain the data.

  • And this is a little bit like [INAUDIBLE] is about.

  • So let's start from learning, an easy [INAUDIBLE]

  • of learning.

  • Let's say we observe x,y pairs where x is a number--

  • y is a number.

  • And the stars here represent the examples we've seen of x,y

  • configurations.

  • So we want to [? generalize ?] for new configurations.

  • In other words, for example, in this problem, typically we

  • want to predict a y given a new x.

  • And there's an underlying relationship between y and x,

  • meaning the expected value of the y given x, which is given

  • with this purple curve.

  • But we don't know it.

  • That's the problem with machine learning.

  • We're trying to discover something

  • we don't know already.

  • And we can guess some function.

  • This is the predicted or learned function.

  • So how could we go about this?

  • One of the most basic principles by which machine

  • learning algorithms are able to do this is assume something

  • very simple about the world around us-- about the data

  • we're getting or the function we're trying to discover.

  • It's just assuming that the function we're trying to

  • discover is smooth, meaning if I know the value of the

  • function at some point x, and I want to know the

  • value at some nearby point x prime, then it's reasonable to

  • assume that the value at x prime of the function we want to

  • learn is close to the value at x.

  • That's it.

  • I mean, you can formalize that and [INAUDIBLE] in many

  • different ways and exploit it in many ways.

  • And what it means here is if I ask what y

  • should be at this point--

  • what I'm going to do is look up the value of y that I

  • observed at nearby points.

  • And combining these--

  • make a reasonable guess like this one.

  • And if I do that on problems like this, it's actually going

  • to work quite well.
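
The local averaging he describes can be made concrete with a tiny sketch. This is a minimal Nadaraya-Watson style kernel estimate, assuming a Gaussian weighting and an arbitrary bandwidth; the toy data stands in for the stars on the slide.

```python
import numpy as np

def local_average_predict(x_train, y_train, x_query, bandwidth=0.5):
    """Predict y at x_query by averaging the observed y's of nearby
    training points, weighting each one by how close its x is.
    This encodes only the smoothness assumption: nearby x's have
    similar y's."""
    weights = np.exp(-0.5 * ((x_train - x_query) / bandwidth) ** 2)
    return np.sum(weights * y_train) / np.sum(weights)

# Toy data: noisy observations of an unknown smooth curve.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=50)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=50)

print(local_average_predict(x_train, y_train, x_query=3.0))
```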

  • And a large fraction of the applications that we're

  • seeing use this principle.

  • And [INAUDIBLE]

  • enough of just this principle.

  • But if we only rely on this principle for generalization,

  • we're going to be in trouble.

  • That's one of the messages I want to explain here.

  • So why are we going to be in trouble?

  • Well, basically we're doing some kind of interpolation.

  • So if I see enough examples--

  • the green stars here-- to cover the ups and downs of the

  • function I'm trying to learn, then I'm going to be fine.

  • But what if the function I want to learn has many more

  • ups and downs than I can possibly observe through data?

  • Because even Google has a finite number of examples.

  • Even if you have millions or billions of examples, the

  • functions we want to learn for AI are not like this one.

  • They have--

  • the number of configurations of the variables of interest-- that

  • may be exponentially large.

  • So something maybe bigger than the number of

  • atoms in the universe.

  • So there's no way we're going to have enough examples to

  • cover all the configurations.

  • For example, think of the number of different English

  • sentences, which is something that Google is interested in.

  • And this problem is illustrated by the so-called

  • curse of dimensionality where you consider what happens when

  • you have not just one variable but many variables and all of

  • their configurations.

  • How many configurations of [? N ?] variables do you have?

  • Well, you have an exponential number of configurations.

  • So if I wanted to learn about a single

  • variable-- say it takes a real value--

  • I can just divide its value into intervals.

  • And I count how many times each of those bins occurs in my data.

  • I can estimate the probability of different intervals coming up.

  • So that's easy because I only want to know about a small

  • number of different configurations.

  • But if I'm looking at two variables, then the number of

  • configurations may be the square of that--

  • even bigger-- and with three variables, even more.

  • But typically, I'm going to have hundreds-- if you're

  • thinking about images, it's thousands-- tens of

  • thousands-- hundreds of thousands.

  • So it's crazy how many configurations there are.
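
As a back-of-the-envelope illustration of why this bin-counting strategy blows up, here is the arithmetic, assuming 10 bins per variable (the bin count is an arbitrary illustrative choice):

```python
# Number of cells when each of d variables is split into 10 bins.
for d in [1, 2, 3, 10, 100, 1000]:
    print(f"{d} variables -> 10^{d} = {10 ** d} cells to fill with examples")
# One variable needs only 10 cells; a small image with thousands of
# pixels would need more cells than there are atoms in the universe,
# so most cells can never contain a single training example.
```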

  • So how do we possibly generalize to new

  • configurations?

  • We cannot just break up this space into small cells and

  • count how many things happen in each cell because the new

  • examples that we care about-- new

  • configurations that we're asked about-- might

  • be in some region where we hadn't seen any data.

  • So that's the problem of generalizing [INAUDIBLE].

  • So there's one thing that can help us, but it's not going to

  • be sufficient.

  • It's something that happens with these AI tasks.

  • It's very often [INAUDIBLE] vision, [INAUDIBLE]

  • processing and understanding and many other problems where

  • the set of configurations of variables that are plausible--

  • that can happen in the real world--

  • occupy a very small volume of all this set of possible

  • configurations.

  • So let me give an example.

  • In images, if I choose the pixels in an image randomly--

  • in other words, if I sample an image from completely uniform

  • distribution, I'm going to get things like this.

  • Just [INAUDIBLE].

  • And I can repeat this for eons and eons.

  • And I'm never going to assemble something that looks

  • like a face.

  • So what it means is that faces--

  • images of faces--

  • are very rare in the space of images.

  • They occupy a very small volume, much less than what

  • this picture would suggest.

  • And so this is a very important hint.

  • It means that actually the task is to find out where this

  • distribution concentrates.

  • I have another example here.

  • If you take the image of a four like this one and you do

  • some geometry transformations to it like rotating it,

  • scaling it, you get slightly different images.

  • And if at each point, you allow yourself to make any of

  • these transformations, you can create a so-called manifold--

  • so a surface of possible images.

  • Each point here corresponds to a different image.

  • And the number of different changes that you make is

  • basically the dimensionality of this manifold.

  • So in this case, even though the data lives in the high

  • dimension space, the actual variations we care about are

  • of low dimensionality.

  • And knowing that, we can maybe do

  • better in terms of learning.
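
A small sketch of the manifold idea he describes: generating variants of one image by a handful of continuous transformations. The image and transformation ranges below are placeholders, and scipy's generic rotate/shift stand in for whatever transformations the slide used.

```python
import numpy as np
from scipy.ndimage import rotate, shift

# A stand-in 28x28 "digit" image (random here; on the slide it is a 4).
image = np.random.rand(28, 28)

# Each small rotation or translation gives a new point in the
# 784-dimensional pixel space, but the whole family is indexed by
# only three numbers (angle, dx, dy): a low-dimensional manifold.
variants = []
for angle in np.linspace(-15, 15, 7):        # degrees of rotation
    for dx in (-2, 0, 2):                    # horizontal shift in pixels
        v = rotate(image, angle, reshape=False)
        v = shift(v, (0, dx))
        variants.append(v.ravel())

print(len(variants), "images, each a point in", variants[0].size, "dimensions")
```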

  • One thing about the curse of dimensionality is I don't like

  • the name curse of dimensionality because it's

  • not really dimensionality.

  • You can have many dimensions but have

  • a very simple function.

  • What really matters is how many variations does the

  • function have-- how many ups and downs?

  • So we actually had some fairly [? cool ?] results about--

  • the number of examples you would need if you were only

  • relying on this smoothness assumption, essentially is

  • linear in the number of ups and downs of the function

  • [INAUDIBLE].

  • So let's come back to this idea of learning where to put

  • probability mass.

  • So in machine learning, what we have is data.

  • Each example is a configuration of variables.

  • And we know that this configuration [? occurred ?]

  • in the real world.

  • So we can say the probability for this configuration.

  • So this is the [? space ?] of configuration

  • I'm showing in 2D.

  • So we know that this configuration is plausible.

  • [INAUDIBLE].

  • So we can just put a [? peak ?] of probability

  • mass here.

  • And we can put a [? peak ?] at every example.

  • The question is how do we take this probability mass and sort

  • of give a little bit of that to other places.

  • In particular, we'd like to put mass in between if there

  • really was a manifold that has some structure and if we could

  • discover that structure, it would be great.

  • So the classical machine learning way of doing things

  • is say that the distribution function-- the function that

  • you're trying to [? learn ?] in this case is smooth.

  • So if it's very probable here, it must be also probable in

  • the neighborhood.

  • So we can just do some mathematical operation that

  • will shift some mass from here to the different

  • neighbors.

  • Then we get a distribution like this as our model.

  • And that works reasonably well.

  • But it's not the right thing to do.

  • It's putting mass in many directions

  • we don't care about.

  • Instead, what we're going to do is to discover that there

  • is something about this data.

  • There is some structure.

  • There is some abstraction that allows us to be very specific

  • about where we're going to put probability mass.

  • And we might discover with something like this, which in

  • 2D doesn't look like a big difference.

  • But in high dimensions, the number of directions you're

  • allowed to move here is very small compared to the number

  • of dimensions here.

  • And the volume goes exponentially with dimension.

  • So you can have a huge [? gain ?] by guessing

  • properly which directions things are allowed to move in--

  • the directions that keep high probability.

  • So, now to the core of this presentation which is about

  • representation learning.

  • I've talked about learning in general

  • and some of the issues--

  • some of the challenges with applying learning to AI.

  • Now, when you look at how machine learning is applied in

  • industry, what people do 90% of the time-- what they spend

  • the effort of engineers on-- is not really

  • improving machine learning.

  • They use existing machine learning.

  • But to make the machine learning [INAUDIBLE] work

  • well, they do [INAUDIBLE]

  • feature engineering.

  • So that means taking the raw data and transforming it--

  • extracting some features-- deciding what matters--

  • throwing away the things that we think don't matter.

  • And that's essentially using humans and our intelligence

  • and our understanding of the problem to figure out the

  • factors that matter-- to figure out the dependencies

  • that matter and so on.

  • So what representation learning is about is trying to

  • do with machines what humans do right now, which is

  • extracting those features--

  • discovering what is a good representation for your data.

  • And one way to think about it is the machine

  • is trying to guess--

  • not just those features or those computations that are

  • useful for us to explain our [INAUDIBLE]

  • but really what are the underlying factors that

  • explain the [INAUDIBLE]?

  • What are the underlying causes?

  • And the guesses about what these are for our particular

  • example is exactly what we'd like to have as our

  • representation.

  • Of course, this is hard to define because we don't know

  • what the right factors are, what are the

  • right causes of it.

  • This is the objective we have.

  • This is [INAUDIBLE] by the way.

  • So there is a very important family of algorithms, as

  • [INAUDIBLE] mentioned,

  • that have multiple levels like this and that have been around

  • since the '80s.

  • And they have multiple layers of [? computations. ?]

  • And one of the things I've been trying to do is to find

  • some properties that they have that other algorithms may have

  • that may be useful and try to understand why these

  • properties are useful.

  • In particular, there's the [INAUDIBLE] of depth.

  • So the idea of deep learning is that not only are you going

  • to have representations of the data that are

  • learned.

  • But you're going to have multiple levels of

  • representation.

  • And why would it matter to have multiple levels of

  • representation?

  • Because you're going to have low level and high level

  • representations where high level representations are

  • going to be more abstract--

  • more nonlinear--

  • capture structure that is less obvious in the data.

  • So what we call deep learning is when the learning algorithm

  • can discover these representations and even

  • decide how many levels there should be.

  • So I mentioned neural networks

  • as the original example of deep learning.

  • What these algorithms do is they learn some computation--

  • some function that takes an input vector and maps it

  • to some output, which could be a vector, through different

  • levels of representation where each level is composed of

  • units which do a computation that's inspired by how the

  • neurons in the brain work.

  • So they have a property which you don't find in many

  • learning algorithms called distributed representations.

  • So let's first see how these other learning

  • algorithms work--

  • how they generalize.

  • Remember, this is going to be very similar to when I talked

  • about the smoothness

  • assumption.

  • They rely on this smoothness assumption.

  • Deep learning also relies on this smoothness assumption

  • but introduces additional priors-- additional

  • knowledge, if you will.

  • So when you only rely on this smoothness assumption, the

  • way you work is you essentially take your

  • [? input ?] space--

  • [INAUDIBLE] space, and break it up into regions.

  • For example, this is what happens with clustering,

  • nearest neighbors, SVMs, many classical statistical

  • non-parametric algorithms, decision trees and

  • [? so on. ?]

  • So what happens is after seeing the data, you break up

  • the input space into regions, and

  • you generalize locally.

  • So if you have a function that outputs something here--

  • because you've seen an example here for example--

  • you can generalize and say, well in the neighborhood, the

  • output is going to be similar and maybe some kind of

  • interpolation with the neighboring regions is going

  • to be performed.

  • But the crucial point from a mathematical point of view is

  • that there's a counting argument here, which is how

  • many parameters-- how many degrees of freedom do we have

  • to define this partition?

  • Well, basically, you need at least one

  • parameter per region.

  • The number of parameters is going to grow with the number

  • of regions.

  • See-- if I want to distinguish two regions, I need to say

  • where the first one is or how to separate between these two.

  • And for example, [INAUDIBLE]

  • specifying the center of each region.

  • So the number of things I have to specify from the data is

  • essentially equal to the number of regions I can

  • distinguish.

  • So you can think well, there's no other way you could do

  • that, right?

  • I mean, how could you possibly create a new region where

  • you did not see any data, and distinguish it

  • meaningfully?

  • Well, you can.

  • Let me give you an example.

  • So this is what happens with distributed representations,

  • which happens with things like factor models, PCA, RBM,

  • neural nets, sparse coding, and deep learning.

  • What you're going to do is you're going to still break up

  • the input space into regions and be able to

  • generalize locally in a sense that things that are nearby

  • are going to have similar outcomes.

  • But the way you're going to learn that

  • is completely different.

  • So, for example, you can [INAUDIBLE]

  • space.

  • What I'm then doing is that I'm going to break it down in

  • different ways that are not mutually exclusive.

  • So, here, what I'm thinking about when I'm building this

  • model is there are three factors that explain the two

  • inputs that I'm seeing.

  • So this is two-dimensional input space.

  • And I'm breaking the space into different regions.

  • So, for example, the black line here tells me that you're

  • either on that side of it or the other side of it.

  • On that side, it's [? T1 ?] equals 1.

  • On that side, it's [? T1 ?] equals 0.

  • So this is a bit that tells me whether I'm in

  • this set or this set.

  • And I have this other bit that tells me whether I'm on

  • this set and this set, or that set.

  • And now you can see that the number of regions I've defined

  • in this way could be much larger than the number of

  • parameters.

  • Because the number of parameters only grows

  • with the number of factors-- the number of these [INAUDIBLE].

  • So by being smart about how we define those regions by

  • allowing the [INAUDIBLE] to help us, you can get

  • potentially exponential gain in expressive power.
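
A quick numerical sketch of that counting argument, using three random hyperplanes in a 2-D input space as the "factors" (the counts and random choices are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three "factor" hyperplanes in a 2-D input space, as on the slide.
# Each contributes one bit: which side of the line a point falls on.
n_factors, dim = 3, 2
W = rng.normal(size=(n_factors, dim))
b = rng.normal(size=n_factors)

# Sample many points and record their 3-bit codes.
x = rng.uniform(-3, 3, size=(100_000, dim))
codes = (x @ W.T + b > 0)

regions = {tuple(c) for c in codes}
print(n_factors, "factors carve the plane into", len(regions), "regions")
# A purely local method needs roughly one parameter set per region,
# whereas here the number of distinguishable regions can grow much
# faster than the number of learned hyperplanes.
```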

  • Of course, from the machine learning point of view, this

  • comes with an assumption.

  • The assumption is that when I learn about being on that side

  • or that side, it's meaningful [INAUDIBLE] in some sense--

  • not quite in a statistical sense--

  • of what happens with the other configurations--

  • the other half of it.

  • So that makes sense if you think of, OK, this is images.

  • And this one is telling me is this a male or a female?

  • This one's telling me, does he wear glasses or not?

  • Is he tall or short, something like that.

  • So if you think about these factors as [INAUDIBLE]

  • meaningful things, usually, you can vary them [INAUDIBLE],

  • like the causes that explain the world around us.

  • And that's why you're able to generalize.

  • You're assuming something about the world that gives you

  • a kind of exponential power of representation.

  • Now, of course, in the real world, the features we care

  • about, the factors we care about are not going to be

  • simple, linear, or separated.

  • So that's one reason why we need deep representations.

  • Otherwise, just a single level would be enough.

  • Let me move on because time is flying.

  • So this is stolen from my brother, Samy, who gave a talk

  • here not long ago where they used this idea of representations

  • in a very interesting way where you have data of two

  • different modalities.

  • You have images.

  • And you have text queries--

  • short sequence of words.

  • And they learned a representation for images, so

  • they map the image to some high-dimensional vector and

  • they learn a function that represents queries.

  • So they map the query to a point, also high-dimensional, in

  • the same space.

  • And they learn them in such a way that when someone types

  • "dolphin" and then is shown an image of a dolphin and then

  • clicks on it, the representation for the image

  • and the representation for the query end up

  • close to each other.

  • And in this way, once you learn that, you can of course

  • [INAUDIBLE] things like answering new queries you've

  • never seen and find images that match queries that

  • somehow you haven't seen before.
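
What "end up close to each other" can mean as a training objective is sketched below with a margin-based ranking loss; the learned mappings from pixels and query words to vectors are not shown, and all the names and values here are hypothetical stand-ins rather than the actual system.

```python
import numpy as np

def distance(a, b):
    """Euclidean distance between two embeddings in the shared space."""
    return np.linalg.norm(a - b)

def ranking_loss(image_vec, clicked_query_vec, other_query_vec, margin=1.0):
    """Penalize the model unless the clicked (image, query) pair is
    closer together than the image is to an unrelated query,
    by at least a margin."""
    return max(0.0, margin
               + distance(image_vec, clicked_query_vec)
               - distance(image_vec, other_query_vec))

# Toy embeddings standing in for the learned image and query mappings.
image_vec = np.array([1.0, 0.0, 0.5])
dolphin_query_vec = np.array([0.9, 0.1, 0.4])
unrelated_query_vec = np.array([-1.0, 2.0, 0.0])
print(ranking_loss(image_vec, dolphin_query_vec, unrelated_query_vec))
```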

  • One question that people outside of machine learning

  • ask when they consider what machine learning people are

  • doing is: this is crazy.

  • Humans can learn from very few examples.

  • And you guys need thousands or millions of examples.

  • I mean, you're doing something wrong.

  • And they're right.

  • So how do humans manage to constantly learn something

  • very complicated from just a few examples?

  • Like, how do students learn something?

  • Well, there are a number of answers.

  • One is brains don't start from scratch.

  • They have some priors.

  • And in particular, I'm interested in generic priors

  • that allow us to generalize to things that [INAUDIBLE]

  • didn't train our species to do.

  • But still they do very well.

  • So we have some very general purpose priors we are born

  • with, and I'd like to figure out which they are because we

  • can exploit them as well.

  • Also-- and this is very, very important--

  • if you ask a newborn to do something, it

  • wouldn't work very well.

  • But of course, an adult has learned a lot of things before

  • you give him a few examples.

  • And so he's transferring knowledge from [INAUDIBLE].

  • This is [? crucial. ?]

  • And the way he's doing that is he's built in his mind

  • representations of the objects-- of the types of the

  • modalities which are given in the examples.

  • And these representations capture the relationships

  • between the factors-- the explanatory factors that

  • explain what is going on in your particular

  • setup of the new task.

  • And he's able to do that from unlabeled data--

  • from examples that were unrelated to the task we're

  • trying to solve.

  • So one of the things that humans are able to do is to do

  • what's called semi-supervised learning.

  • They're able to use examples that are not specifically for

  • the task you care about to generalize.

  • They are able to use information about the

  • statistical structure of the things around us to

  • better answer new questions.

  • So here, let's say someone gives me just two examples.

  • We want to discriminate between the

  • green and the blue.

  • And the classical algorithm would do something like put a

  • straight line in between.

  • But what if you knew that there are all these other

  • points that are not [INAUDIBLE]

  • related to your task.

  • But these are the configurations that are

  • plausible in the [INAUDIBLE] distribution.

  • So those [INAUDIBLE] ones, you don't know if they

  • are green or blue.

  • But by the structure here, we guess that these ones are all

  • blue and these ones are all green.

  • And so you would put your decision like this.

  • So we're trying to take advantage of data from other

  • tasks that enables us to find something generic about the

  • world like [INAUDIBLE] usually happen in this direction and

  • use that to quickly generalize from very few

  • examples to new examples.

  • So among the motivations for learning about depth, there

  • are theoretical motivations that come from the discovery

  • of families of functions-- mathematical functions--

  • that can be represented very efficiently if you allow

  • representations with more levels, but that might

  • require exponentially more units--

  • bigger representations--

  • if you're only allowed one or two levels.

  • Even though one or two levels are enough to represent

  • any function, it might be very inefficient.

  • And of course, there are biological motivations, like

  • the brain seems to have this kind of deep

  • architecture.

  • [? It's especially ?] true of the visual cortex, which is

  • the part we understand best.

  • And that the cortex seems to have a generic learning

  • algorithm whose principles seem to be at work

  • everywhere in the cortex.

  • Finally, there are cognitive motivations [INAUDIBLE].

  • We learn simpler things first.

  • And then we compose these simpler things to build

  • high level abstractions.

  • This has been exploited, for example, in the work of

  • [INAUDIBLE]

  • [? Stanford-- ?]

  • by [INAUDIBLE] and [INAUDIBLE] and others to show how

  • [INAUDIBLE] representations can learn simple things like

  • edges--

  • combine them

  • to form parts-- combine them to form faces

  • and things like that.

  • Another sort of simple motivation is how do you

  • program computers.

  • Do we program computers by having a main program that has

  • a bunch of lines of code?

  • Or do we program computers by having functions or

  • subroutines that call subroutines, that call other

  • subroutines?

  • This is [? the new ?] program.

  • If we were forced to program that way, it

  • wouldn't work very well.

  • But most of machine learning is basically trying to solve

  • the [INAUDIBLE] in this--

  • not in the programs they use but in the structure of the

  • functions that are learned.

  • And there are also, of course, motivations from looking at

  • what can be achieved by exploiting depth.

  • So I'm stealing this slide from another Google [? talk ?]

  • given by Geoff Hinton last summer, which shows how deep

  • nets, compared to the standard approach, which has been the

  • state-of-the-art in speech recognition for 30 years, can

  • substantially improve results by exploiting these multiple

  • levels of representation--

  • even-- and this is something new that impressed me a lot--

  • even when the amount of data available is huge, there is a gain in

  • using these representations-- these representation learning

  • algorithms.

  • And this all comes from something that happened in

  • 2006 when first Geoff Hinton followed by a group here in

  • Montreal and the group at NYU in New York found that

  • you could actually train your deep neural network by using a

  • few simple tricks.

  • And the simple trick essentially was that we're going

  • to train layer by layer using unsupervised

  • learning, although recent work now allows us to train deep

  • networks without this trick and using other tricks.

  • This has given rise to lots of industrial interest, as I

  • mentioned--

  • not only in [INAUDIBLE] conditions but also in

  • [INAUDIBLE], for example.

  • I'm going to talk about some competitions we've won using

  • deep learning.

  • So, last year we won sort of a transfer learning competition,

  • where you were trying to take the representations learned

  • from some data, and apply them on other data that relates to

  • similar but different tasks.

  • And so there was one competition where the results

  • were announced at ICML 2011--

  • [INAUDIBLE]

  • [? 2011 ?] and another one at NIPS 2011.

  • So this is less than a year ago.

  • And what we see in those pictures is how the

  • [INAUDIBLE] improves with more layers.

  • More precisely, each of these graphs has on the

  • x-axis the log of the number of labeled examples used for

  • training the machine.

  • And the y-axis is [INAUDIBLE]

  • essentially, so you want this to be [? high. ?]

  • And for this task, as you add more levels of representation,

  • what happens is you especially get better in the case where

  • you have very few labeled examples-- the thing I was

  • talking about that humans can do so well--

  • generalize from very few examples.

  • Because they've learned the representation earlier on

  • using lots of other data.

  • One of the learning algorithms that came out of my lab that

  • has been used for this is called the denoising

  • auto-encoder.

  • And what it does-- in principle, it's pretty simple.

  • And to learn representation, you take each input example

  • and you corrupt it by, say, setting some of the inputs to

  • zero [INAUDIBLE].

  • And then you learn a representation so that you can

  • reconstruct the input.

  • But you want to reconstruct the uncorrupted input-- the clean

  • input-- that's why it's called denoising.

  • And then you try to make this as close

  • as possible to it--

  • I mean, as close as possible to the raw,

  • uncorrupted input.

  • And we can show this essentially models the density

  • of the [INAUDIBLE] distribution.

  • And you can learn these representations and stack them

  • on top of each other.
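
The recipe he describes (corrupt the input, encode it, reconstruct the clean input) can be sketched in a few lines. This is a minimal tied-weight denoising auto-encoder in NumPy; the layer sizes, learning rate, and corruption level are arbitrary illustrative choices, not those used in the actual systems.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden = 20, 10
W = 0.1 * rng.normal(size=(n_in, n_hidden))  # encoder weights (decoder uses W.T)
b = np.zeros(n_hidden)                       # encoder bias
c = np.zeros(n_in)                           # decoder bias
lr, corruption = 0.1, 0.3

def train_step(x):
    global W, b, c
    # 1. Corrupt: randomly set a fraction of the inputs to zero.
    x_tilde = x * (rng.random(x.shape) > corruption)
    # 2. Encode the corrupted input, then decode back to input space.
    h = sigmoid(x_tilde @ W + b)
    x_hat = sigmoid(h @ W.T + c)
    # 3. The error compares the reconstruction to the CLEAN input.
    delta_out = (x_hat - x) * x_hat * (1 - x_hat)
    delta_hid = (delta_out @ W) * h * (1 - h)
    # 4. Gradient step on the tied weights and the biases.
    W -= lr * (np.outer(x_tilde, delta_hid) + np.outer(delta_out, h))
    b -= lr * delta_hid
    c -= lr * delta_out
    return np.mean((x_hat - x) ** 2)

x = (rng.random(n_in) > 0.5).astype(float)   # one toy binary example
for _ in range(200):
    loss = train_step(x)
print("reconstruction error:", loss)
```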

  • How am I doing with time?

  • MALE SPEAKER: 6:19.

  • YOSHUA BENGIO: Huh?

  • MALE SPEAKER: 6:19.

  • YOSHUA BENGIO: I have until [? when? ?]

  • MALE SPEAKER: [? Tomorrow ?]

  • [? morning. ?]

  • MALE SPEAKER: As long as [INAUDIBLE].

  • MALE SPEAKER: Just keep going.

  • YOSHUA BENGIO: OK.

  • [INAUDIBLE].

  • [INTERPOSING VOICES]

  • [LAUGHING]

  • YOSHUA BENGIO: OK, so I [INAUDIBLE] here a connection

  • between those denoising auto-encoders and the manifold

  • learning idea that I was mentioning earlier.

  • So how do these algorithms discover the manifolds-- the

  • regions where the configurations of the

  • variables are plausible-- where the distribution

  • concentrates.

  • So, we're back on the same picture as before.

  • So these are our examples.

  • And what we're trying to do is to learn a representation.

  • So mapping from the input space [INAUDIBLE] here that we

  • [INAUDIBLE] to a new space, such that we can essentially

  • recover the input-- in other words, we don't lose

  • information.

  • But at the same time because of the denoising part,

  • actually, you can [? show that ?] what this is

  • trying to do is throw away all the information.

  • So it seems crazy: you want to keep all the

  • information, but you also want to throw away all the

  • information.

  • But there's a catch.

  • Here, you want to only be able to

  • reconstruct these examples--

  • not necessarily any configuration of inputs.

  • So you're trying to find the function which will preserve

  • the information for these guys.

  • In other words, it's able to reconstruct them--

  • it's like the identity function

  • when it's applied on these guys.

  • But when you apply it in other places, it's allowed to do

  • anything it wants.

  • And it's also learning this [? new ?] function.

  • So in order to do that, let's see what happens.

  • Let's consider a particular point here--

  • particular example.

  • It needs to distinguish this one from its neighbor.

  • In the representation, [INAUDIBLE].

  • The representation you learn from that guy has to be

  • different enough from that guy that we can actually recover

  • and distinguish this one from this one.

  • So we can learn an inverse mapping, an approximate

  • inverse mapping, from the representation.

  • So that means you have to have a representation which is

  • sensitive to changes in that direction.

  • So when I move slightly from here to here, the

  • representation has to change slightly as well.

  • On the other hand, if I move in this direction, then the

  • representation doesn't need to capture that.

  • It could be constant as I move in that direction.

  • In fact, it wants to be constant in all directions.

  • But what's going to happen is it's going to be constant in

  • all directions except directions that it actually

  • needs to reconstruct the data and in this way, recover the

  • directions that are the derivatives of this

  • representation function.

  • And you recover the directions of the manifold-- the

  • directions where if I move in this direction, I still stay

  • in regions of high

  • probability.

  • That's what the manifold really means.

  • So we can get rid of this direction.

  • And recently, we came up with an algorithm that you can use

  • to sample from the model.

  • So if you have an understanding of the manifold

  • as something that tells you at each point, these are the

  • directions you're allowed to move--

  • so as to stay in high probability regions.

  • So these are the directions that keep you [? tangent ?] to

  • the manifold, then basically, the algorithm goes, well, we

  • are at a point.

  • We move in the directions that our algorithm discovered to be

  • good directions of change-- plausible

  • directions of change.

  • And that might correspond to something like taking an image

  • and translating it or [? updating it ?] or doing

  • something like removing part of an image.

  • And then projecting back towards the manifold-- it

  • turns out that the reconstruction

  • function does that.

  • And then iterating that random walk

  • to get samples from the model.

  • And we apply this to modeling faces and digits.

  • Now, let's come back to this question of what is a good

  • representation?

  • People in computer vision have used this

  • term invariance a lot.

  • And it's a word that's used a lot when

  • you handcraft features.

  • So, remember, at the beginning, I said the way most

  • of machine learning is applied is you take your raw data, and

  • you handcraft features based on your knowledge of what

  • matters and what doesn't matter.

  • For example, if your input is images, you'd like to design

  • features that are going to be insensitive to

  • translations of your input.

  • Because typically, the category you're trying to

  • detect should not depend on a small translation.

  • So this is the idea of invariant features.

  • But if we want to do unsupervised learning, where

  • no one tells us ahead of time what matters and what doesn't

  • matter-- what the task is going to be, then how do we

  • know which invariance matters?

  • For example, let's say we're doing speech recognition.

  • Well, if you're doing speech recognition, then you want to

  • be invariant to who the speaker is.

  • And you want to be invariant to what kind of microphone it

  • is and what's the volume of the sound.

  • But if you're doing speaker identification, and you want

  • to be invariant to what the person says and you want to be

  • very sensitive to the identity of the person.

  • But if someone gives you speech.

  • And you don't know if it's going to be used for

  • recognition of words or for recognizing people, what

  • should you do?

  • Well, what you should be doing is learning

  • to disentangle factors--

  • basically, discovering that in speech, the things that matter

  • are the [INAUDIBLE], the person, the

  • microphone, and so on.

  • These are the factors that you'd like to discover

  • automatically.

  • And if you're able to do that, then my claim is you can

  • essentially get around the curse of dimensionality.

  • You can solve very hard problems.

  • There's something funny that happens with the deep learning

  • algorithms I was talking about earlier, which is that if you

  • train these representations from purely unsupervised

  • learning, you discover that the features-- the

  • representation that they find have some form of

  • disentanglement--

  • that some of the units in the [INAUDIBLE]

  • are very sensitive to some of the underlying factors.

  • And they're very sensitive to one factor and very

  • insensitive to other factors.

  • So this is what disentangling is about.

  • But no one told these algorithms what those factors

  • would be in the first place.

  • So something good is happening.

  • But we don't really understand why.

  • And we'd like to understand why.

  • One of the things that you see in many of these algorithms is

  • the idea of so-called sparse representations.

  • So what is that?

  • Well, up to now, I've talked about representations as just

  • a bunch of numbers that we associate to an input.

  • But one thing we can do is learn representations

  • [? that have ?]

  • [? the property-- ?]

  • that many of those numbers happen to be

  • zero or some constant--

  • [? other value. ?]

  • But zero is very convenient.

  • And it turns out, when you do that, it helps a lot, at least

  • for some problems.

  • So that's interesting.

  • And I conjecture that it helps us disentangle the underlying

  • factors in the problems where basically--

  • for any example, there are only a few concepts and

  • factors that matter.

  • So in the scene that I see right now that comes to my

  • eyes, of all the concepts that my brain knows about, only a

  • few are relevant to this scene.

  • And it's true of almost any input

  • that comes to my sensors.

  • So it makes sense to have representations that have this

  • property as well-- that even though we have a large number

  • of possible features, most of them are sort of not

  • applicable to the current situation.

  • Not applicable, in this case, zero.

  • So just by forcing many of these features to output not

  • applicable, somehow we're getting better

  • representations.

  • This has been used in a number of papers.

  • And we've used it with so-called rectifier neural

  • networks, in which the units compute a function like

  • this on top of the usual linear

  • transformation they perform.

  • And the result is that this function outputs exactly

  • 0 for negative inputs.

  • So when x here is some weighted sum from the previous

  • layer, what happens is either the output is a positive real

  • number or the output is 0.

  • So let's say the input was a sort of random centered around

  • 0, then half of the time, those features would output 0.

  • And if you just learn to shift this a little bit to the left,

  • then you know--

  • 80% of the time or 95% of the time, the output will be 0.

  • So it's very easy to get sparsity with these kind of

  • [INAUDIBLE].
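
A tiny numerical check of that claim, assuming pre-activations roughly centered at zero (the bias value is just an example):

```python
import numpy as np

rng = np.random.default_rng(0)

def rectifier(z):
    """The rectifier unit: pass the weighted sum through if positive, else output 0."""
    return np.maximum(0.0, z)

# Pre-activations roughly centered at zero: about half the outputs are 0.
z = rng.normal(size=100_000)
print("fraction of zeros:", np.mean(rectifier(z) == 0.0))        # ~0.50

# Shifting the same pre-activations left (a more negative bias) makes
# the representation much sparser, as described above.
print("with bias -1.5:  ", np.mean(rectifier(z - 1.5) == 0.0))   # ~0.93
```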

  • It turns out that these [INAUDIBLE] are sufficient to

  • learn very complicated things.

  • And that was used in particular in a really

  • outstanding system built by Alex Krizhevsky and

  • Ilya Sutskever

  • with Geoff Hinton in Toronto recently where they obtained

  • amazing results on one of the benchmarks that computer

  • vision people really care about-- ImageNet--

  • with 1,000 classes.

  • So this contains millions of images taken from Google Image

  • search and 1,000 classes that you're trying to classify.

  • So these are images like this.

  • And there are 1,000 different categories you want to detect.

  • And this shows some of the outputs of this model.

  • [INAUDIBLE] obviously doing well.

  • And they managed to bring the state-of-the-art from making

  • small incremental changes from say 27% to 26% down to 17% on

  • this particular benchmark.

  • That's pretty amazing.

  • And one of the tricks they used--

  • I've been doing publicity for--

  • is called dropouts.

  • And Geoff Hinton is speaking a lot about this this year.

  • Next year will be something else.

  • And it's a very nice trick.

  • Basically, the idea is add some kind of randomness in the

  • typical neurons we use.

  • So you'd think that randomness hurts, right?

  • So if we learn a function like, you know-- say, thinking

  • about the brain doing something.

  • If you had noise in the computations of the brain,

  • you'd think it hurts.

  • But actually, when you do it during training, it helps.

  • And it helps for reasons that are yet to be completely

  • understood.

  • But the theory is it prevents the features you learn from

  • depending too much on the presence of the others.

  • So, half of the features will be turned off by this trick.

  • So the idea is you take the output of a neuron, and you

  • multiply it by 1 or 0, with

  • probability one-half.

  • So you turn off half of the features at random.

  • We do that for all the layers.

  • And at test time, you don't do

  • this kind of thing.

  • You just multiply it by the probability.

  • So it averages to the same thing.

  • But what happens is that during training, the features

  • learn to be more robust and more independent of each other

  • and collaborate in a less fragile way.

  • This is actually similar to the denoising auto-encoder I

  • was talking about earlier where we introduce corruption

  • noise in the input.

  • But here, you do it at every layer.

  • And somehow, this very simple trick helps

  • a lot in many contexts.
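
A minimal sketch of the dropout mechanics he describes, with the usual 0.5 drop probability; the test-time rescaling here is the simple "multiply by the keep probability" variant.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, drop_prob=0.5, training=True):
    """During training, multiply each unit's output by 0 or 1 at random,
    turning off roughly half of the features. At test time nothing is
    dropped; outputs are scaled by the keep probability so their
    expected value matches what was seen during training."""
    if training:
        return h * (rng.random(h.shape) >= drop_prob)
    return h * (1.0 - drop_prob)

h = np.ones(10)                       # a layer's activations
print(dropout(h, training=True))      # roughly half the units zeroed
print(dropout(h, training=False))     # every unit scaled by 0.5
```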

  • So they've tested it on different benchmarks.

  • These are three image data sets and also in speech.

  • And in all cases, they've seen improvements.

  • Let's get back to the representation learning

  • algorithms.

  • Many of them are based on learning one layer of

  • representation at a time.

  • And one of the algorithms that has been very [? practical ?]

  • for doing that is called a Restricted

  • Boltzmann Machine, or RBM.

  • And as a probability model, it's formalized this way.

  • Basically, we're trying to model the distribution

  • of the vector--

  • x-- which is a vector of bits

  • typically.

  • But it could be real numbers.

  • And we introduce a vector of bits,

  • h.

  • And we consider the joint distribution of these vectors,

  • given by this formula.

  • We're trying to find the parameters in the formula--

  • the b, the c, and the w.

  • So that P of x is as large as possible.
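
For reference, the formula he is pointing to is, in the standard binary RBM formulation (reconstructed here from the textbook form rather than read off the slide):

```latex
E(x, h) = -\,b^\top x \;-\; c^\top h \;-\; h^\top W x,
\qquad
P(x, h) = \frac{e^{-E(x,h)}}{Z},
\qquad
P(x) = \sum_h P(x, h),
```

where Z is the normalizing constant, and training adjusts b, c, and W so that P(x) is large on the training examples.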

  • It turns out that in this model that's been very popular for

  • deep learning, you need to sample from the model.

  • In other words, the model represents a

  • distribution.

  • And you'd like to generate examples according to what the

  • model thinks is plausible.

  • And in principle, there are ways to do that.

  • You can do things like Gibbs sampling.
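
A minimal sketch of block Gibbs sampling for the binary RBM above, assuming the standard conditional factorization; the parameters below are random placeholders just so the chain runs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_chain(W, b, c, n_steps=1000):
    """Alternately sample the hidden bits given the visible ones and the
    visible bits given the hidden ones (the same b, c, W as in the energy
    function). Each step is a small, local move in configuration space."""
    x = (rng.random(len(b)) > 0.5).astype(float)   # arbitrary starting point
    samples = []
    for _ in range(n_steps):
        h = (rng.random(len(c)) < sigmoid(c + x @ W)).astype(float)  # P(h=1 | x)
        x = (rng.random(len(b)) < sigmoid(b + W @ h)).astype(float)  # P(x=1 | h)
        samples.append(x.copy())
    return np.array(samples)

# Tiny random RBM, just to run the chain.
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.5, size=(n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
print(gibbs_chain(W, b, c, n_steps=5))
```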

  • However, what we and others have found is that these

  • sampling algorithms based on Markov

  • chain Monte Carlo methods have some quirks.

  • They don't do exactly what we'd like.

  • In particular, we say they don't mix well.

  • So what does that mean?

  • For example, if you start a chain of samples--

  • so you're going to create a sequence of samples by making

  • small changes-- that's what Monte Carlo Markov chain

  • methods do--

  • well, it turns out that you get chains like this where it

  • stays around the same kind of examples.

  • So it doesn't move to a new category.

  • So you'd like your sample or your [INAUDIBLE] algorithm to

  • be able to visit all the plausible configurations and

  • be able to jump from one region in configuration space

  • to another one, or at least have a chance to visit all the

  • places that matter.

  • But there's a reason why this is happening.

  • And I'm going to try to explain it in a picture.

  • So, first of all, as I mentioned, MCMC methods move

  • in configuration space by making small steps.

  • You start from a configuration of the examples.

  • Let's say where I'm standing is a configuration of x,y

  • coordinates.

  • And I'm going to make small steps, such that if I'm in a

  • configuration of high probability, I'm going to move

  • to another high probability configuration.

  • And if I'm in a low probability configuration, I'm

  • going to tend to move to a neighboring high probability

  • configuration.

  • In this way, you stay in sort of high probability regions.

  • And you, in principle, can visit the whole distribution.

  • But you can see there's a problem.

  • If this white thing here is highly probable,

  • and the black thing there is highly probable, and the gray

  • stuff in the middle is very implausible, how could I

  • possibly make small moves to go from here to here from the

  • white to the black?

  • So this is illustrated in a picture like this.

  • In this case, this is the density.

  • OK-- so the input is representing different

  • configurations of the variables of interest.

  • And this is what the model thinks the

  • distribution should be.

  • So it gives high probability some places.

  • So these are what we call modes.

  • It's a region that has a peak.

  • And this is another mode.

  • So this is [INAUDIBLE] two modes.

  • The question is can we go from mode to mode and make sure to

  • visit all the modes?

  • And the MCMC is making small steps.

  • Now, [INAUDIBLE]

  • is the MCMC can go through these [INAUDIBLE] regions.

  • If they have enough probability--

  • it can move around here and then quickly go through these

  • and do a lot of steps here and go back in this way-- and sample

  • all the configurations with [INAUDIBLE]

  • probability.

  • The problem is--

  • remember, I said in the beginning, that there's this

  • geometry where the [INAUDIBLE]

  • problems have this property--

  • that the things we care about-- the images occupy a

  • very small volume in the space of configurations of pixels.

  • So the right distribution that we're trying to learn is one

  • that has very big peaks where there's a lot of probability.

  • And in most other places, the probability will be tiny,

  • tiny, tiny--

  • exponentially small.

  • So, we're trying to make moves between these modes.

  • But now these modes are separated

  • by vast empty spaces--

  • deserts of probability where it's impossible to cross

  • unless you make huge jumps.

  • So that's a really big problem.

  • Because when we consider algorithms like the RBM,

  • what's going on is for learning, we need to sample

  • from the model.

  • Initially, when the model starts learning, it says I

  • don't know anything.

  • I'm assigning a kind of uniform [? probability ?] to

  • everything.

  • So the model thinks everything is uniform-- the probability

  • is the same for everything.

  • So we seem to move everywhere.

  • And as it keeps learning, it starts developing these

  • peaks-- these modes.

  • But still there's a way to go from mode to mode and go

  • through reasonably probable configurations.

  • As learning becomes more advanced, you

  • see these peaks emerge.

  • And now, it becomes impossible to cross.

  • Unfortunately, we need the sampling to learn these

  • algorithms.

  • And so there's a chicken and egg problem.

  • We need the sampling.

  • But if the sampling doesn't work well, the learning

  • doesn't work well.

  • And so we can't make progress.

  • So the one thing I wanted to talk about is a direction of

  • solution for this problem that [? we are ?] starting to

  • explore in my lab.

  • And it involves exploiting--

  • guess what--

  • deep representations.

  • The idea is instead of doing these steps in the original

  • space of the inputs where we observe things, if we did the

  • MCMC in abstract, high level representation, maybe things

  • would be easier.

  • So let's consider, for example, something like

  • images of digits.

  • If we had a machine that had discovered that the factors

  • that matter for these images is something like, OK, is the

  • background black and foreground

  • white or vice versa.

  • So this is one [? bit ?] that says flip black and white.

  • And is the category zero, one, two, three, so that's just 10

  • [? bits ?] that tell us what the category is.

  • And what's the position of the digit in the image.

  • So these are high level factors that you could imagine

  • being learned or discovered.

  • And if it would discover these things, then if you represent

  • the image in that space, the MCMC would be much easier.

  • In particular, you could go from, say, one of these guys

  • to these guys directly simply because maybe these are the

  • zeros and these are the threes.

  • And there's one bit that allows you to flip-- or two

  • bits that allow you to flip from zero to three.

  • So in the space where my representation has

  • a bit for zero and a bit for three--

  • I just need to flip two bits.

  • And that's easy.

  • It's a small move in that space.

  • In a space of abstract representations, it's easy to

  • generate data whereas the original space it's difficult.

  • One way to see this visually is to interpolate between

  • examples at different levels of representation.

  • So this is what we've done.

  • So if you look in the pixel space, and you interpolate

  • between this nine-- this picture of a nine-- and

  • this picture of a three-- doing

  • linear interpolation, what you see is between the

  • image of a nine and the image of a three, you have to go in

  • between through images that don't look like anything--

  • I mean, don't look like a digit-- the things that the

  • model has seen-- and so the MCMC--

  • as it walks, says--

  • oh, that's a plausible thing.

  • Oh, no, this is not very plausible.

  • This is worse.

  • I'm coming back.

  • So it's never going to go on the other side of this desert

  • of probability.

  • So this is a one-dimensional illustration of the example I

  • was trying to explain.

  • Now, what you see with the other two lines is the same

  • thing but at different levels of representation that have

  • been learned using unsupervised learning.

  • And what you see is that it has learned a representation

  • which has kind of skewed the space so that somehow, I can

  • make lots of small changes and stay in three.

  • And suddenly, just a few pixels flip and it becomes a

  • nine magically.

  • And so I don't have to stay very long in the low

  • probability region.

  • In fact, it's not even so implausible.

  • And you can move--

  • all of these moves are rather plausible.

  • So you can smoothly move from mode to mode without having to

  • go through these low probability regions.

  • So it's not like it actually discovered these actual bits.

  • But it's discovered something that makes the

  • job of sampling easier.

  • And we've done experiments to validate that.
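
The difference between the two interpolations can be sketched as follows. The `encode`/`decode` functions below are trivial placeholders; in the experiment he describes they are the learned deep mappings (e.g. stacked denoising auto-encoders), and that is where the benefit comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(a, b, n_points=9):
    """Linearly interpolate between two vectors a and b."""
    alphas = np.linspace(0.0, 1.0, n_points)
    return np.array([(1 - alpha) * a + alpha * b for alpha in alphas])

# Stand-ins: two 28x28 "digit" images and placeholder encode/decode maps.
image_of_nine = rng.random((28, 28))
image_of_three = rng.random((28, 28))
encode = lambda img: img.ravel()        # placeholder for the learned encoder
decode = lambda z: z.reshape(28, 28)    # placeholder for the learned decoder

# Pixel-space path: intermediate images are blends of a nine and a three,
# which look like no digit at all (the low-probability desert).
pixel_path = interpolate(image_of_nine.ravel(), image_of_three.ravel())

# Representation-space path: interpolate the codes, then decode each point.
# With a good learned representation, the intermediate points tend to stay
# on the manifold of plausible digits.
code_path = interpolate(encode(image_of_nine), encode(image_of_three))
decoded_path = np.array([decode(z) for z in code_path])
print(pixel_path.shape, decoded_path.shape)
```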

  • So the general idea is instead of sampling in the original

  • space, we're going to learn these representations and then

  • do our sampling which iterates between representations in

  • that high level space.

  • And then once we've found something in that abstract

  • representation, because we have inverse mappings, we can

  • map back to the input space and get, say, the digit we

  • care about or the face we care about.

  • So we've applied this, for example, with face images.

  • And what we find is that, for example, the red here uses a

  • deep representation.

  • And the blue here uses a single layer RBM.

  • And what we find is that for the same number of steps it

  • can visit more modes-- more classes--

  • by adding a deeper representation.

  • So I'm almost done.

  • What I would like you to keep in mind is that machine

  • learning involves really interesting [INAUDIBLE]

  • challenges that have a geometric nature in

  • particular.

  • And in order to face those challenges, something we found

  • very useful is to allow machines to look for

  • abstraction-- to look for high level

  • representations of the data.

  • And I think that we've only scratched the

  • surface of this idea--

  • that the algorithms we have now discover still rather low

  • level abstractions.

  • There's a lot more that could be done if we were able to

  • discover an even higher level of abstractions.

  • Ideally, what the high level abstractions do is

  • disentangle--

  • separate out--

  • the different underlying factors that explain the

  • data-- the factors we don't know but we'd like the machine

  • to discover.

  • Of course, if we know the factors, we can somehow cheat

  • and give back to the machine by telling it, here are some

  • random variables that we know about.

  • And here are some values of these variables in such and

  • such setting.

  • But you're not going to be able to do that for

  • everything.

  • We need machines that can make sense of the world by

  • themselves to some extent.

  • So we've shown that more abstract representations give

  • rise to successful transfers-- so being able to generalize

  • new domains, new languages, new classes.

  • And one [INAUDIBLE]

  • [? computation ?] is using these tricks.

  • I'm done.

  • Thank you very much.

  • [APPLAUSE]

  • YOSHUA BENGIO: Before I take questions, I would like to

  • thank members of my team, so [INAUDIBLE].

  • [INAUDIBLE] of course very fortunate to have for my work.

  • And I'm open to questions.

  • Yes?

  • There's a microphone.

  • AUDIENCE: Hi.

  • So the problem you mentioned with the Gibbs sampling--

  • isn't that easily solved by [INAUDIBLE]?

  • YOSHUA BENGIO: No.

  • We've tried that.

  • And [INAUDIBLE].

  • AUDIENCE: [INAUDIBLE]

  • be maybe [INAUDIBLE]--

  • YOSHUA BENGIO: So the problem-- what happens is if

  • you [INAUDIBLE] restart, what typically can happen is your

  • walk is going to bring you to a few of the modes--

  • always the same ones.

  • And so you're not going to visit everything.

  • AUDIENCE: Well, why would it bring you

  • always to the same mode?

  • YOSHUA BENGIO: Because it's like a [? dynamical ?] system.

  • So somehow most routes go to Rome

  • for some reason.

  • [INAUDIBLE].

  • AUDIENCE: Well, if you did say

  • Paris, you'll go to Paris.

  • YOSHUA BENGIO: Most routes go to a few big cities.

  • That's what happens in these algorithms.

  • We obviously tried this because it's-- you think well,

  • that should work.

  • Uh--

  • and there's a question right here.

  • AUDIENCE: So I was just wondering about

  • sampling from the model.

  • I thought that was interesting the way you had different

  • levels of abstraction.

  • You could sort of bridge that space.

  • So you're able to visit more classes.

  • But does that make the sort of distinction

  • between those classes?

  • It seems you're bringing those classes closer together.

  • So in terms of [INAUDIBLE]--

  • YOSHUA BENGIO: We had trouble understanding what

  • was going on there.

  • Because you'd think if you're--

  • [INAUDIBLE]--

  • think if somehow we made the threes and the nines closer,

  • it should now be harder to discriminate between them.

  • Whereas before, we had this big empty region where we

  • could put our separator.

  • So I don't have a complete answer for this.

  • But it's because we are working with these high

  • dimensional spaces that things are not necessarily as our

  • intuition would suggest.

  • What happens really is that the manifold--

  • so the regions where, say, the threes are is a really

  • complicated, curvy surface in high dimensional space.

  • And so is the nine.

  • And in the original space, these curvy spaces are

  • intertwined in complicated ways, which means the machinery has

  • a hard time separating between nines and threes, even though

  • there's lots of spacing between nines and threes.

  • It's not like you have nines here and threes here.

  • It's a complicated thing.

  • And what we do when we move to these high level

  • representations is flatten those surfaces.

  • And now if you interpolate between points, you're kind of

  • staying in high probability configurations.

  • So this flattening also means that it's

  • easier to separate them.

  • Even though they may be closer, it's easier because

  • a simpler

  • surface is enough.

  • This is a conjecture I'm making.

  • I haven't actually seen [INAUDIBLE]

  • spaces like this.

  • This is what my mind makes of it.

  • AUDIENCE: So my understanding is that you don't even include

  • [INAUDIBLE] algorithm of what the representations

  • [INAUDIBLE].

  • YOSHUA BENGIO: We would like to have algorithms that can

  • learn from as little [? cue ?] as possible.

  • AUDIENCE: So on some of your [? research ?]

  • [INAUDIBLE] my understanding is that you worked on

  • [INAUDIBLE]

  • [? commission. ?]

  • YOSHUA BENGIO: Yes.

  • AUDIENCE: Did you try to understand what were the

  • representations that the algorithms produced?

  • YOSHUA BENGIO: Yes.

  • AUDIENCE: And did you try from that to understand if there

  • were units where using [INAUDIBLE] similar

  • representations of your representations would be

  • different between these [INAUDIBLE]

  • units?

  • YOSHUA BENGIO: Well, first of all, we don't

  • know what humans use.

  • But we can guess, right?

  • So, for example, I showed early on the results of

  • modeling faces from Stanford--

  • and other people have found the same, but

  • I'll use their picture--

  • that these deep learning algorithms discover

  • representations that, at least for the first two levels, seem

  • to be similar to what we see in the visual cortex.

  • So if you look at V1, [INAUDIBLE] the first major area of the visual cortex where [INAUDIBLE] arrive, you actually see neurons that detect or are sensitive to exactly the same kinds of things.

  • And in the second layer-- and V2 is not a layer really, it's an area-- well, you don't see these things, but you see combinations of them.

  • And [INAUDIBLE] the same group actually compared what the neurons in the brain seem to like, according to neuroscientists, with what the machine learning models discover, and they found some similarities.

  • Other things you can do that I mentioned quickly was--

  • in some cases, we know what the factors are.

  • We know what humans will look for.

  • So you can just try to correlate the features that

  • have been learned with the factors you know humans think

  • are important.

  • So we've done that in the work here with [INAUDIBLE].

  • That's not the one I want to show you.

  • It is the right guys, but--

  • sorry.

  • Yes-- this one.

  • So for example, we've trained models on the problem of sentiment analysis, where you give them a sentence and you try to predict whether the person liked or didn't like something, like a book or a video or a [INAUDIBLE]-- something-- something you find on the web.

  • And what we found is that when we use purely unsupervised learning-- so it doesn't know that the job is sentiment analysis-- some of the features specialize on the sentiment: is this a more positive or a more negative kind of statement?

  • And some features specialize on the domain.

  • Because we train this across 25 different domains.

  • So some basically detect or are highly correlated with--

  • is this about books-- is this about food-- is this about

  • videos-- is this about music?

  • So these are underlying factors we know are

  • present in the data.

  • And we find [INAUDIBLE] features tend to specialize toward these things, much more than the original representation does.
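
One simple way to do the kind of check described here-- seeing whether individual learned features line up with known factors such as sentiment or domain-- is to correlate each feature's activations with the factor labels. The sketch below is my illustration with synthetic data; all names and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_features, n_domains = 1000, 50, 4

H = rng.random((n_examples, n_features))               # stand-in for learned feature activations
sentiment = rng.integers(0, 2, size=n_examples)        # stand-in for known sentiment labels
domain = rng.integers(0, n_domains, size=n_examples)   # stand-in for known domain labels

def corr_with(H, labels):
    """Absolute Pearson correlation of each feature column with a label vector."""
    Hc = H - H.mean(axis=0)
    lc = labels - labels.mean()
    num = Hc.T @ lc
    den = np.sqrt((Hc ** 2).sum(axis=0) * (lc ** 2).sum())
    return np.abs(num / den)

sent_corr = corr_with(H, sentiment.astype(float))
domain_onehot = np.eye(n_domains)[domain]
dom_corr = np.max([corr_with(H, domain_onehot[:, d]) for d in range(n_domains)], axis=0)

# Features that correlate strongly with sentiment but not with domain (or the
# other way around) look "specialized" toward one underlying factor.
print("most sentiment-correlated features:", np.argsort(-sent_corr)[:5])
print("most domain-correlated features:   ", np.argsort(-dom_corr)[:5])
```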

  • AUDIENCE: [SPEAKING FRENCH]

  • YOSHUA BENGIO: So I'm going to summarize your question in

  • English and answer in English [INAUDIBLE].

  • So my understanding of the question is, could we not just use our prior knowledge to build in the properties and structure [INAUDIBLE] representations and so on?

  • And my answer to this is of course because

  • this is what we do.

  • And this is what everybody in machine learning does.

  • And it's especially true in computer vision where we use a

  • lot of prior knowledge of our understanding

  • of how vision works.

  • But my belief is that it's also interesting to see

  • whether machines could discover these things by

  • themselves.

  • Because if you have algorithms that can do that, then we

  • can-- we can still use our prior knowledge.

  • But we can discover other things that we didn't know or that we were not able to formalize.

  • And with these algorithms, actually, it's not too

  • difficult to put in prior knowledge.

  • You can add [INAUDIBLE]

  • variables that correspond to the things you know matter and

  • you can put extra terms in the [INAUDIBLE] model that

  • correspond to your prior knowledge.

  • You can do all these things.

  • And some people do that.

  • If I work with an industrial partner, I'm going to use all

  • the knowledge that I have because I want to have

  • something that works in the next six months.

  • But if I want to solve AI, I think it's worth it to explore

  • more general purpose methods.

  • It could be combined with the prior knowledge.

  • But to make the task of discovering these general

  • purpose methods easier and focus on this aspect, I find

  • it interesting to actually avoid lots of very specific

  • human knowledge.

  • Although, different researchers look at this

  • differently.

  • AUDIENCE: [SPEAKING FRENCH]

  • YOSHUA BENGIO: You can do all kinds of things.

  • There's lots of freedom to combine our knowledge and

  • there are different ways.

  • And it's a subject of many different papers--

  • to combine our knowledge with learning.

  • So we can do that.

  • Sometimes, though, when you put too much prior knowledge,

  • it hurts the [INAUDIBLE].

  • AUDIENCE: So how much does the network [INAUDIBLE] matter?

  • And isn't that kind of the biological priors you were

  • talking about in the very early slide?

  • And isn't that the new kind of feature

  • engineering of deep learning--

  • figuring out the right [INAUDIBLE].

  • YOSHUA BENGIO: So the answer is no to

  • both of these questions.

  • [INAUDIBLE].

  • For example, the size of those layers doesn't matter much.

  • It matters in the sense that they have to be big enough to

  • capture the data.

  • And regarding the biology-- what-- the--

  • AUDIENCE: Yeah, because I was thinking, right--

  • our brain comes wired in certain ways because we have

  • the visual cortex and [INAUDIBLE] that.

  • So--

  • YOSHUA BENGIO: There might be things we will learn that

  • we'll be able to exploit that might be generic

  • enough in the brain.

  • And we're on the lookout for these things.

  • MALE SPEAKER: We'll take two more questions.

  • AUDIENCE: So I'm interested in the results that [INAUDIBLE]

  • variations [INAUDIBLE].

  • So that reminds me of [INAUDIBLE]

  • we try to avoid [INAUDIBLE].

  • But [INAUDIBLE] come with a lot of [INAUDIBLE]

  • scheduling and such.

  • So [INAUDIBLE].

  • YOSHUA BENGIO: No--

  • I mean, you can see this trick is so simple-- like in [INAUDIBLE] one line.

  • So there's nothing complicated in this particular trick.

  • You just add noise in a very dumb way.

  • And somehow, it makes these networks more robust.

  • AUDIENCE: Well, [INAUDIBLE].

  • YOSHUA BENGIO: It's different from [INAUDIBLE].

  • It doesn't serve the same purpose.

  • AUDIENCE: You're also not reducing [INAUDIBLE]--

  • AUDIENCE: [INAUDIBLE].

  • YOSHUA BENGIO: Sorry?

  • AUDIENCE: When you're trying to [INAUDIBLE].

  • energy [INAUDIBLE].

  • YOSHUA BENGIO: Yeah-- so you're trying to minimize the

  • error under the [INAUDIBLE] created by this [INAUDIBLE].

  • So there is [INAUDIBLE].

  • AUDIENCE: And it's a very rough [? connection ?]

  • at this point.

  • But then for [INAUDIBLE], it's not just [? two ?]

  • random things.

  • You also have temperature decreasing and the end of the

  • [INAUDIBLE].

  • YOSHUA BENGIO: So, here, it's not happening in the space of

  • parameters that we are trying to optimize.

  • So in [INAUDIBLE] you're trying to do some optimization.

  • Here, the noise is used as a regularizer, meaning it's injecting sort of a prior that your neural net should be kind of robust to half of it becoming dead.

  • It's a different way of using [INAUDIBLE].
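
A minimal sketch of the kind of "dumb noise" trick being discussed-- in the spirit of dropout, where roughly half of the hidden units are randomly zeroed out during training so the network cannot rely on any single one of them. This is my illustration; the names, sizes, and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_layer(x, W, b, drop_prob=0.5, training=True):
    """One hidden layer whose units are randomly zeroed out during training."""
    h = np.maximum(0.0, x @ W + b)               # ordinary hidden activations
    if training:
        mask = rng.random(h.shape) >= drop_prob  # randomly "kill" about half the units
        h = h * mask / (1.0 - drop_prob)         # rescale so the expected activation is unchanged
    return h

# Toy usage with random data and untrained weights.
x = rng.random((8, 20))
W, b = rng.normal(scale=0.1, size=(20, 30)), np.zeros(30)
print(hidden_layer(x, W, b).shape)
```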

  • FEMALE SPEAKER: And the last question.

  • AUDIENCE: In the beginning, you were talking about priors

  • that humans have in their brain.

  • YOSHUA BENGIO: Absolutely.

  • AUDIENCE: OK-- so how do you exploit

  • that in your algorithm?

  • Or do you exploit that?

  • So it's a general question [INAUDIBLE].

  • It's not so precise.

  • YOSHUA BENGIO: Well, I could speak for hours about that.

  • AUDIENCE: Yeah, OK-- but how do you exploit that in your

  • algorithms?

  • Or what kind [INAUDIBLE]?

  • YOSHUA BENGIO: Each prior that we consider is basically

  • giving rise to a different answer to your question.

  • So in the case of the sparsity prior I was mentioning, it

  • comes about simply by adding a term in the [INAUDIBLE] adding

  • a prior explicitly.

  • That's how we do it.
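
As a concrete sketch of adding a sparsity prior as an extra term in the training criterion (my illustration, with made-up names; the real model and criterion would differ), an L1 penalty on the hidden activations is one common choice:

```python
import numpy as np

def sparse_autoencoder_loss(x, W, b, W_dec, b_dec, sparsity_weight=0.1):
    """Reconstruction error plus an L1 penalty on the hidden activations."""
    h = np.tanh(x @ W + b)                  # hidden representation
    x_hat = h @ W_dec + b_dec               # reconstruction of the input
    reconstruction = np.mean((x - x_hat) ** 2)
    sparsity_penalty = np.mean(np.abs(h))   # prior: most features should be near zero
    return reconstruction + sparsity_weight * sparsity_penalty

# Toy usage with random data and untrained weights.
rng = np.random.default_rng(0)
x = rng.random((32, 100))
W, b = rng.normal(scale=0.1, size=(100, 50)), np.zeros(50)
W_dec, b_dec = rng.normal(scale=0.1, size=(50, 100)), np.zeros(100)
print(sparse_autoencoder_loss(x, W, b, W_dec, b_dec))
```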

  • In the case of the prior that-- I mentioned the prior that the input distribution tells us something about the task.

  • And we get it by combining unsupervised learning and

  • supervised learning.

  • But in the case of the prior that there are different abstractions that matter-- different levels of abstraction in the world around us-- we get the prior by adding structure in the model that has these different levels of representation.

  • There are other priors that I didn't mention.

  • For example, one of the priors is [INAUDIBLE] studied as the constancy prior or the slowness prior-- that among the factors that matter, that explain the world around us, some of them change slowly over time.

  • The set of people in this room is not changing very

  • quickly right now.

  • It's a constant over time.

  • Eventually, it will change.

  • But there are properties of the world around us that remain the same for many time steps.

  • And this is a prior which you can also incorporate in your model by changing the training criterion to say something like: well, some of the features in your representation should stay the same from time step to time step [INAUDIBLE].

  • This is very easy to put in and has been done and

  • [INAUDIBLE].
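
A minimal sketch of how this slowness/constancy prior can be added as an extra term (my illustration; `lambda_slow` and the feature array are assumptions): penalize how much each feature changes from one time step to the next, on top of the main training criterion.

```python
import numpy as np

def slowness_penalty(features_over_time):
    """features_over_time: array of shape (time_steps, n_features).
    Penalizes large changes of each feature from one time step to the next."""
    diffs = np.diff(features_over_time, axis=0)
    return np.mean(diffs ** 2)

# Toy usage: a stand-in for feature activations over 10 consecutive time steps.
h_sequence = np.random.default_rng(0).random((10, 64))
# total_loss = main_criterion + lambda_slow * slowness_penalty(h_sequence)
print(slowness_penalty(h_sequence))
```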

  • So for each prior, you can think of a way to incorporate it.

  • Basically, changing the structure or changing the training criterion is usually the way we do it.

  • AUDIENCE: And which kind of prior do you

  • think we humans have?

  • I guess it's a very complex question.

  • [INTERPOSING VOICES].

  • YOSHUA BENGIO: Basically, the question you're asking is the

  • question I'm trying to answer.

  • So I have some guesses.

  • And I mentioned a few already.

  • And our research is basically about finding out what these

  • priors are that are generic and work for many tasks in the

  • world around us.

  • AUDIENCE: OK.

  • Thank you.

  • YOSHUA BENGIO: Welcome.

  • MALE SPEAKER: [INAUDIBLE] and ask one last question.

  • As some of our practitioners are not machine learning

  • experts, do you think it's worthwhile for us to learn a

  • bit about machine learning--

  • YOSHUA BENGIO: Absolutely.

  • [LAUGHTER]

  • MALE SPEAKER: What's the starting point where we can--

  • we have a small problem.

  • We want to spend one month working on this kind of thing.

  • Where would be the starting point for us?

  • YOSHUA BENGIO: You have one month?

  • [LAUGHING]

  • MALE SPEAKER: So is there a library or something

  • [INAUDIBLE]?

  • YOSHUA BENGIO: You should take Geoff Hinton's [INAUDIBLE]

  • course.

  • You can probably do it in one month because you're

  • [INAUDIBLE].

  • AUDIENCE: There was also another one by Andrew

  • [INAUDIBLE].

  • YOSHUA BENGIO: Yes.

  • Andrew [INAUDIBLE] has a very good course.

  • And there are more and more resources on the web to help

  • people get started.

  • There are libraries that people share, like the library

  • from my lab that you can use to get started quickly.

  • There are all kinds of resources like that.

  • AUDIENCE: [INAUDIBLE] to it on a standard computer, or you

  • need a cluster or--

  • YOSHUA BENGIO: It helps to have a cluster for training.

  • Once you have the trained model, usually you can run it on your laptop.

  • And the reason you need a cluster is that these algorithms have [INAUDIBLE] many knobs to set.

  • And you want to explore many configurations of these knobs.

  • But actually training one model can be done on a regular computer.

  • It's just that you want to try many

  • configurations of these knobs.
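
As a sketch of why a cluster helps (my illustration; `train_and_score` and the search space are placeholders): each setting of the knobs is one training run, and many such runs can be tried independently-- on a cluster, in parallel-- keeping the configuration that does best on validation data.

```python
import random

# Hypothetical "knobs" and their candidate values.
search_space = {
    "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4],
    "hidden_units": [256, 512, 1024],
    "num_layers": [1, 2, 3],
    "l1_penalty": [0.0, 1e-4, 1e-3],
}

def train_and_score(config):
    # Placeholder: train one model with this configuration and return its
    # validation score. On a cluster, each call would be a separate job.
    return random.random()

random.seed(0)
configs = [{k: random.choice(v) for k, v in search_space.items()} for _ in range(20)]
best = max(configs, key=train_and_score)
print("best configuration found:", best)
```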

  • MALE SPEAKER: Thank you very much, Yoshua.

  • YOSHUA BENGIO: You're welcome.

  • MALE SPEAKER: --for this interesting talk.

  • [APPLAUSE]

  • [MUSIC]
