>> So it's my pleasure to introduce to you Geoff Hinton, who is a pioneer in machine learning and neural nets, and more recently [INDISTINCT] architectures. And I think that's going to be the topic of today. So take it over. >> HINTON: Okay. So I gave a talk here a couple of years ago. The first 10 minutes or so will be an overview of what I said there, and then I'll talk about the new stuff. The new stuff consists of a better learning module. It allows you to learn all sorts of different things better, like learning how images transform, learning how people walk, and learning object recognition. So the basic learning module consists of some variables that represent things like pixels, and these will be binary variables for now, and some variables that are latent variables, which are also going to be binary. And there's bipartite connectivity, so the visible variables are connected to the hidden variables but not to each other. That makes it very easy, if I give you the states of the visible variables, to infer the states of the hidden variables: they're all independent given the visible variables, because it's an undirected graph. And the inference procedure just says the probability of turning on hidden unit "hj", given the visible vector "v", is the logistic function of the total input it gets from the visible units, so it's very simple for the hidden variables. Given the hidden variables, we can also infer the visible variables very simply. And if we put some weights on the connections and we want to know what this model believes, we can just go backwards and forwards, inferring all the hidden variables in parallel, then all the visible ones. Do that for a long time and you'll see examples of the kinds of things it likes to believe. And the aim of learning is going to be to get it to like to believe the kinds of things that actually happen. So this thing is governed by an energy function. Given the weights on the connections, the energy of a visible vector plus a hidden vector is the sum over all connections of minus the weight if both the visible and hidden units are active. So I pick each feature that's active, and for each active pixel it's connected to I add in the weight, and if it's a big positive weight, that's low energy, which is good. So it's a happy network. This has nice derivatives. If you differentiate it with respect to the weights, you get this product of the visible and hidden activity. And that derivative is going to show up a lot in the learning, because that derivative is how you change the energy of a combined configuration of visible and hidden units. The probability of a combined configuration, given the energy function, is e to the minus the energy of that combined configuration, normalized by the partition function. And if you want to know the probability of a particular visible vector, you have to sum over all the hidden vectors that might go with it, and that's the probability of the visible vector. If you want to change the weights to make this probability higher, you mainly need to lower the energies of combinations of this visible vector with the hidden vectors that would like to go with it, and raise the energies of all other combinations, so you decrease the competition. The correct maximum likelihood learning rule--that is, if I want to change the weights so as to increase the log probability that this network would generate the vector "v" when I let it just sort of fantasize about the things it likes to believe in--has a nice simple form.
It's just the difference of two correlations. So even though it depends on all the other weights, it shows up as this difference of correlations. And what you do is you take your data, you activate the hidden units--that is, you pick binary states for them--and then you reconstruct, activate, reconstruct, activate. So this is a Markov chain. You run it for a long time, so you've forgotten where you started. And then you measure the correlation at the end of the chain and subtract it from the correlation you measured on the data. And what you're really doing is saying, "By changing the weights in proportion to the correlation on the data, I'm lowering the energy of this visible vector with whatever hidden vector it chose. By doing the opposite at the end of the chain, I'm raising the energy of the things I fantasize." And so what I'm trying to do is believe in the data and not believe in what the model currently believes in. Eventually this correlation will be the same as that one, and at that point nothing will happen, because it already believes in the data. It turns out that you can get a much quicker learning algorithm where you just go up and down and [INDISTINCT] again, and you take this difference of correlations. Justifying that is hard, but the main justification is that it works and it's quick. The reason this module is interesting--the main reason it's interesting--is that you can stack them up. That is, for a complicated reason I'm not going to go into, it works very well to train the module, then take the activities of the feature detectors, treat them as if they were data, and train another module on top of that. So the first module is trying to model what's going on in the pixels by using these feature detectors. And the feature detectors will tend to be highly correlated. The second module is trying to model the correlations among the feature detectors. And you can guarantee that if you do that right, every time you go up a level, you get a better model of the data. Actually, you can only guarantee that the first time you go up a level. For further levels, all you can guarantee is that there's a bound on how good your model of the data is, and every time we add another level, that bound improves if we add it right. Having got this guarantee that something good is happening as we add more levels, we then violate all the conditions of the mathematics and just add more levels in a sort of [INDISTINCT] way, because we know good things are going to happen, and then we justify it by the fact that good things do happen. This allows us to learn many layers of feature detectors entirely unsupervised, just to model the structure of the data. Once we've done that--well, you can't get that accepted at a machine learning conference, because you have to do discrimination to be accepted at a machine learning conference. So once you've done that, you add some decision units to the top and you learn the connections between the top-layer features and the decision units discriminatively, and then, if you want, you can go back and fine-tune all of the connections using backpropagation. That overcomes the limitation of backpropagation, which is that there's not much information in the label and it can only learn on labeled data. These things can learn on large amounts of unlabeled data. After they've learned, you add these units at the top and backpropagate from the small amount of labeled data, and that's not designing the feature detectors anymore. As you probably know at Google, designing feature detectors is where the art is, and you'd like to design feature detectors based on what's in the data, not based on having to produce labeled data.
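(A minimal numpy sketch of the module and the quick learning rule just described--one contrastive-divergence step rather than the full Markov chain. The names `W`, `sigmoid`, `cd1_update` and the layer sizes are illustrative, not anything from the talk.)

```python
import numpy as np

rng = np.random.default_rng(0)
num_visible, num_hidden = 784, 500                           # illustrative sizes
W = 0.01 * rng.standard_normal((num_visible, num_hidden))    # one weight per visible-hidden pair
b_v, b_h = np.zeros(num_visible), np.zeros(num_hidden)       # biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h):
    # E(v, h) = -sum_ij v_i h_j w_ij - v.b_v - h.b_h : low energy = configurations it likes
    return -(v @ W @ h) - v @ b_v - h @ b_h

def infer_hidden(v):
    # p(h_j = 1 | v) is the logistic of the total input from the visibles; all independent.
    p = sigmoid(v @ W + b_h)
    return p, (rng.random(p.shape) < p).astype(float)

def infer_visible(h):
    # Same thing in the other direction, because the graph is bipartite and undirected.
    p = sigmoid(W @ h + b_v)
    return p, (rng.random(p.shape) < p).astype(float)

def cd1_update(v_data, lr=0.01):
    """Activate the hiddens on data, reconstruct, activate again, and change each
    weight by (correlation on data) - (correlation on reconstruction)."""
    global W, b_v, b_h
    p_h0, h0 = infer_hidden(v_data)
    _, v1 = infer_visible(h0)
    p_h1, _ = infer_hidden(v1)
    W += lr * (np.outer(v_data, p_h0) - np.outer(v1, p_h1))
    b_v += lr * (v_data - v1)
    b_h += lr * (p_h0 - p_h1)
```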
So the idea of backpropagation was: design your feature detectors so you're good at getting the right answer. The idea here is: design your feature detectors to be good at modeling whatever is going on in the data. Once you've done that, just very slightly fine-tune them so you're better at getting the right answer, but don't try to use the answer to design the feature detectors. And Yoshua Bengio's lab has done lots of work showing that this gives you better minima than just doing backpropagation, and what's more, minima in completely different parts of the space. So just to summarize this section--I think this is the most important slide in the talk, because it says what was wrong with mainstream machine learning up to a few years ago. What people in machine learning would try to do is learn the mapping from an image to a label. Now, that would be a fine thing to do if you believed that images and labels arose in the following way: there's stuff in the world, it gives rise to images, and then the images give rise to the labels; given the image, the labels don't depend on the stuff. But you don't really believe that. You'd only believe that if the label were something like the parity of the pixels in the image. What you really believe is that the stuff gives rise to the images, and the label goes with the image because of the stuff, not because of the image. So there's a cow in a field and you say "cow." Now, if I just say "cow" to you, you don't know whether the cow is brown or black, or upright or dead, or far away. If I show you an image of the cow, you know all those things. So this is a very high bandwidth path and this is a very low bandwidth path, and the right way to associate labels with images is to first learn to invert the high bandwidth path. And we can presumably do that, because vision basically works: the first instant you look at something, you see things. And it's not like "it might be a cow, it might be an elephant, it might be an electric heater." Basically, you get it right nearly all the time. And so we can invert that pathway. Having learned to do that, we can then learn what things are called. You get the concept of a cow not from the name, but from seeing what's going on in the world; the label gets attached later. Now, I need to make one slight modification to the basic module, which is that I had binary units as the observables, and now we want linear units with Gaussian noise. So we just change the energy function. The energy now says: I've got a kind of parabolic containment here. Each of these linear visible units has a bias, which is like its mean, and it would like to sit there; moving away from that costs [INDISTINCT] energy. The parabola is the negative log of the Gaussian [INDISTINCT]. And then there's the input that comes from the hidden units, which is just vi hj wij, but the v's have to be scaled by the standard deviation of the Gaussian. If I differentiate that with respect to a visible activity, what I get is hj wij divided by sigma i, and that's like an energy gradient. And what the visible unit does when you reconstruct is it tries to compromise between wanting to sit around here and wanting to satisfy that energy gradient, so it goes to the place where the two gradients are equal and opposite; that's the most likely value, and then you [INDISTINCT] there. So with that small modification, we can now deal with real-valued data with binary latent variables, and we have an efficient learning algorithm that's an approximation to [INDISTINCT].
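(The same sketch with Gaussian visible units, following the energy function just described. `sigma` and the helper names are again illustrative; this reuses `W`, `b_v`, `b_h`, `sigmoid`, and `rng` from the sketch above.)

```python
sigma = np.ones(num_visible)    # per-unit standard deviations; often fixed at 1 on normalized data

def gaussian_energy(v, h):
    # Parabolic containment around each visible bias, plus the usual pairwise term
    # with every v_i scaled by its standard deviation.
    return (np.sum((v - b_v) ** 2 / (2.0 * sigma ** 2))
            - h @ b_h
            - (v / sigma) @ W @ h)

def infer_hidden_gaussian(v):
    p = sigmoid((v / sigma) @ W + b_h)
    return p, (rng.random(p.shape) < p).astype(float)

def reconstruct_visible(h):
    # The compromise point: where the pull of the parabola balances the top-down
    # gradient sum_j h_j w_ij / sigma_i, i.e. v_i = b_i + sigma_i * sum_j h_j w_ij.
    return b_v + sigma * (W @ h)
```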
And so we can apply it to something. There's a nice speech recognition task that's been well organized by the speech people. There's an old database called TIMIT, and it's got a very well-defined task for phone recognition, where what you have to do is this: you're given a short window of speech, and you have to predict the probability distribution over the various different phones for the central frame. Actually, each phone is modeled by a 3-state HMM--sort of beginning, middle, and end--so you have to predict, for each frame, whether it's the beginning, middle, or end of each of the possible phones; there are 183 of those things. If you give it a good distribution there, then the standard post-processing will give you back where the phone boundaries should be and what your phone error rate is. Some people use tri-phone models; we're using bi-phone models, which aren't quite as powerful. So now we can test how good we are by taking 11 frames of speech--it's 10 milliseconds per frame, but each frame is looking at about 25 milliseconds of speech--and predicting the phone at the middle frame. We use the standard speech representation, which is mel-cepstral coefficients. There are 13 of those, plus their differences and double differences, and we feed them into one of these deep nets. So here's your input: 11 frames and 39 coefficients. And then--I was away when the student did this, and he actually believed what I said, so he thought adding lots and lots of hidden units was a good idea; I'd said that too. He added lots of hidden units, all unsupervised, so all these green connections are learned without any use of the labels. He used a bottleneck there, so the number of red connections would be relatively small; those have to be learned using discriminative information. And then you backpropagate the correct answers through this whole net for about a day on a GPU board, or a month on a core, and it does very well. The best phone error rate we got was 23%. But the important thing is that whatever configuration you use--however many hidden layers, as long as there are plenty, whatever widths, and whether you use this bottleneck or not--it gets between 23% and 24%. So it's very robust to the exact details of how many layers there are and how wide they are. The best previous result on TIMIT for things that didn't use speaker adaptation is 24.4%, and that was from averaging together lots of models, so this is good. >> So each of these layers, that's four million weights? >> HINTON: Yup, four million weights. So we're training one, two, three, one, two, three--we're training, you know, about 20 million weights. Twenty million weights is about 2% of a cubic millimeter of cortex. I think of this as a tiny brain. That's probably all you need for [INDISTINCT] recognition. >> Why did they start with the differences and double differences of the MFCCs when you're going into a thing that could learn to do that itself if it wanted to? >> HINTON: That's a very good question, because you're anticipating where this is going. It's an extremely good question, because the reason they put in the differences and double differences is so they can model the data with a diagonal covariance matrix--a diagonal covariance model--and you can't model the fact that over time two things tend to be very much the same without modeling covariances, unless you actually put the differences into the data and model the differences directly.
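(For concreteness, a small sketch of assembling the kind of input just described: 11 frames of 39 coefficients each, 429 inputs per case. The array layout and the function name are assumptions, not from the talk.)

```python
def make_context_windows(frames, context=5):
    """frames is assumed to be (num_frames, 39): 13 mel-cepstral coefficients plus
    their differences and double differences. Each output row stacks a frame with
    its 5 left and 5 right neighbours: 11 x 39 = 429 inputs for the net."""
    n = frames.shape[0]
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i:i + 2 * context + 1].ravel() for i in range(n)])
```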
So it allows you to use a model that can't cope with covariances. Later on, we're going to show a model that can cope with covariances, and then we're going to do what someone here always said I should do, which is throw away the mel-cepstral representation and use a better representation of the speech. >> I said that? >> HINTON: Yes, you said that to me the last time [INDISTINCT]. >> Smart guy. >> HINTON: Okay, so the new idea is to use a better kind of module. This module already works pretty well, right? It does well at phone recognition, it does well at all sorts of things. But it can't model multiplicative interactions very well. It can model anything with enough training data, but it's not happy modeling multiplies, and you need multiplies all over the place. I'll show you a bunch of places where you need multiplies. Here's the main example of why you need multiplies. Suppose I want to generate, from a high-level description of an object--the name of the shape and its pose, size, position, orientation--the parts of the object, and I want them to be related correctly to each other. I could use a very accurate top-down model that, given the name of the shape and the pose, generates each piece in exactly the right position; that would require high bandwidth. Or I could be sloppy and say: I'm going to generate this side--or rather a distribution over where this side might be--and I'll generate corners and other sides, and they're all a bit sloppy. If I just picked one thing from each distribution, it wouldn't make a nice square. But I could also top-down specify how these things should be pieced together. In effect, I can specify a Markov random field: this is what goes with what. And then I can clean this up, knowing these distributions, and pick a square like that. Of course, I might sometimes pick a square that has a slightly different orientation or a slightly different size, but it'll be a nice clean square, because I know how the pieces go together. And so that's a much more powerful kind of generative model, and that's what we want to learn to do. So we're going to need hidden units up here to specify interactions between visible units down here, as opposed to just specifying inputs to visible units. There's an analogy for this, which is: if I'm an officer and there's a bunch of soldiers and I want them to stand in a square, I could get out my GPS and say, "Soldier number one, stand at these GPS coordinates. Soldier number two, stand at these GPS coordinates." Now, if I use enough digits, I'll get a nice neat rectangle. Or I could say, "Soldier number one, stand roughly around here. Soldier number two, hold your arm out and stand this distance from soldier number one." And that's a much better way to get a neat rectangle; it requires far less communication. So what you're doing is specifying roughly where people should stand and then how they should relate to each other. You specify the relations, not just where they should be. And that's what we'd like in a powerful [INDISTINCT] model. So we're going to aim to get units in one layer to say how units in the layer below should laterally interact when you generate. It's going to turn out you don't need to worry about these lateral interactions when you recognize; when you generate, you do. To do that, we're going to need things called third-order Boltzmann machines, which have three-way interactions.
So Terry Sejnowski pointed out a long time ago that where we have an energy function like this, where this is v and this is h and these are just binary variables, we could perfectly well write down an energy function like this with three things interacting, so then we have a three-way weight. And if you think about these three things, the state of k is acting like a switch. When k is on, you effectively have this weight between i and j; when k is off, this weight disappears. And it works every which way, because it's symmetric. So using an energy function like this, we can allow one thing to specify how two other things should interact. So each hidden unit can specify a whole Markov random field over the pixels if you want. But that begins to make you worry, because a Markov random field has a lot of parameters in it. And if you start counting: if you have N of these and N of those and N of those, you get N cubed of these parameters, which is rather a lot. If you're willing to use N cubed parameters, you can make networks that look like this. Suppose I have two images and I want to model how images transform over time, and suppose I'm just moving random dots around: I have a pattern of random dots and I translate it. Well, if I see that dot and I see that dot, that's some evidence for a particular translation. And so I put a big positive weight there--this triangle is meant to represent that big three-way weight. Then when this and this are on, they'll say it's very good to have this guy on; that would be a nice low-energy state. If I also see this pair of dots, I'll get more votes that this guy should be on, and I will turn this guy on. If, however, this pixel went to here, that'll vote for this other guy; and if this pixel also went to there, that votes for this other guy too. So these guys are going to represent coherent translations of the image, and it's going to be able to use these three-way weights to take two images and activate units that represent the coherent translation. It'll also be able to take the pre-image and the translation and compute which pixels should be on here. Now what we're going to do is take that basic model and factorize it. We're going to say: I've got these three-way weights and I've got too many of them, so I'm going to represent each three-way weight as the product of three two-way things. I'm going to introduce these factors, and each factor is going to have this many parameters, which, per factor, is just a linear number of parameters. If I have about N factors, I end up with only N squared of these weights. And if you think about how pixels transform in a new image, they don't do random permutations. It's not that this pixel goes to that one and that one goes over here, arbitrarily. Pixels do fairly consistent things, so I don't really need N cubed parameters, because I'm just trying to model these fairly consistent transformations, of which there's a limited number, and I should be able to do it with many fewer parameters. And this is the way to do it. So that's going to be our new energy function, leaving aside the bias terms. One way of thinking about what we're modeling is: I want this tensor of three-way weights. If I take an outer product of two vectors like this, I'll get a matrix that has rank one. If I take a three-way outer product, I'll get a tensor that has rank one.
And if I now add up a bunch of tensors like that--so each factor, each f, specifies a rank-one tensor--then by adding up a bunch of them I can model any tensor I like, if I use N squared factors. If I use only N factors, I can model most regular tensors, but I can't model arbitrary permutations, and that's fine for what we want. If you ask how inference works now, inference is still very simple in this model. So here's a factor. Here are the weights connecting it to, say, the pre-image. Here are the weights connecting it to the post-image. Here are the weights connecting it to the hidden units. And to do inference, what I do is this. Suppose I only have that one factor. I multiply the pixels by these weights and add all that up, so I get a sum at this vertex. I do the same here; I get a sum at this vertex. Then I multiply those two sums together to get a message going up to the hidden units. And as that message goes to a hidden unit, I multiply it by the weight on my connection. And so what the hidden unit sees is this weight times the product of these two sums, and that is the derivative of the energy with respect to the state of this hidden unit, which is what it needs to know to decide whether to be on or off, if it wants to go into a low-energy state given the images. And all the hidden units remain independent, even though I've got these multipliers now. So this is much better than putting another stochastic binary unit in here. If I put a stochastic binary unit in here, the hidden units would cease to be independent and inference would get tough. But this way, with a deterministic factor that's taking a product of these two sums, inference remains easy. The learning also remains easy. So this is the message that goes from factor f to hidden unit h, and that message is the product of what we got at these two lower vertices: the product of the sums computed on the pre-image and the post-image. And the way you learn the weight on the connection from factor f to hidden unit h is by changing the weight so as to lower the energy when you're looking at data, and raise the energy when you're generating things from the model or just reconstructing things from the hidden units you got from data. And those energy gradients just look like this: the product of the state of the hidden unit and the message that goes to it when you're looking at data, and the product of the state of the hidden unit and the message that goes to it when you're looking at samples from the model or reconstructions. So it's still a nice pairwise learning rule. Everything is pairwise still, so it might fit into the brain. Now, if we look at what one of these factors does when I show it random dot patterns that translate, we can look at the weights connecting it to the pre-image--that's a pattern of weights where white is a big positive weight and black is a big negative weight--so this is the grating it learned connecting it to the pre-image, and this is the grating it learned connecting it to the post-image. With a hundred factors, I'll show you what Roland learned. So those are the hundred factors--these are the receptive fields of the factors in the pre-image. And remember, it's looking at translating dots. And these are the factors in the post-image. And you see, it's basically learned the Fourier basis, and it's learned to shift the phase by about 90 degrees. And that's a very good way of handling translation.
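(A sketch of the factored three-way module just described: the parameter saving and the inference message. The matrices `B`, `C`, `D` for pre-image-to-factor, post-image-to-factor, and hidden-to-factor weights are illustrative names and sizes, not from the talk; this reuses `rng` and `sigmoid` from the earlier sketch.)

```python
N_pix, N_hid, N_factors = 256, 100, 256       # illustrative sizes

B = 0.01 * rng.standard_normal((N_pix, N_factors))   # pre-image  -> factor filters
C = 0.01 * rng.standard_normal((N_pix, N_factors))   # post-image -> factor filters
D = 0.01 * rng.standard_normal((N_hid, N_factors))   # hidden     -> factor weights
b_trans = np.zeros(N_hid)                             # hidden biases

# Each factor contributes a rank-one tensor, so the implied three-way weight is
# w_ijk = sum_f B[i, f] * C[j, f] * D[k, f]: 3 * N * F parameters instead of N ** 3.
def implied_threeway_weight(i, j, k):
    return np.sum(B[i] * C[j] * D[k])

def hidden_input(v_pre, v_post):
    """For every factor: filter response on the pre-image times filter response on the
    post-image; then send that product through the factor-to-hidden weights. This is
    exactly what each hidden unit needs in order to decide whether to turn on."""
    messages = (v_pre @ B) * (v_post @ C)      # one product of two sums per factor
    return messages @ D.T

def infer_transformation_units(v_pre, v_post):
    p = sigmoid(hidden_input(v_pre, v_post) + b_trans)
    return p, (rng.random(p.shape) < p).astype(float)

# The learning signal for D stays pairwise: hidden state times incoming message,
# measured on data minus measured on reconstructions (same pattern for B and C).
def d_gradient(v_pre, v_post, h):
    messages = (v_pre @ B) * (v_post @ C)
    return np.outer(h, messages)               # same shape as D
```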
Mathematicians say things like, "The Fourier basis is a natural basis for modeling translation." I don't really know quite what that means, but it just learned the Fourier basis. And if you give it rotations, it learns a different basis. So this is the basis it learns for rotations. You see it learns about yin and yang here. Oops [INDISTINCT]. Okay, that's the basis for rotations. One other thing you can do is train it just on single dot patterns translating in a coherent way, and then test it on two overlaid dot patterns that are translating in different directions. It's never seen that before; it's only been trained on coherent motion, but we're going to test it on what's called transparent motion. In order to see what it thinks--we trained it unsupervised, there are no labels anywhere, we never tell it what the motions are--we need some way of seeing what it's thinking. So we add a second hidden layer that looks at the hidden units representing transformations and is fairly sparse, so the units in that second hidden layer will be tuned to particular directions of motion. And then, to see what it's thinking, we take the directions those units like, weighted by how active those units are, and that will tell us what directions it thinks it's seeing. Now, when you show it transparent motion and you look at those units in the second hidden layer: if the two motions are within about 30 degrees, it sees a single motion in the average direction. If they're beyond about 30 degrees, it sees two different motions, and what's more, they're repelled from each other. That's exactly what happens with people, and so this is exactly how the brain works. Okay, there's going to be a lot of that kind of reasoning in this talk. I'm going to move on to time series models now. So we'd like to model not just static images; for example, we'd like to model video. To begin with [INDISTINCT], we're going to try something a bit simpler. When people do time series models, they would nearly always like to have a distributed non-linear representation, but that's hard to learn. So people tend to do dumber things like Hidden Markov Models or Linear Dynamical Systems, which give up on either the distributed or the non-linear, but are easy to do inference in. What we're going to come up with is something that has the distributed and the non-linear and is easy to do inference in; the learning algorithm isn't quite right, but it's good enough--it's just an approximation to [INDISTINCT]. And the inference also ignores the future and just bases things on the past. So here's the basic module, and this is with just two-way interactions. This is the Restricted Boltzmann Machine with visible units and hidden units. Here are the previous visible frames. These are all going to be linear units. And so these blue connections are conditioning the current visible values on previous observed values in a linear way; that's called an autoregressive model. The hidden units here are going to be binary hidden units; they're also conditioned on previous visible frames. And learning is easy in this model. What you do is take your observed data, and then, given the current visible frame and the previous visible frames, you get the input to the hidden units; they're all independent given the data, so you can separately decide what states they should be in. Once you've picked states for them, you reconstruct the current frame using the input you're getting from the previous frames and the top-down input you get from the hidden units.
After you reconstruct, you activate the hidden units again. You take the difference in the pairwise statistics with data here and with reconstructions here to learn these weights, and you take the difference in the activities of these guys with data and with reconstructions to get the signal you use to learn these weights and these weights. So learning is straightforward--it just depends on differences--and you can learn a model like this. After you've learned it, you can generate from the model by taking some previous frames. These conditioning inputs, in effect, fix the biases of these units to depend on the previous frames. So these are dynamic biases, and with these biases fixed, you just go backwards and forwards for a while and then pick a frame there, and that's the next frame you've generated, and then you keep going. So we can generate from the model once it's learned, and see what it believes. >> You always go back two steps in time, or is that just an example? >> HINTON: Sorry? >> Oh, you were just going back only two steps in time? >> HINTON: No, we're going to go back more steps in time. >> Okay, and you let... >> HINTON: I just got lazy with the PowerPoint. Now, one direction we could go from here is to higher-level models. That is, having learned this model, where these hidden units are all independent given the data, we could say: well, what I've done is I've turned the sequence of visible frames into a sequence of hidden frames. And it turns out you can get a better model if you take these hidden frames and model what's going on there, and now you put in conditioning connections between the hidden frames and more hidden units that don't have conditioning connections and don't interact with the other hidden units [INDISTINCT] in this model. Then you can prove that if you do this right, you'll get a better model of the original sequences, or you'll improve a bound on the model of the original sequences. So you can [INDISTINCT] lots of layers like that, and when you have more layers, it generates better. But I'm going to go in a different direction: I'm going to show you how to do it with three-way connections. And we're going to apply it to motion-capture data. So you put reflective markers on the joints, you have lots of infrared cameras, and you figure out where the joints are in space. You know the shape of the body, so you work backwards through that to figure out the joint angles, and then a frame of data is going to consist of about 50 numbers, which are joint angles and the translations and rotations of the base of the spine. Okay. So imagine we've got one of these mannequins you see in art shop windows, with a pin stuck in the base of his spine, and we can move him around and rotate him using this pin, and we can also wiggle his legs and arms. And what we want is, as we move him around, for him to wiggle his legs and arms so that his feet appear to be stationary on the ground and he appears to be walking. And he'd better wiggle his legs just right as we translate his pelvis, otherwise his feet will appear to skid on the ground. And to model him, we can do a hierarchical model like I just showed you, or we can do a three-way model like this, where we condition on six earlier frames; this is the current visible frame; and here's a basic Boltzmann machine, except that it's now one of these three-way things, where these are factors. And we have a 1-of-N style variable. So we have data and we tell it the style when we're training it, so it's sort of semi-supervised.
It learns to convert that 1-of-N representation into a bunch of real-valued features, and then it uses those real-valued features as one of the inputs to a factor. And what the factors are really doing is saying: these real-valued features are modulating the weight matrices that you use for conditioning, and also the weight matrix that you use in the purely linear model. So these features are modulating an autoregressive model. That's very different from switching between autoregressive models; it's much more powerful. Yeah? >> I missed what this one-of-N variable is...? >> HINTON: So we're going to have data of someone walking in various different styles. >> Styles of walking. >> HINTON: The style of walking, yeah. >> So in your earlier diagram, where you condition on the history, it looked like there was nothing to keep track of the relative order of the earlier frames in those direct links, because... >> HINTON: Yes. >> ...is there anything in the model that cares about that relative order? >> HINTON: Yeah, yeah. The weights on the connections will tell you which frame it's coming from. Right? In the earlier model there were two blue lines; they're different matrices and they have different weights in them. >> There's nothing from two steps previous to one step previous, right? It just skips all the way? >> HINTON: It just skips all the way, right. >> Will that continue to happen? >> HINTON: Yes. In other words, there are direct connections from all six previous frames to the current frame for determining the current frame. >> Right. And then what links from the six frames to the fifth earlier frame? >> HINTON: Well, those were there when you were computing what the fifth frame was doing, right? >> Okay. >> HINTON: But when we're computing this frame, we have direct connections from it. Okay. So we're now going to train this model--it's relatively easy to train, especially on a GPU board--and then we're going to generate from it, so we can see roughly what it learned, and we can judge whether it's doing well by whether the feet slip on the ground. All right. >> [INDISTINCT] >> HINTON: We'll get there. >> Sorry. >> HINTON: Here's a normal walk--maybe--okay, there it goes. So this is generated from the model: he's deciding which direction to turn in, and he's deciding, you know, that he needs to make the outside leg go farther than the inside leg, and so on. We have one model, but if we flip the style label to, say, gangly teenager, he definitely looks awkward. Right, we've all been there. I think this is a computer science student. My main reason for thinking that is that if you ask him to do a graceful walk, it looks like this. And that's definitely C-3PO. >> [INDISTINCT]. >> HINTON: Now, I think this was a student [INDISTINCT]--but he's very good. You can ask him to walk softly like a cat. We're asking the model to do this, right? The model looks pretty much like the real data; in the real data obviously the feet are planted better, but notice he can slow down and then speed up again. Autoregressive models can't do things like that. Autoregressive models have a biggest eigenvalue: either it's bigger than one, in which case they explode, or it's smaller than one, in which case they die away, and the way to keep them alive is that you keep injecting random noise so that they stay alive, and that's like making a horse walk by taking a dead horse and jiggling it. It's not good.
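(A sketch of the conditional model described above: the six previous frames just shift the biases of the current visibles and of the hidden units, and generation alternates between the two layers with those biases held fixed. The names `A_past_v`, `A_past_h`, `W_vh`, the sizes, and the omission of the style and factor machinery are simplifications for illustration; it reuses `rng` and `sigmoid` from the earlier sketches.)

```python
frame_dim, n_hid, order = 50, 200, 6          # ~50 joint-angle numbers, six past frames

A_past_v = 0.01 * rng.standard_normal((order * frame_dim, frame_dim))  # past -> current visibles (autoregressive)
A_past_h = 0.01 * rng.standard_normal((order * frame_dim, n_hid))      # past -> hidden units
W_vh     = 0.01 * rng.standard_normal((frame_dim, n_hid))              # current visibles <-> hiddens

def dynamic_biases(past_frames):
    """The conditioning inputs act as extra, data-dependent biases."""
    p = past_frames.ravel()                    # six previous frames concatenated
    return p @ A_past_v, p @ A_past_h

def generate_next_frame(past_frames, gibbs_steps=30):
    dyn_v, dyn_h = dynamic_biases(past_frames)
    v = past_frames[-1].copy()                 # start the chain near the most recent frame
    for _ in range(gibbs_steps):               # go backwards and forwards with the biases fixed
        p_h = sigmoid(v @ W_vh + dyn_h)
        h = (rng.random(n_hid) < p_h).astype(float)
        v = dyn_v + W_vh @ h                   # linear visible units: take the mean (unit sigma assumed)
    return v                                   # this becomes the next frame; then slide the window forward
```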
Now, he doesn't have any model of the physics, so in order to do these kinds of stumbles there had to be stumbles similar to that in the data; but when he stumbles, and which way he stumbles, that he decides entirely by himself. We could make him do a sexy walk, but you're probably not interested in that. >> Can we order the chicken? >> HINTON: You want dinosaur or chicken? Where's dinosaur or chicken? >> The chicken, number five. >> It's number five. >> HINTON: Oh no, that's dinosaur and chicken; that's a blend. Maybe a switch. He's got quite a lot of foot [INDISTINCT]; that's probably a blend. This is doing a sexy walk, and then you flip the label to normal, and then you flip it back to sexy. It's never seen any transitions, but because it's all one model, it can do reasonable transitions. >> So you have these hundred style variables; can you decouple those from the one-of-N style and just make up new styles by playing with those... >> HINTON: Yup. Yup. You can also give it many more labels when you train: you can give it speed, stride length, all sorts of things, and then you can control it very well, yeah. Okay. So you can learn time series, at least for 50-dimensional data, and obviously what we all want to do is apply that to video, but we haven't done that yet, except for some very simple cases. The last thing I'm going to show is the most complicated use of these three-way models. One way of thinking of it, so that it's similar to the previous uses, is that we take an image and we make two copies of it, but they have to be the same. And then we insist that the weights that go from a factor to this copy are the same as the weights that go from the factor to that copy: so if i = j, then w_if = w_jf. Inference is still easy. In fact, inference here will consist of: you take these pixels times these weights to get a weighted sum, and then you square it, because this is going to be the same weighted sum. So inference consists of: take a linear filter, square its output, and send it via these weights to the hidden units. That's exactly the model called the [INDISTINCT] energy model, with the right kind of linear filter. This has been proposed both by vision people--by Adelson and Bergen, a long time ago in the '80s--and by neuroscientists. So neuroscientists have tried to take these cells--I'm being vague about which--and look at what power their output is of their input, and Yang Dan at Berkeley says it's between 1.7 and 2.3, and that means two. So this looks quite like models that were proposed for quite different reasons, and it just drops out of taking a three-way model of images and factorizing it. The advantage we have is that we now have a learning algorithm for all these weights, because we have a generative model. So now we can model covariances between pixels, and the reason that's good is--well, here's one reason why it's good. Suppose I asked you to define a vertical edge. Most people will say, "Well, a vertical edge is something that's light on this side and dark on that side. Well no, maybe it's light on that side and dark on this side, but you know. Or it could be light up here and dark down there, and dark up here and light down there." Or it could be a texture edge. Or--oh, it might actually be a disparity edge. Or it might be motion on this side and no motion on that side; that's a vertical edge too. So a vertical edge is a big assortment of things, and what all those things have in common is that a vertical edge is something where you shouldn't do horizontal interpolation.
Generally, in an image, horizontal interpolation works really well. A pixel is the average of its left and right neighbors, pretty accurately, almost all the time. Occasionally it breaks down, and the place it breaks down is where there's a vertical edge. So a really abstract definition of a vertical edge is: a breakdown of horizontal interpolation. And that's what our models are going to do. A hidden unit is going to be putting in an interpolation, and--it's sort of reverse logic--when that breaks down, it's going to turn off. So one way of seeing it is this: if this hidden unit here is on, it puts in a weight between pixel i and pixel j that's equal to this weight times this weight times this weight. Okay, that's good enough. So these hidden units are effectively controlling the Markov random field between the pixels, so we can model covariances nicely. Because the hidden units are creating correlations between the visible units, reconstruction is now more difficult. We could reconstruct one image given the other image, like we did with motion, but if you want to reconstruct them both and make them identical, it gets harder. So we have to use a different method called Hybrid Monte Carlo. Essentially, you start where the data was and let it wander away from where it was, but keeping both images the same. I'm not going to go into Hybrid Monte Carlo, but it works just fine for doing the learning. The Hybrid Monte Carlo is used just to get the reconstructions, and the learning algorithm is just the same as before. And what we're going to do is have some hidden units that are using these three-way interactions to model covariances between pixels, and other hidden units that are just modeling the means. And so--for mean and covariance--we call this the mcRBM. Here's an example of what happens after it's learned on black and white images. Here's an image patch. Here's its reconstruction of the image patch if you don't add noise, which is very good, from the mean and covariance units. Here's the stochastic reconstruction, which is also pretty good. But now we're going to do something funny. We're going to take the activations of the covariance units--the things that are modeling which pixels are the same as which other pixels--and we're going to keep those. But we're going to take the activations of the mean units and throw those away, and pretend that the means of the pixels look like this. Let's take this one first: we tell it all the pixels have the same value, except these, which are much darker, and it now tries to make that information about means fit in with this information about covariances, which says these guys should be the same as each other but very different from those guys. And so it comes up with a reconstruction that looks like that, where you see it's taken this dark stuff and blurred it across this region here. If we just give it four dots like that, plus the covariance information you got from there, it'll blur those dots out to make an image that looks quite like that one. So this is very like what's called the watercolor model of images, where you know where the boundaries are and you just sort of roughly sketch in the colors of the regions, and it all looks fine to us, because we sort of slave the color boundaries to where the edges actually are. If you reverse the colors of these, it produces the reversed image, because the covariance doesn't care at all about the signs of things.
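(A sketch of the covariance side of the mcRBM just described: tied filters whose squared outputs drive the covariance hidden units, and the pixel-pixel coupling an active hidden unit implies--"this weight times this weight times this weight." The names `C_filt`, `P`, `W_mean`, the sizes, and the sign conventions are assumptions for illustration, not the exact parameterization of the published model; it reuses `rng` and `sigmoid`.)

```python
n_pixels, n_factors, n_cov_hid, n_mean_hid = 192, 400, 100, 100   # e.g. an 8x8 color patch, flattened

C_filt = 0.01 * rng.standard_normal((n_pixels, n_factors))        # pixel -> factor filters, tied across the two copies
P = -np.abs(0.01 * rng.standard_normal((n_factors, n_cov_hid)))   # factor -> covariance-hidden weights (kept non-positive here)
b_cov = np.ones(n_cov_hid)                                        # covariance hiddens default to "on": interpolation holds
W_mean = 0.01 * rng.standard_normal((n_pixels, n_mean_hid))       # ordinary mean-unit weights
b_mean = np.zeros(n_mean_hid)

def infer_mcrbm_hiddens(v):
    """Mean units see the usual linear input; covariance units see squared filter
    outputs (the two image copies are identical, so the product of the two filter
    responses is a square). A big squared response -- an edge -- pushes its unit off."""
    mean_p = sigmoid(v @ W_mean + b_mean)
    cov_p = sigmoid(((v @ C_filt) ** 2) @ P + b_cov)
    return mean_p, cov_p

def implied_pixel_coupling(h_cov):
    """The lateral weight between pixel i and pixel j contributed by the active
    covariance hiddens: sum over factors of C[i, f] * C[j, f] * (P[f] . h_cov)."""
    gate = P @ h_cov                   # how strongly each factor is currently enabled
    return (C_filt * gate) @ C_filt.T  # n_pixels x n_pixels coupling matrix
```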
If you look at the filters that it learns: the mean units, which are for sort of coloring in regions, learn these blurry filters, and by taking a combination of a few dozen of those you can make more or less whatever colors you like anywhere. So they're smooth, blurry, and multicolored, and you can make roughly the right colors. The covariance units learn something completely different. So these are the filters the factors learn, and you'll see that they learn high-frequency black and white edges. And then a small number of them turn into low-frequency color edges that are either red-green or yellow-blue. And what's more, when you make them form a topographic map, using a technique I'll describe on the next slide, you get this low-frequency color blob in with the low-frequency black and white filters. And that's just what you see in a monkey's brain, pretty much. If you go into a monkey's brain, you'll see these high-frequency filters whose orientation changes smoothly as you go through the cortex tangentially, and you'll see these low-frequency color blobs. Most neuroscientists thought that at least must be innate. What this is saying is: nope, just the structure of images, plus the idea of forming a topographic map, is enough to get this. That doesn't mean it's not innate; it just means it doesn't need to be. So the way we get the topographic map is by this global connectivity from the pixels to the factors, so the factors really are learning the filters themselves; the filters start off colored and gradually learn to be exactly black and white. Then there's local connectivity between the factors and the hidden units. So one of these hidden units will connect to a little square of factors, and that induces a topography here. And the energy function is such that when you turn off one of these hidden units, to say smoothness no longer applies, you pay a penalty, and you'd rather just pay the penalty once. So if two factors tend to come on at the same time, it's best to connect them to the same hidden unit, so you only pay the penalty once. And that will cause similar factors to go to similar places in here, and we get a topographic map. For people who know about modeling images: as far as I know, nobody had yet produced a good model of patches of color images--that is, a generative model that generates stuff that looks like the real data. So here's a model that was learned on 16x16 color image patches from the Berkeley database, and here are patches generated from the model. And they look pretty similar. Now, it's partly a trick: the color balance here is like the color balance there, and that makes you think they're similar. But it's partly real. I mean, most of these are smooth patches of roughly uniform color, as are most of these. A few more of these are smooth than of those. But you also get these things where you get fairly sharp edges, so you get smoothness, then a sharp edge, then more smoothness, like you do in the real data. You even get things like corners here. We're not quite there yet, but this is the best model there is of patches of color images. And it's because it's modeling both the covariances and the means, so it's capable of saying what's the same as what, as well as what the intensities are. You can apply it to doing recognition. So this is a difficult object recognition task where there are 80 million unlabeled training images, not only of these classes but of thousands and thousands of classes.
They were collected by people at MIT; it's called the Tiny Images database. They're 32x32 color images, but it's surprising what you can see in a 32x32 color image. And since the biggest model we're going to use has about a hundred million connections, that's about 0.1 of a cubic millimeter of cortex in terms of the number of parameters, so we have to somehow give our computer model some way of keeping up with the brain, which has a lot more hardware, right? And we do it by giving it a very small retina. We say: suppose the input is only 32x32, maybe we can actually do something reasonable there. So as you'll see, there's a lot of variation. If you look at the birds, that's a close-up of an ostrich, and this is a much more typical picture of a bird. And it's hard to tell the difference between some of these categories in tiny images, particularly things like deer and horse. We deliberately chose some very similar categories, like truck and car, deer and horse. People are pretty good at these; people won't make very many errors. That's partly because these were hand-labeled by people. But even people make some errors. We only have 50,000 labeled training examples--five thousand of each class--and ten thousand test examples, because we have to hand-label them, but we have a lot of unlabeled data. So we can do all this pre-training on lots of unlabeled data, and then take our covariance units and our mean units and just try doing multi-[INDISTINCT] on top of those, or maybe add another hidden layer and do it on top of that. What Marc'Aurelio Ranzato actually did, since he worked in Yann LeCun's lab, is he took smaller patches, learned the model, and then strode it across the image and replicated it, so it's sort of semi-convolutional. And then he took the hidden units of all of these little patches and just concatenated them to make a great big vector of 11,000 hidden units, which are both the means and the covariances. And then we're going to use those as our features and see how well we can do. And we're going to compare it with various other methods. So the first comparison: you just take the pixels and do logistic regression on the pixels to decide among the ten classes. You get 36% right. If you take GIST features, which were developed by Torralba and the people at MIT and are meant to capture what's going on in the image quite well, but are fairly low dimensional, you get 54%. So they're much better than pixels. If you take a normal RBM, which has linear units with Gaussian noise as the input variables and then binary hidden units, and you use those binary hidden units to do classification, you get 60%. If you use one of these RBMs with both kinds of units--ones like these for doing the means, and then these units with the three-way interactions for modeling covariances--you get 69%, as long as you use a lot of these factors. And if you then learn an extra hidden layer of 8,000 units--so now that times that is about a hundred million, so there's an extra hundred million connections to learn there, but that's fine, because it's unsupervised, so you just learn it on lots of data--you get up to 72%. And that's the best result so far on this database.
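(A sketch of the semi-convolutional feature pipeline just described: run the patch model at a stride across the 32x32 image, concatenate the patch features, and then train a simple classifier on the labeled subset. The patch size, the stride, and the scikit-learn classifier are assumptions for illustration; `infer_mcrbm_hiddens` is the hypothetical helper from the earlier sketch.)

```python
from sklearn.linear_model import LogisticRegression

def image_features(img, patch=8, stride=8):
    """img is assumed to be (32, 32, 3), already normalized. Extract patches on a grid,
    run the learned patch model on each, and concatenate the mean and covariance
    hidden activities into one big feature vector."""
    feats = []
    for y in range(0, img.shape[0] - patch + 1, stride):
        for x in range(0, img.shape[1] - patch + 1, stride):
            v = img[y:y + patch, x:x + patch].ravel()   # 8*8*3 = 192 values, matching the patch model
            mean_p, cov_p = infer_mcrbm_hiddens(v)
            feats.append(np.concatenate([mean_p, cov_p]))
    return np.concatenate(feats)

def train_classifier(images, labels):
    # The unsupervised pre-training produced the features; only this step uses the labels.
    X = np.stack([image_features(img) for img in images])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```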
One final thing: you can take this model that was developed for image patches--and the student who'd been doing phoneme recognition just took that code and applied it to log spectrograms, which is much closer to what you'd like to use. You're not using all this mel-cepstral stuff, which is designed to throw away stuff you think you don't need and to get rid of lots of correlations. Instead, you take data that has lots of correlations in it, but now we've got a model that can deal with that. And the first thing George tried, on February the 20th, which was four layers of a thousand hidden units on top of this, got 22.7 percent error, which was the record for phone recognition on the TIMIT database when you're not adapting the model to each speaker. And then a week later, when he'd tuned it and used more frames, he was down to 21.6%. So all this stuff was designed to do vision; it wasn't designed to do phonemes. And if we treat phoneme recognition as just a vision problem on a log spectrogram, we can beat the speech people, at least on small vocabularies. Another student, now at Microsoft, is seeing if this will work on big vocabularies as well. >> [INDISTINCT] >> HINTON: Yes. Yes, right. >> We can give them new and better tools. >> HINTON: We can give them new and better tools. So here's phone recognition over the years. Backprop from the '80s got a 26.1 percent error rate. Over the next 20 years or so, they got that down to 24.4 percent, using methods that weren't based on learning, so we'll call them artificial. We then got down to 21.6 percent; an estimate of human performance is about 15 percent. I don't know much about how they did that estimate, I'm afraid. But we're nearly a third of the way from artificial to human, so we need two more ideas and we're there. Okay, I'm done. I'm finished. >> Questions? >> HINTON: Yes? >> You mentioned that someone recently announced they'd broken the world record on the MNIST data set of digit recognition by simply using a seven-layer feedforward network trained with backprop, but doing it on a GPU with lots and lots of cycles. >> HINTON: Yes, he did indeed announce that. What he didn't announce was--he's got a spectacular result; he gets down to 35 errors. What he didn't announce was that there are two tricks involved. One trick is to use a big net with lots of layers on a GPU board. That trick by itself wouldn't give you 35 errors. There's a second trick, which was sort of pioneered by people at Microsoft in fact, which is to put lots of work into producing distortions of the data, so you have lots and lots of labeled data. So you take a labeled image of a two and you distort it in clever ways so it still looks like a two but is translated a bit, and people could already get down to about 40 errors that way. >> I think they patented that already. >> HINTON: Good. So Dick's already patented that. So you can get down to about 40 errors by doing these distortions. What he did was even better distortions, or more of them, and a much bigger net on a GPU, and he got from 40 down to 35, which is impressive, because it's hard to make any progress there. But it won't work unless you have a lot of labeled data.
And the disguised thing is the work that went into--if you look in the paper, it all looks straightforward, it's just backprop, except when you get to the section on how they generated all that extra labeled data, where there are very careful things, like: if it's a one or a seven, they only rotate it by a certain number of degrees, but if it's something else, they rotate it by more degrees. I'm actually the referee for this paper, but I don't mind him knowing. I think it's very important work. But he should emphasize that they have to have labeled data to do that, and they have to put work into the distortions. So for me, the lesson of that paper is: when you have small computers, you should put your effort into things like weight constraints, so you don't have too many parameters, because you've only got a small computer. As computers get bigger and faster, you can transfer your effort: instead of tying the weights together, like Yann was doing in the early days, put your effort into generating more distortions, so you inject your prior knowledge in the form of distortions. That's much less computation-efficient, but with big computers it's fine, and it's more flexible. So I think that's the lesson of that paper. >> I shouldn't even need to ask you a question; you answered it. Thank you. >> HINTON: Any other questions? >> It seems like you've invented some kind of cortex here that has the property that if it does vision, it'll do sound. >> HINTON: Yes. >> What other problems are you going to apply it to? >> HINTON: Maybe it'd be quicker to say the problems we're not going to apply it to. >> Okay. >> HINTON: I can't think of any. I mean--okay, let me say what the main limitation of this is for vision. We've got at least 10 billion neurons for doing visual things--or at least a billion, anyway; probably 10 billion. And even with that many neurons and about 10 to the 13 connections for doing vision, we still have a retina that's got a very small fovea, about the size of my thumbnail at arm's length. And so we take almost everything and don't look at it. I mean, the essence of vision is to not look at almost everything, intelligently; and that's why you get all these funny illusions where you don't see things. We have to do that in these models. These models are completely crazy, and almost all of computer vision is completely crazy, because they take a uniform-resolution image--and quite a big one, like a thousand by a thousand--and they try to deal with it all at once, with filters all over the image. And if they're going to do selection, they either do it by running over everything with no intelligence, or they do sort of interest point detection at a very low level to decide what to attend to. What we do is we fixate somewhere. Then, on the basis of what our retina gives us, with these big pixels around the edges and small pixels in the middle, we sort of decide what we're seeing and where to look next, and by the second or third fixation we're fixating very intelligently. And the essence of it is that vision is sampling; it's not processing everything. And that's completely missing from what I said. Now, in order to do that, you have to be able to take what you saw and where you saw it and combine them, and that involves a multiply. So this module--it can do multiplies. It's very good at combining whats and wheres, to integrate information over time. And that's one of the things we're working on. But that's probably the biggest thing missing.
But that's an example of: having a module that's quite good is never good enough on its own, so you have to put it together over time and use it many times. And that's where sequential reasoning and all that stuff comes in. So basically, as soon as people become sequential, we're not modeling that at all. We're modeling what you can do in a hundred milliseconds. And so that's what's missing. But I believe that to model that sequential stuff, we need to understand what it's a sequence of: it's a sequence of these very powerful operations. And we're in better shape now to try to model sequential AI than we were when we didn't know what a primitive operation was. If the sort of primitive operation was just deciding whether two symbols are the same, we were going to be out of luck for understanding how people do sequential stuff. Yeah. >> This is a [INDISTINCT] question, since you said you wanted to do everything. Are you going to do [INDISTINCT] logic, like "there exists a God" and "every girl has a boy she loves"? >> HINTON: Hang on, I'm still processing that. Right. Right, I'm making the point that people find quantifiers quite difficult. >> Oh, yeah. If you [INDISTINCT] quantifiers... >> HINTON: I would love to do that. I haven't got a clue how to do it. And you'll notice that in old-fashioned AI, they used to point out to [INDISTINCT] people that you can't do quantifiers, so forget it. Nowadays, when they all do graphical models, they don't mention that anymore, because graphical models have difficulty with it too. Some people have got [INDISTINCT]--some people do. Right, yeah, some people do. But most of the graphical models of, like, five years ago, they don't do quantifiers either. And so a pretty good dividing line would be what you can do without having to deal with really sophisticated problems like that. I would love to know how we deal with that, but I don't. >> Thank you. >> HINTON: So, yeah, I'm going to give up on that right now. >> So if you had 80 million labeled images and no extra unlabeled ones, would you do your pre-training... >> HINTON: Yes. Yes. >> ...and then fine-tuning, and would that make it better? >> HINTON: On TIMIT, that's what we have. On TIMIT, all the examples have labels. It's still a big win to do the pre-training. >> But doesn't this result I'm just hearing about seem to suggest... >> HINTON: Well, I haven't tried it with all these distortions during pre-training. Now, I have a student called [INDISTINCT] who just produced a thesis, and he tries things like that. He tries distortions in earnest, and he uses special distortions of his own. And the fact is, distortions help a lot. But if you do pre-training, that helps some more too. And [INDISTINCT] results--yes, [INDISTINCT] results--suggest that pre-training will get you to a different part of the space, even if you have all the labeled data. So clearly, one thing that needs to be done is to try the pre-training combined with those labels. You don't have to have the pre-training, but I bet you it still helps. And I bet you it's more efficient too. It's faster, because the pre-training is pretty fast; you quickly learn a very good model, and you've got lots of good features. And starting from there, I think you'll do better than he does just starting from random, and faster. That's just a prediction. You might even get down to 34 errors out of this. The problem with [INDISTINCT] is you can't get significance. TIMIT is really nice that way.
They designed it well, so you get high error rates, so you can see differences. >> On the time series aspect, do you see anything that would get you inferences or relations that go beyond the size of the time window you're using? >> HINTON: Sorry, I didn't understand the question. We have a limited time window. We don't... >> You have a limited time window; after training, is there anything in the model that picks up... >> HINTON: Nothing. >> Nothing. >> HINTON: Nothing. It cannot deal with--it can't model... >> It has an internal state. It has an internal state. >> HINTON: Right. But if, sort of, what happened 15 time steps ago really tells you what should happen now, and it only tells you what should happen now--it doesn't tell you what should happen in the intermediate 14 time steps, it just carries information across 15 time steps without having a signature at smaller time scales--you can't pick up on that. >> Okay. >> HINTON: Because it hasn't got hidden state with a forward-backward algorithm. A forward-backward algorithm potentially could pick that up; this actually can't. >> So this one wouldn't pick up on things like object permanence, where a ball rolls behind the box and comes out the other side; it's not going to be able to... >> HINTON: Not over a long time scale, no, no. Unless you say there's a memory involved when you go back to a previous--it gets more complicated, right? Now, it is true that when you build the multilevel one, which you can do with the two-way connections as well as with the three-way connections, at every level you're getting a bigger time span, because your time window is going further back into the past with each level. So you get a bit more, but that's just sort of linear. >> Do you have any rules of thumb for how much unlabeled data you need to train each of the different levels and how it changes--like, is it just linear in the number of weights, or as you go up levels do things change? >> HINTON: I have one sort of important thing to say about that, which is that if you're modeling high-dimensional data and you're trying to build an unsupervised model of the data, you need many fewer training cases [INDISTINCT] than you would have thought if you're used to discriminative learning. When you're doing discriminative learning, there are typically very few bits per training case to constrain the parameters. The number of bits you get per training case is the number of bits it takes to specify the answer, not the number it takes to specify the input. So with ten classes, you get 3.3 bits per case. If you're modeling the image, the number of bits per case is the number of bits it takes to specify the image, which is about a hundred bits. So you need far fewer cases per parameter. In other words, what I'm saying is you're modeling much richer things, and so each case is giving you much more information. So actually, we can typically fit many more parameters than we have training cases, and discriminative people aren't used to that: many fewer parameters than we have pixels, and many more than training cases. And in fact, he used about two million cases for doing the image stuff, and it wasn't enough--it was overfitting; he should have used more. But he was fitting 100 million parameters. So basically, the only rule of thumb is: many fewer parameters than the total number of pixels in your training data, but you can typically use many more parameters than the number of training cases.
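(The rough arithmetic behind that rule of thumb, as a sketch: the 10-class and 100-bits-per-image figures are the ones used in the talk, the two-million-case count is the one he mentions, and the rest is illustrative.)

```python
import math

bits_per_label = math.log2(10)     # ten classes: ~3.3 bits of constraint per labeled case
bits_per_image = 100               # rough figure from the talk for specifying an image
cases = 2_000_000                  # about the number of unlabeled cases mentioned

print(f"constraint from labels: {cases * bits_per_label / 1e6:.1f} million bits")
print(f"constraint from images: {cases * bits_per_image / 1e6:.1f} million bits")
# ~30x more constraint per case when you model the input itself, which is why an
# unsupervised model can sensibly fit ~100 million parameters on ~2 million cases.
```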
And you can't do that with normal discriminative learning. Now, if you do do that, then when you start the discriminative training, it quickly improves things and then very quickly overfits, so you have to stop it early. Okay. >> Okay? >> HINTON: Thanks. >> Let's thank the speaker again. >> Thank you.