TensorFlow and Deep Learning without a PhD, Part 2 (Google Cloud Next '17)

[MUSIC PLAYING] MARTIN GORNER: So thank you for filling the house. I'm really impressed that TensorFlow is getting so much attention. But I think the technology deserves it. So I'm happy about that. So yesterday, we built a neural network for recognizing handwritten digits. And we went through dense neural networks and convolutional neural networks. And today, I want to build with you another kind of network, a recurrent neural network. So let's go. A couple of reminders from yesterday. You remember, we started with this initial one layer neural network. And I need you to remember the formula for one layer of a neural network because we will be using it. So we were reading in pixels. But it works for any input vector, of course. And you remember, we said that the neurons do weighted sums of all of their inputs, they add a bias, and they feed that through some activation function. Here, softmax. It can be another function. It's just a function, value in, value out. But usually, in neural networks, it's a non-linear function. And we wrote this one layer neural network using a matrix multiply, blah, blah, blah, we've seen all that, as this formula. And you remember, we did this not for just one image, but we actually wrote this formula processing 100 images at a time. So in x, we have a batch of images, a whole batch, 100 images. And then x times w are all the weighted sums for our neurons. We add the biases. We feed that through our activation function. And we obtain a batch of predictions. So in our case, since we were classifying handwritten digits, those predictions are 10 numbers, which are the probabilities of these digits being a 0, a 1, 2, 3, and so on. And so we obtained those probabilities as the outputs of our 10 neurons. OK, so whenever you see this formula, we will see it again a lot today, you think, one layer of a neural network. OK? And then also, what I need you to remember is that once we get our output from the neural network, the way we train it is that we give it examples. It produces some prediction. And then we say, no, no, no, no, no, that's not what we wanted. This is what you should predict. We give it the correct answer. And to do that, we have to encode this correct answer in a similar format. So it's called-- it's a very basic type of encoding. It's called one-hot encoding. And basically here, if we have 10 categories, specifying one answer category means encoding it as ten 0s with just one 1 somewhere in the middle, and the index of that 1 (here, it's at index 6) means that the correct answer was a 6, OK? So in this shape, it becomes possible to compute a distance between what the network predicts and what we know to be true. And that distance, we call that our error function. Or sometimes, it's called the loss function. That's what we use to guide our training. So during training, we give it an example, it produces an output. We say, no, no, no, that's not what we wanted. Compute the distance between what the network says and what we know to be true. And from that distance, we derive the gradient. And then, we follow the gradient. And that modifies the weights and biases. And that's what training is about, OK? So now, let's look at this neural network. So it should look familiar. It has a vector as an input. It has a middle layer using the hyperbolic tangent as an activation function. So we've seen the sigmoid last time, which is let's say the simplest possible function going from 0 to 1 continuously. 
The hyperbolic tangent is the simplest possible function going from minus 1 to 1 continuously. It's just a sigmoid shifted. And then, a second layer, which is a softmax layer so that we read something out. But the specific thing here is that the output of this intermediate green layer is actually fed back into the inputs at the next time step. So the real input into one cell of a recurrent neural network is the input concatenated to the output of the inner layer from the previous step. And we call this the state. So it's actually a state machine. You feed it inputs. It produces outputs. But you also feed it a state. It produces an output state, which you feed back in, in the next time step. And that's why it's called a recurrent neural network, it's because it is applied on time sequences. At each step in time, you feed in one input vector, concatenate it to the previous state. Turn the crank once. That produces some outputs from this middle layer, as well as a result. And you feed that back as the new input state for the next x input, which you have in your sequence. So it can be represented like this. I was showing you the neurons inside. But here, it's basically the API of one recurrent neural network cell. It has an input. It has an output, which you then usually feed into a softmax layer to make sense of it, to produce predictions. I mean, probabilities. And it has an input state that produces an output state that you loop back in as the input state. That's the state machine part. OK? So now, well yes, and the parameter for this is the internal size of this middle layer. That's what is adjustable. Usually, your input is whatever your input is. And your output is whatever you're trying to predict. So those are not adjustable parameters. So here it is written in equations. Again, the input is the real input at time t concatenated to the previous state. Then, we feed that through. Here, you should recognize one layer of a neural network. You should recognize this formula using the hyperbolic tangent as an activation function. So I put it over there. And this produces an output, Ht, which is both used as our new state and as the output that will be fed into the softmax layer to actually produce a vector of probabilities between 0 and 1. OK? 
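For reference, here are the cell equations just described, written out; the symbols W, b and W', b' are simply labels for the two sets of weights and biases, not necessarily the slide's exact notation:

    X'_t = [X_t, H_{t-1}]                          (input concatenated to the previous state)
    H_t  = \tanh(X'_t \cdot W + b)                 (one layer of a neural network, tanh activation)
    Y_t  = \mathrm{softmax}(H_t \cdot W' + b')     (softmax readout layer)
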
So now, how do we train this thing? So typically, this is used for natural language processing, for instance. So a typical input will be a character. And a character will be, again, one-hot encoded into, let's say, a 100-component vector; we will be using here an alphabet of 100 possible characters. So one character is encoded into a 100 element vector, so 99 0s and a 1 at the ASCII index of that character. So we put a character in. We propagate through the neural network. We propagate through the softmax layer. We obtain a character as an output. If that is not the character we wanted, well, we compute the difference between what it said and what we know to be true and use backpropagation to fix the weights and biases inside of the cell to get better results. That is very classical training. But what if the result was wrong not because the weights and biases inside of the cell were wrong, but because the input, the state input, H minus 1, was wrong? That input is a constant in this problem. There's not much you can do about it. So here, we are stuck. What is the solution? Well, the solution is to replicate the cell. And now, so this is a replica. It's reusing the exact same weights, OK? Now, let's say the output, Y1, is bad. I say, no, that's not it. This was the correct output I'm training. So I know what the correct output is supposed to be. So from that, I compute the error, the gradient. I backpropagate. I can fix the weights and biases in the cells to get a better output. And if needed, I can fix the weights and biases to get a better H0, the state flowing between those two stages of the cell. So now, I have a handle on at least H0. I still have no handle at all on H minus 1. If it is H minus 1 that was wrong, there is nothing I can do. So that is how you train recurrent neural networks. You have to unroll them across a certain length and give them a sequence of, let's say, characters. It will produce a sequence of output characters. If you are training, you know what the answer was supposed to be. So you use that to compute your error function, do your backpropagation, adjust the weights and biases. And it will work to a certain extent. To what extent? Oh, yes, small detail, if you want to go deep, you can actually stack the cells. Why? Well, two cells stacked like this, the API remains the same. It's still an input. It's still an output that feeds into a softmax layer. And there is still an input state and an output state that you feed back in. It's just that the output state now is slightly bigger. So that's how you go deep in the recurrent neural network. You stack those cells, and that becomes a new cell which still has input, output, input state, output state. And of course, you unroll it. So let's take this sentence. Let's say now, we use not characters but words as our inputs. Of course, there are technical problems doing that. A typical alphabet is maybe 100 characters. A typical vocabulary is around 30,000 words. So here, one-hot encoding gives you a vector of 30,000 components for each word. It's a bit heavy. I won't go into the details of how you handle that. It's called embedding. Whatever, let's just assume that we solved this problem. So we have this sentence. Michael was born in Paris, blah, blah, blah, blah, blah. And at the end, we have his mother tongue is. So if we train this model on English, probably it will have figured out that "his mother tongue is" is followed by the name of a language, like English, German, or Russian, something. Here, however, the correct answer is French because this guy was born in France. So let's imagine that we feed in, we have unrolled this neural network over, let's say, 30 words. Or let's say 10 words here, 10 words. And at the end, we have his mother tongue is, and we are asking the network to predict what is the next word. And the network says, English. So now, what we want to do is put on the outputs a sentence that says, blah, blah, blah, his mother tongue is French. And do backpropagation. But for this to work, the beginning of the sentence, the part where the information about Paris and where he's born is, has to be part of that example. And that example is longer than 10 words, which is our unroll size. There is simply no way whatsoever of putting that correct example plus correct output into a network that we unrolled over only 10 words because the distance is more than 10. And that's a fundamental limitation. If you want to capture this information, this behavior that, if he was born in France, probably his mother tongue is French, you will have to unroll this network over a long enough sequence to be able to input this full example into it. And if you do that, you will probably unroll it here over, how many, 50 words? Something like that. 
If you do that, the problem is that you end up with a very deep neural network. Yesterday, we saw neural networks of five layers. The big ones, like Inception and so on, are 40, 50, 60, 70 layers. You see here, we have a toy example. And we already see that we should be going to 50 or 100 layers just to solve this. So in recurrent neural networks, you always end up using very deep neural networks. And when I say deep, it's because the state signal has to go through all those cells. And remember, in each cell, the state signal is concatenated to the input, which goes through a neural network layer, produces a new state, which goes to the next cell, where it is again concatenated to the input and goes through another neural network layer. So from here to the end, we traverse at least one neural network layer per cell. That's how deep this gets. Deep neural networks have a technical problem. They tend not to converge when you train them. I won't go into the mathematical details. It's called the vanishing gradient problem. Basically, your gradient becomes 0. And since you use your gradient to go forward, that's a bit of a problem. So a solution was invented. I won't go into the mathematical explanations of why this solution works. I just want you to understand how it works. So would you prefer an explanation using the arrow soup of a diagram on the left? Or the incomprehensible equations on the right? Which one do you prefer? AUDIENCE: Arrows. MARTIN GORNER: Arrows? I'm a developer. And those equations look a little bit like code. And I do code. Sorry. But on the arrows, you see at least one thing. So I'll do some hand-waving mathematics again. You see that the state is actually split into two. You have the H state and the C state. And the C line there is actually configured in such a way that the network can decide to persist information on it, to leave it unchanged from iteration to iteration. And that is somehow what explains why, even if you line up many of those, since it has the possibility of leaving some part of the state unchanged, it gets around those vanishing gradient problems. End of hand-waving mathematics. So let's do it. Let's see how it works in practice. And actually, it's based on a concept of gates. So again, we concatenate the real input to the state from the previous step. And we compute three, you recognize the formulas, neural network layers. The sigma is for the sigmoid activation function. So the sigma outputs values between 0 and 1. And we call those gates because we will actually be multiplying these numbers with another vector to gate it. You know, if you multiply something by a very small value, there's not much that goes through. If you multiply something by something that is close to 1, almost all of the information goes through. So that's how we will be using them. Now, our input becomes, well, we have to size-adapt our input. I put on the side the sizes of all the vectors we are working with. That's just to tell you that there is nothing to see there. Inside of the cell, everything is of size n. That's the parameter that you decide as the size of your cell, OK? But our inputs, they are what they are. So we first need one neural network layer to adapt the size of our inputs to size n. So that becomes our new input. And now, the C line, the way you read this, this is a kind of memory. So the new state of the memory is the old state of the memory without what we chose to forget. We multiply by this forget gate. This is a series of numbers between 0 and 1. Plus what we chose to remember from our new input. That's the way to read it. So we multiply our new input by the update gate. Again, numbers between 0 and 1 that show which part of the information we want to retain from this input into our internal memory. And then, our new state is simply the memory-- the hyperbolic tangent here, that's not a neural network layer. That's just a size adaptation to put it between minus 1 and 1. So it's basically the memory cell multiplied by the result gate. So here, we choose what part of our internal memory we want to expose to the outside as a result. So that's the physical interpretation of these equations. We have those three gates. We size-adapt our input. And then, the new memory is the old memory minus what we want to forget plus what we want to remember from the input. And the result is this memory cell, modulated by what we want to actually expose as an output at that step. OK? 
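Written out, the gate equations just described look roughly like this. X'_t = [X_t, H_{t-1}] is the concatenated input, the circled dot means element-wise multiplication, and the subscripts f, u, r, c on the weights are only labels for the forget gate, update gate, result gate and the size-adapted input, not necessarily the slide's symbols:

    f_t   = \sigma(X'_t \cdot W_f + b_f)           (forget gate)
    u_t   = \sigma(X'_t \cdot W_u + b_u)           (update gate)
    r_t   = \sigma(X'_t \cdot W_r + b_r)           (result gate)
    X''_t = \tanh(X'_t \cdot W_c + b_c)            (size-adapted input)
    C_t   = f_t \odot C_{t-1} + u_t \odot X''_t    (new memory: what we keep, plus what we remember from the input)
    H_t   = r_t \odot \tanh(C_t)                   (what we expose as the output and new state at this step)
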
And now, this Ht will actually become part of the new state, and also drive the softmax layer if we add a softmax layer, which is represented here by this yellow circle. We usually represent the softmax layer as external to a cell. So this is called an LSTM. And this was invented specifically to make recurrent neural networks work and to solve this depth problem that, if you are unrolling over a large sequence, they tended not to converge. You will have to believe me on the mathematics of this: they converge. But you will have to-- I'm sure someone noticed that this choice of equations and this choice of arrows was somehow arbitrary. I mean, why point them here and not there? Many combinations exist. Lots of different variations of those RNN cells have been devised. And someone published a paper, a recap paper, where he tested all of them and found them to all do exactly the same thing. So in the end, the one we use is called the GRU. And I won't go into the details. It's basically a cheaper LSTM. Here are the equations. Not very different. Same API. But only two gates instead of three gates. And each gate has weights and biases. So we save part of our computation by not computing that third set of weights and biases. OK, so we will use the GRU. And now, let's implement a neural network that does a language model. So we will be training on sequences of characters. And when I say language model, it's actually a network that we will use to predict-- we will train it to predict what the next character is. Like here, St. Joh, I will teach it to produce the same sequence shifted by one. So actually, I will teach it to understand that the next character should be an "n" because this is St. John. So how do we do that? In TensorFlow, now I'm using a higher level API of TensorFlow than what I had been using yesterday. I just call GRUCell. That creates a GRU cell. And I call this higher level, because you've seen this GRU cell has actually a couple of neural network layers inside. It has two gates. That's at least two layers. So it has a host of weights and biases which are actually defined in the background when I call this. That's why it's a higher level API. It does its own weights and bias declarations in the background. Now, I said we want to go deep. So let's stack this cell three high. That's how we do deep recurrent neural networks. There is a TensorFlow call for that. It's called MultiRNNCell. Give it a cell. You say how many times you want to stack it. And that gives you another cell. 
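As a sketch in the TensorFlow 1.x API this talk is based on (the constant names CELLSIZE and NLAYERS are mine, and the real code on GitHub may differ in its details):

    import tensorflow as tf

    CELLSIZE = 512   # internal size n of one GRU cell, the one parameter you choose
    NLAYERS = 3      # stack the cell three high to go deep

    # each GRUCell declares its own gate weights and biases in the background;
    # MultiRNNCell stacks them into a new cell with the same API (input, output,
    # input state, output state), whose state is simply NLAYERS times bigger
    cells = [tf.contrib.rnn.GRUCell(CELLSIZE) for _ in range(NLAYERS)]
    multicell = tf.contrib.rnn.MultiRNNCell(cells, state_is_tuple=False)
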
Because we have seen already that these three stacked cells actually have the same API as one cell. So you can use it as a new cell. And now, we need to unroll this. For that, we call in TensorFlow this dynamic_rnn function, which is a bit of magic. And that's what will unroll this sequence. So how many times? You don't see it in the parameters because it's actually specified in the shape of the input tensor, x. If this input tensor has eight, or let's say 30, characters in it, it will be unrolled over a sequence of 30 characters. And actually, the little part of magic, really it's magic, we will not be using it here, but here is what dynamic_rnn can also do. Remember that we will be training this on batches, as always. We always train on batches. So in this case, all my batches will be sequences of the same size. That's the case in my model. In other models, I might not have sequences of the same size. Dynamic_rnn can handle that. If you pass it a batch of sequences, even if they are not of the same size, alongside that you pass the actual sizes, and it will, for each sentence in the batch, unroll your network the correct number of times. And then, also pass the output from the correct stage. It's super helpful. We will not be using it here because all of our sequences have the same size. But that is super helpful. All right, so now, we need to implement our softmax layer from those outputs, H'0 to H'8, basically the outputs at the bottom. We know how to do a softmax layer, OK? But here, since we have unrolled, remember each stage here is a copy of the previous one. We are sharing the weights. So on the softmax side, we have to share the weights, as well. So we could do this using the TensorFlow APIs. You know, define one softmax layer. And then, for the next one, call an API that retrieves the weights of the previous one and reuses them. That's too complicated here. Actually, there is a little hack that you can use. Remember, we are always training on batches, OK? So this will be taking a batch of sequences, outputting a batch of sequences. Each sequence is a sequence of characters. So what is the difference between having, let's say, 8 softmax cells that each process a batch of 100 characters, or having just one that processes 800 of them? That's the same thing. Let's just do one. And we will put all of those outputs in the same bag and just use that one cell. Anyway, we were supposed to be sharing the weights, so defining just one cell is a very good way of doing that. So that's what I do with my reshape operation there. I take all of those outputs, and you have to remember that there is a batch of outputs on each of those arrows. And I put them in the same bag. Feed them through just one softmax layer. And then, I will reshape them back into the correct shape to finish. Again, using higher level APIs in TensorFlow, so when I call linear, that just does the weighted sums. One layer, it computes simply the weighted sums. No activation function. And then, I call softmax. And that applies the softmax activation function. And linear, again, defines the weights and biases in the background. That's why I call it a higher level function. And now, I'm ready to compute my loss function and derive it and actually train the network. It's complicated enough to understand how recurrent neural networks work. But it's just as complicated to actually feed them data correctly. You see lots of arrows. So we will have to do quite a bit of plumbing to make this happen. 
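Before the plumbing, here is the model part so far, continuing the same sketch (Xo is the one-hot encoded input batch and Hin the input state placeholder, both defined in a moment; the variable names are mine):

    # unroll the cell: the number of time steps comes from the shape of Xo, not from a parameter
    Yr, H = tf.nn.dynamic_rnn(multicell, Xo, initial_state=Hin)
    # Yr: [BATCHSIZE, SEQLEN, CELLSIZE] outputs, H: the output state to feed back in next time

    # the softmax trick: put all characters from all batches and all unroll steps in the same bag
    Yflat = tf.reshape(Yr, [-1, CELLSIZE])                 # [BATCHSIZE*SEQLEN, CELLSIZE]
    Ylogits = tf.contrib.layers.linear(Yflat, ALPHASIZE)   # one shared layer of weighted sums, no activation
    Yo = tf.nn.softmax(Ylogits)                            # probabilities over the ALPHASIZE characters
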
Let's try to get our inputs and outputs right, OK? So we will be inputting sequences of characters by batches. So my inputs are a batch of sequences. The sequence length that I have chosen is 30. I will be unrolling over 30 characters. Usually, on the diagrams, I only represent 8 of them because 30 would not fit on my slide. But in the code, it was 30. Now, I need to one-hot encode them. So I'm adding a new dimension. Each character becomes a vector of 100 components because I am working with an alphabet of 100 possible characters. So now, it's batch size, sequence length, and alpha size. Those are my actual inputs. My state, again, I have a batch of states. Since I'm feeding in a batch of inputs, I will produce a batch of output states. And the states, each of those state vectors is, of course, of size n, cell size, whatever cell size I have chosen to use. Remember, each cell has this one configuration parameter, which is its internal size. But since I have stacked those cells three high, it will actually-- the actual output state here will be three times the cell size. OK, we are ready to write this model. So I define a placeholder for my input sequences, a batch of sequences of size sequence length. I one-hot encode them, which is why I'm adding a new dimension to this tensor, which is the size of my alphabet. Again, each character becomes a vector of 100 components. To be really precise, my alpha size is 98, so 98 components. I'm working with an alphabet of 98 characters here. I need to define a placeholder for my correct answers. And actually, the correct answers are very easy to obtain here. I'm just teaching it to output the same sequence shifted by one. So basically, to predict what the last character will be. So again, the correct answers will be a batch of sequences of 30 characters, which I one-hot encode. I need a placeholder also for my input state. And we have seen that the input state is made of three of those internal vectors. So that's three times cell size. And now, I'm ready to write my model. So the model is what was here, OK? That's the model. This model, with this little trick that we have seen before, this model at the output of its softmax layer actually produces an output that is batch size multiplied by sequence length. You remember, we put all the characters from the batches and from the different stages of the unrolled sequence in the same bag. And now, to determine characters from those probabilities, I use argmax. Why? Because each of those vectors is 100 components with probabilities. Argmax is a function that gives me the index of the biggest number in this vector. So the index in this vector is actually the ASCII code of the character that has been predicted. So these are my predictions now in ASCII encoding in characters. And I just need to reshape them back to have, again, a batch of sequences of 30 predicted characters. And now, I'm ready to input-- to give my loss to an optimizer and ask TensorFlow to optimize to actually train my network. So this is the step, as yesterday, with this loss. TensorFlow computes a gradient. From this gradient, it can-- sorry. And this loss is, of course, the difference between the sequence of characters that was predicted and the sequence of characters that I wanted to predict. This difference becomes a loss. That loss is derived, becomes a gradient. We take a small step along this gradient, which is actually in the space of weights and biases. So taking a small step means we modify slightly our weights and biases and continue. That's the training. 
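Put as a sketch in the same style, the plumbing just described looks roughly like this (SEQLEN = 30 and ALPHASIZE = 98 here; the characters travel as integer codes and are one-hot encoded inside the graph; an approximation of the slide, not a copy of it):

    SEQLEN = 30
    ALPHASIZE = 98

    X = tf.placeholder(tf.uint8, [None, SEQLEN])       # batch of input sequences, as character codes
    Xo = tf.one_hot(X, ALPHASIZE, 1.0, 0.0)            # [BATCHSIZE, SEQLEN, ALPHASIZE]
    Y_ = tf.placeholder(tf.uint8, [None, SEQLEN])      # correct answers: the same sequences shifted by one
    Yo_ = tf.one_hot(Y_, ALPHASIZE, 1.0, 0.0)
    Hin = tf.placeholder(tf.float32, [None, CELLSIZE * NLAYERS])  # the input state, the second input of an RNN

    # ... the cell, dynamic_rnn and shared softmax layer from the previous sketches go here,
    # producing Ylogits, Yo and the output state H ...

    Yp = tf.argmax(Yo, 1)                              # index of the most probable character at each step
    Yp = tf.reshape(Yp, [-1, SEQLEN])                  # back to a batch of predicted sequences

    loss = tf.nn.softmax_cross_entropy_with_logits(
        logits=Ylogits, labels=tf.reshape(Yo_, [-1, ALPHASIZE]))
    train_step = tf.train.AdamOptimizer(1e-3).minimize(tf.reduce_mean(loss))
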
One last little gotcha. So we have to take our input text and actually cut it up in those sequences of 30 characters. So initially, I thought, well, that's easy, you know? You take a piece of text. How do you cut it up in sequences of characters? Well, you cut, and cut, and cut, and cut. And then, if you need a batch of them, you take the first 100 sequences you have. And you put that in a batch. That did not work. Why? Let's see here. That's my first batch. Let's look at the first sequence in the batch: The quick-- you know what that is going to be. The quick brown fox something. Well, when my neural network processes the quick, it also outputs an output state. And in the next iteration, that output state will become the input state for the next sequence. If I want this to be correct, that input state must correspond to the continuation of the quick brown fox, and so on, which means that the sentence has to continue over all of the first slots of all of my batches. So the batching here is not completely trivial. You cut up your text in batches, in sequences. But the way to batch them together, since you have to pass the correct state at each stage, is that the beginning of the text has to be spread across the first slot of successive batches. And then, from some point far, far, far later in the text, you can start filling the second line of the batches. It's just plumbing. I wrote for you the five lines of code that do this. It's five lines. I spent four hours doing it, including tests. I don't do arithmetic. It's full of modulos and divides. And I wrote unit tests and hacked it until the unit tests passed. It's called test-driven debugging. Sorry, test-driven development. That's what developers do. All right, so yeah, small gotcha on the batching. But whatever. Just use the code on-- this is not actually important. Just use the function that will cut up the text correctly for you. And you're ready to train. 
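The actual helper is in the GitHub repository; purely to illustrate the idea, a rough NumPy sketch (the name and details here are mine) could look like this. The text is already encoded as a vector of integer character codes, and the key point is that batch number b+1 continues, row by row, exactly where batch number b stopped:

    import numpy as np

    def sequence_batches(coded_text, batch_size, seq_len, nb_epochs):
        data = np.array(coded_text)
        # how many full batches of sequences fit in the text
        nb_batches = (len(data) - 1) // (batch_size * seq_len)
        rounded_len = nb_batches * batch_size * seq_len
        # each row of xdata is one long contiguous slice of the text,
        # so row i of batch b+1 is the direct continuation of row i of batch b
        xdata = np.reshape(data[0:rounded_len], [batch_size, nb_batches * seq_len])
        ydata = np.reshape(data[1:rounded_len + 1], [batch_size, nb_batches * seq_len])  # shifted by one
        for epoch in range(nb_epochs):
            for b in range(nb_batches):
                x = xdata[:, b * seq_len:(b + 1) * seq_len]
                y = ydata[:, b * seq_len:(b + 1) * seq_len]
                yield x, y, epoch
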
And this is actually the full code of this neural network on one slide. So let's go through this again. A placeholder for my input sequences. I one-hot encode them. I'm actually inputting sequences of characters, OK? And all the people with cameras, this is on GitHub. And the GitHub link is on the last slide. So please take pictures. My Twitter handle is over there. Tweet them. But then, you will be able to go on GitHub and actually retrieve this. Then, my expected outputs, Y_, "Y underscore". Again, I define a placeholder for them. I will need to feed them during training. And the first thing I do is that I one-hot encode them. I will also need, and this is different from normal neural networks, I will also need a placeholder for my input state. Remember? RNNs have an input and an input state. Two inputs. Now, I'm ready to write my model. So I chose the GRU cell. I stack it three high. And I unroll it as many times as x has components in it. So here, my unroll size is sequence length. And that's 30. I chose 30 characters as the unroll size of my recurrent neural network. I do my little trick with the softmax so that I can implement just one softmax node. I feed the output through my softmax node. Here, I apply argmax to retrieve, from the softmax probabilities, the index of the highest probability. And that's the character I'm predicting. I reshape this back to have a batch of predicted sequences. Also, somewhere in the middle in there, I had those probabilities. I take those probabilities. And I compute the distance between what it says and what I wanted. That's my loss. I give my loss to the optimizer. I obtain a training step. And this training step is actually that gradient, which is computed on this batch of training characters. And which, if I follow it by a little step, will modify my weights and biases and bring me to somewhere where this network works better, where it has a smaller error function. And now, my training loop. You will see, this is very similar to what we had previously. We use this magic plumbing function that I gave you to load sequences of characters in the correct way. And once I have a batch of sequences of characters, I run session.run of my training step. I have to give it the input characters. I have to give it the expected output. And, since this is a recurrent neural network, I have to give it the input state. And this will give me an output state. And you see the magic line, why this is a recurrent neural network. That's the last line there in the red. Input state becomes-- sorry, the output state becomes an input state. That's why it's recurrent. We're passing the state around. 
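In code, that loop is roughly this, continuing the sketches above (sequence_batches is the illustrative batching helper from before, coded_text is the text encoded as character codes, and BATCHSIZE and NB_EPOCHS are constants you choose):

    import numpy as np

    BATCHSIZE = 100
    NB_EPOCHS = 30

    istate = np.zeros([BATCHSIZE, CELLSIZE * NLAYERS])   # the very first input state is all zeros
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for x, y_, epoch in sequence_batches(coded_text, BATCHSIZE, SEQLEN, NB_EPOCHS):
            # one training step: feed the inputs, the expected outputs, and the input state
            _, ostate = sess.run([train_step, H], feed_dict={X: x, Y_: y_, Hin: istate})
            # the magic line: the output state becomes the input state of the next step
            istate = ostate
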
All right. So we are done. We've built a recurrent neural network. Now, we want to actually train it. So let's go to a demo. I will be training this on the complete works of William Shakespeare. That's not quite big data. The complete works of William Shakespeare are five megabytes. Yes, that puts things in perspective. But it's good stuff. So here, we see it's training on sequences. So here are those sequences of 30 characters. And here's a batch of them. It's actually training on those sequences. Here, predicting not much at all. It's just the beginning. And from time to time, I stop the training. And I take just my one cell. Remember, I have just one cell. It's replicated for the purpose of training, but it's just one cell. And this one cell, it has become-- well, once it is trained, it will have become-- a language model. So what I can do with it is generate a new Shakespeare play. How do I do that? Well, I take the cell. I put in garbage, a random character. That gives me an output character, or rather probabilities for the output character, which is the next character, and an output state. I feed back the output state into the input state, and I feed back the output character as the new input. And I continue. And this is a state machine that will start generating text. You see here, it's-- yeah. That's not quite Shakespeare yet. It's training. It's a bit slow on my machine. I usually have a GPU connected here to-- it brings me a nice 10x speedup, or 6x roughly. But still, well, it has done 50 more batches. I will leave it running. Let's go and see, sorry, here, on this slide, what it produces. So at the beginning, it gives you this. As I said, not quite Shakespeare. But after only a tenth of an epoch (what we call an epoch is when you have seen the entire training data set once), so after having seen only a tenth of what Shakespeare produced in his life, this is what we have. Still not quite Shakespeare, but you see there is some structure to it. After two tenths, hey, this looks better! It's starting to actually spell English almost correctly. And there are those things in capital letters at the beginning that are starting to look like characters, like character names. Even slightly later, oh, look! And you have to remember that this is a neural network that is predicting character by character. It first has to learn to spell English before going to higher orders of structure. So it's still not completely exact English. But it's starting to look like English. At least, Shakespearean English. And you see it has character names. And it's actually inventing new character names. Here, Pordia and Henry Blutius-- who can tell me, no, seriously, who can tell me if Shakespeare actually used Henry Blutius in his work? What do you call it? I'm giving you the answer. He didn't. But it's a very credible Shakespearean character name. And this is what you get after 30 epochs. So it actually has a title. There is an act. There's a scene. After the scene, it tells you where this is happening. And look, it knows how to put stage directions in brackets, who enters, with whom, and so on. It has even picked up stuff like, character names are all caps, and when the character is a function, like Lord or Chamberlain, it's only the first character that is a capital. It has picked that up completely correctly as well. And it's actually English. So now that we have this, let's try to-- so this was on slides. I will stop this. What I have done previously is that I trained this for actually 30 epochs. And I saved my weights and biases. So I'm ready to just replay it and generate a new Shakespeare play. Let's generate a new one live in front of you. Here it is. Let me stop it. Whoops, sorry about that. [APPLAUSE] MARTIN GORNER: Is someone brave enough to come and play hallucinated Shakespeare on the stage with me? Come on. Yes! Thank you. Big applause. SPEAKER: Come on up? MARTIN GORNER: Thank you. Please, come up. You will have to speak loudly. But that's how it is in a theater. You don't have a microphone. You speak. So you can read off the screen here. We will alternate. So maybe I start, and then you do the next one. So let's say enter Bardolph and Boult. The manner off with my bestowers that you shall not see him, and we are now to be the brother's wife and force, to be so many and most grave. SPEAKER: What art thou again? What needs thy life? Then, what they do not dote on thee. The word will be at thee. And take my heart to thee. And they distemper. Will thou beat me well to say god save my son? [APPLAUSE] MARTIN GORNER: Thank you so much. SPEAKER: Thank you. MARTIN GORNER: Thank you. That was fantastic. Thank you. Actually, I tried to do this also on the Python code of TensorFlow itself. That was fun. So in the beginning, you had this. Looks like Python? Maybe. But very, very quickly, it actually picks up Pythonic structures, like those keywords, and it's generating something that looks like function calls. Slightly later, it actually correctly uses the keywords with function names-- a hallucinated function name. It's actually quite inventive in the function names. And a colon at the end. It's still getting the nested parentheses wrong. And after a longer while, it can recite the Apache license in full. Yes. It's open source compliant, open source compliant. And more interestingly for us, designers of recurrent neural networks, it can actually open and close the nested parentheses correctly to a depth of three, which is quite impressive. And what I find fantastic is that it has figured out how to do Python comments. And it's giving me TensorFlow advice in those comments. But look, it makes sense! Check that we have both scalar tensors for being invalid to a vector of one indicating the total loss of the same shape as the shape of the tensor. I'm sure this makes just as much sense as everything that I've been saying since the beginning here. 
All right, and a small credit to a gentleman called Andrej Karpathy, who actually wrote this neural network for the first time. He published a blog about it. He tried it on many different things. He generated a business book for startups. And he tried to generate an algebra book in LaTeX. Actually, after training, this produced almost valid LaTeX. So he had to hack it a little bit to make it compile. But then, this looks like an algebra book. There is even an attempt at a diagram. And the line I prefer is how the neural network figured out how to write a proof. Look at the very top. "Proof omitted." That's so clever. All right. So that's basically all I wanted to show you. Well, this is how we generate it. So here, I take just one cell. And basically, in a loop, I feed in a character. I take the output, feed it back as the input, and feed the output state as the input state, and just do this in a loop. A couple of applications of this. Oh, yes. Actually, we still have a little bit of time. This time, I've been using TensorBoard to visualize my inputs and outputs. Where is my TensorBoard? Somewhere. Sorry, I'll find it. Here. In the last session, I was just throwing the outputs into matplotlib, which is the very standard Python plotting library. But there is a tool dedicated to visualizing training in TensorFlow. It's called TensorBoard. And I advise you to use it, especially if you do distributed training, or training on remote servers. It can connect to a bucket and get the information from there and visualize it. So here, when I was training this network, I configured it to actually do training and validation. I put one Shakespeare play aside for validation to test my network. And if you'll remember the session from yesterday, I find it very important to follow my loss curves, both the training and the test loss curve, on the screen. This is what I got. And actually, first of all, who sees something wrong? Overfit, yeah. And so now, the question is, why is it overfitting here? I will give you the answer because you can't guess it. But here, I was actually training on a small subset of the Shakespeare corpus. So here, it was overfitting because of lack of data. And since I had this on the curves, I wanted to show it to you because you certainly remember that somewhere, where is it? This one. Somewhere here, I had this helpful engineering chart, which helps you interpret where the overfitting comes from. And we went yesterday through the bad network. We went through too many neurons. We never had the "not enough data" case. So I tried with not enough data. And yes, it also gives you this very recognizable pattern in the curves. And as soon as I train with more data, this is what I have. So here, the two curves follow each other closely. And I know that I have solved the problem. So actually, I was doing this because I was trying to add dropout into my network to make it work better. No. It was misbehaving just because of lack of data. Dropout would not have solved that. All right, and so a couple of applications, practical applications to finish. We've seen how to produce a character-by-character model. We can also use this not character by character, but word by word. So as I said previously, with a word, it's a bit more complicated because to one-hot encode a word, you need to encode it on a vector of, this time, 30,000 components. Because that's the typical size of a vocabulary in a typical language. So those are big. So on the input side, there is actually a very simple solution. 
How do you reduce the size of a big vector? Well, you use one layer of a neural network and produce fewer outputs. This is called embeddings. And that layer can either be part of your training-- then, your embeddings are learned as part of the training. Or, you can use some neural network that has been already trained, typically trained on the English language generically, and that just encodes words into smaller vectors. There is a very famous neural network that has been built for that. It's called Word2Vec. Already trained, available on GitHub. You can use that to encode your English words as smaller vectors, if your problem deals with English words. And so once we have solved this problem of how to input words instead of characters, you can, for example, use a recurrent neural network like this to predict not what the next word is, but a categorization of a sequence. And this is used in newspapers to automatically categorize articles as geopolitics, science, sports, and so on. Works very well. How do you do translation? Well, to do translation, that's how Google Translate works, you tack two of those recurrent networks end to end. To the first one, you feed an English sentence plus a stop symbol. And then, you continue. And you ask it to output the French sentence. And what you have on the input, there is a choice. Normally, you should be inputting what your network outputs. But people have also tried to input what the network should output. So both options exist. And they give you different results. You can read about this in the literature. So this is how translation works. Of course, you have a big problem at the end. I won't go into that. Because to do the softmax layer there, you actually want to produce a vector of 30,000 probabilities. That's a bit heavy. So there are ways of mitigating that. But that's an active area of research. One that is implemented in TensorFlow is called sampled softmax. But there are many others because this is an active area of research. How to do this softmax layer to produce 30,000 probabilities each time, which is a bit heavy. And one more is image labeling. So here, it's a very simplified version of image labeling. Image labeling is you take an image, and you want to produce a sentence. Like, this is a little girl holding a teddy bear. This is a truck in the desert. So this is actually also a translation problem. You take vectors from an image. And you apply a recurrent neural network to produce a sequence of words which you want to be the description of this image. How do you encode an image as a vector? Well, there are plenty of solutions. One of them is to take an off-the-shelf image recognition neural network, like Inception, and just chop off the last couple of layers. Normally, what Inception gives is categories. This is a truck. This is a beach. This is a lizard. That's not what you want. But all the layers before that are actually encoding the image in some meaningful way into a vector. You can use that as a fixed encoding function. And input the vector corresponding to the image here. Produce this output sequence. And sometimes, it works really well. This is what was generated. A herd of elephants walking across a dry grass field, and so on. And then sometimes, yeah, not quite. Thank you. [APPLAUSE] [MUSIC PLAYING]