[MUSIC PLAYING]

LAURENCE MORONEY: Welcome to episode 2 of this series of Zero to Hero with Natural Language Processing. In the last video, you learned how to tokenize words using TensorFlow's tools. In this one, you'll take that to the next step: creating sequences of numbers from your sentences and using tools to process them so they're ready for training neural networks.

Last time, we saw how to take a set of sentences and use the tokenizer to turn the words into numeric tokens. Let's build on that now by also seeing how the sentences containing those words can be turned into sequences of numbers. We'll add another sentence to our set of texts, and I'm doing this because the existing sentences all have four words, and it's important to see how to manage sentences, or sequences, of different lengths. The tokenizer supports a method called texts_to_sequences, which performs most of the work for you. It creates sequences of tokens representing each sentence.

Let's take a look at the results. At the top, you can see the list of word-value pairs for the tokens. At the bottom, you can see the sequences that texts_to_sequences has returned. We have a few new words such as amazing, think, is, and do, and that's why this index looks a little different than before. And now we have the sequences. So, for example, the first sequence is 4, 2, 1, 3, and these are the tokens for I, love, my, and dog, in that order.

So now we have the basic tokenization done, but there's a catch. This is all very well for getting data ready for training a neural network, but what happens when that neural network needs to classify texts containing words that it has never seen before? This can confuse the tokenizer, so we'll look at how to handle that next.

Let's now look back at the code. I have a set of sentences that I'll use for training a neural network. The tokenizer gets the word index from these and creates sequences for me. So now, if I want to sequence sentences containing words like manatee that aren't present in the word index, because they weren't in my initial set of data, what's going to happen? Well, let's use the tokenizer to sequence them and print out the results. We see this: "I really love my dog," a five-word sentence, ends up as 4, 2, 1, 3, a four-word sequence. Why? Because the word "really" wasn't in the word index. The corpus used to build it didn't contain that word. And "my dog loves my manatee" ends up as 1, 3, 1, which is my, dog, my, because "loves" and "manatee" aren't in the word index. So, as you can imagine, you'd need a really big word index to handle sentences that are not in the training set. But in order not to lose the length of the sequence, there's also a little trick that you can use. Let's take a look at that.

By using the oov_token property and setting it to something that you would not expect to see in the corpus, like <OOV> (angle bracket, OOV, angle bracket), the tokenizer will create a token for that and then replace words that it does not recognize with the out-of-vocabulary token instead. It's simple but effective, as you can see here. Now the earlier sentences are encoded like this. We've still lost some meaning, but a lot less, and the sentences are at least the correct length. That's a handy little trick, right?
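To make these steps concrete, here is a minimal sketch of the tokenization, sequencing, and oov_token code being described. The transcript doesn't show the code itself, so the corpus and test sentences below are stand-ins based on the examples mentioned above, and the printed values may differ from those in the video.

```python
# Minimal sketch of tokenizing, building a word index, and turning
# sentences into sequences, including the out-of-vocabulary token trick.
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]

# oov_token replaces words the tokenizer has never seen with a reserved
# token, so unseen words don't silently vanish from the sequence.
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)                      # word -> token mapping

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)                                 # one list of tokens per sentence

# Sentences containing words that were not in the training corpus:
test_data = [
    'i really love my dog',
    'my dog loves my manatee',
]
print(tokenizer.texts_to_sequences(test_data))   # unknown words become the <OOV> token
```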
And while the OOV token helps keep the sequence the same length as the sentence, you might wonder how, when it comes to training a neural network, it can handle sentences of different lengths. With images, they're usually all the same size. So how would we solve that problem? The advanced answer is to use something called a RaggedTensor. That's a little bit beyond the scope of this series, so we'll look at a different and simpler solution: padding.

OK. So here's the code that we've been using, but I've added a couple of things. The first is to import pad_sequences from preprocessing. As its name suggests, you can use it to pad our sequences. Now, if I want to pad my sequences, all I have to do is pass them to pad_sequences, and the rest is done for me. You can see the results for our sentences here. First is the word index, then the initial set of sequences, and the padded sequences are next. So, for example, our first sentence is 5, 3, 2, 4, and in the padded sequence we can see that there are three 0s preceding it. Why is that? Well, it's because our longest sentence had seven words in it. So when we pass this corpus to pad_sequences, it measures that and ensures that all of the sentences have equally sized sequences by padding them with 0s at the front. Note that OOV isn't 0; it's 1. 0 means padding.

Now, you might not want the 0s in front; you might want them after the sentence. Well, that's easy. You just set the padding parameter to post, like this, and that's what you'll get. Or if you don't want the length of the padded sentences to be the same as the longest sentence, you can specify the desired length with the maxlen parameter, like this. But wait, you might ask, what happens if sentences are longer than the specified maxlen? Well, then you can specify how to truncate, either chopping off the words at the end with post truncation, or from the beginning with pre truncation. And here's what post truncation looks like. But don't take my word for it. Check out the Codelab at this URL, and you can try out all of the code in this video for yourself.

Now that you've seen how to tokenize your text and organize it into sequences, in the next video, we'll take that data and use it to train a neural network. We'll look at a data set of sentences that are classified as sarcastic and not sarcastic, and we'll use that to determine if sentences contain sarcasm. Really? No, no. I mean, really.

[MUSIC PLAYING]
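For reference, here is a minimal sketch of the padding options described in the video, reusing the stand-in corpus from the earlier sketch. The transcript doesn't include the code itself, so the maxlen value of 5 below is purely illustrative.

```python
# Minimal sketch of pad_sequences: default pre-padding, post-padding,
# and a fixed maxlen with post truncation.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Default: pad with 0s at the front, up to the length of the longest sequence.
print(pad_sequences(sequences))

# Put the 0s after the sentence instead of before it.
print(pad_sequences(sequences, padding='post'))

# Force a fixed length; sentences longer than maxlen are truncated,
# here by chopping words off the end (post truncation).
print(pad_sequences(sequences, padding='post', truncating='post', maxlen=5))
```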