[MUSIC PLAYING]

MARK OMERNICK: Good morning. My name is Mark Omernick, and I'm a software engineer with Google AI. Today I'll be talking about two projects. The first is enhanced Unicode support across the TensorFlow code base. The second is a new tensor type called RaggedTensors, intended to efficiently represent sequence data.

First, let's take a quick look at how we've improved Unicode support in TensorFlow. Unicode is a way of encoding characters from nearly every written language using sequences of bytes. Here, these four characters can be represented as four triplets of bytes, so a string containing these four characters would be 12 bytes long in total. Previously, TensorFlow assumed that strings were indexed by individual bytes, ASCII style. That led to issues like this, where string split would break Unicode characters apart at byte boundaries rather than character boundaries, and substr would index by bytes instead of characters. However, now that we've added Unicode support to TensorFlow, we can correctly handle multi-byte characters: unicode_split now splits into proper triplets, and substr, with the UTF8_CHAR unit, indexes by UTF-8 characters. In addition to string splitting, TensorFlow now supports many other Unicode-aware string operations, from Unicode encoding and decoding to string length analysis.

For the second part of this presentation, I'd like to introduce a new tensor type, RaggedTensors, that we designed to handle text and other variable-length sequences. RaggedTensors are a native representation for sequences of varying shape. Here you can see a RaggedTensor containing three batch items: the first a tensor with two strings, the second a tensor with four strings, and the third a tensor with one string, without any additional padding or user-facing logic.

RaggedTensors are different from SparseTensors in one key way. SparseTensors assume that the underlying dense tensor is regularly shaped and that unmentioned values are missing. RaggedTensors, on the other hand, make no such assumption. Here, for instance, the SparseTensor interprets the first batch element as "John", null, null, while the RaggedTensor interprets it as simply "John".

A RaggedTensor can contain any number of irregular dimensions. Here, for instance, we have a three-dimensional RaggedTensor that represents every character in every token in a batch of three sequences. There are variable numbers of tokens per sequence and variable numbers of characters per token, but with RaggedTensors you don't need to worry about maximum sizes, padding, or anything else. RaggedTensors are a native TensorFlow representation for any variable-length sequence of data, from words to images and beyond. You could imagine using RaggedTensors to contain the set of still frames in a batch of videos, where each video is a different length.

So how do you use RaggedTensors? Let's start with building them. To create a RaggedTensor, you'll need a flat tensor of values and some specification of how to split those values into batch items. Once you have a RaggedTensor, you can perform standard tensor operations on it, like concatenation and slicing, even within irregular dimensions. RaggedTensors are natively supported by over 100 TensorFlow core ops, ranging from math ops through string handling to reductions. And if you need to operate on each value in a RaggedTensor, we provide a native map function. You can use this to apply ops, or even entire subgraphs, to every value in a RaggedTensor.
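[Below is a minimal sketch of building and operating on RaggedTensors with the TF 2.x Python API, added for reference. The sample data and the specific op choices (tf.ragged.constant, tf.RaggedTensor.from_row_lengths, tf.ragged.map_flat_values) are illustrative assumptions, not code shown in the talk.]

```python
import tensorflow as tf

# Build a RaggedTensor directly from nested Python lists (example data is made up).
queries = tf.ragged.constant([["Who", "is", "George", "Washington"],
                              ["What", "is", "the", "weather", "tomorrow"],
                              ["Goodnight"]])

# Or build one from a flat tensor of values plus a row-splitting specification.
rt = tf.RaggedTensor.from_row_lengths(
    values=tf.constant([3, 1, 4, 1, 5, 9, 2]),
    row_lengths=[4, 0, 3])

# Standard tensor operations work, even within irregular dimensions.
print(queries[1, :3])                         # slicing: ["What", "is", "the"]
print(tf.concat([queries, queries], axis=0))  # concatenation along the batch axis

# Many core ops accept RaggedTensors directly, e.g. string ops and reductions.
char_counts = tf.strings.length(queries, unit="UTF8_CHAR")
print(tf.reduce_sum(char_counts, axis=1))     # total characters per query

# map_flat_values applies an op (or a whole subgraph) to every value.
upper = tf.ragged.map_flat_values(tf.strings.upper, queries)
```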
To illustrate how to use RaggedTensors in a model, let's consider using a bag of character-level embeddings to create a token-level embedding (see the code sketch after this transcript). We start by taking a RaggedTensor of tokens, separated by batch, and applying unicode_decode, a new op that outputs a RaggedTensor of Unicode code points separated by batch and token. We can then use map_flat_values to get an embedding for each of these code points. Now char_embedding is a four-dimensional RaggedTensor with batch, token, character, and embedding dimensions. We can convert it into a standard four-dimensional tensor, reshape it so that it is token-major, run a convolution over each character in each token, then reshape it back into a dense 4D tensor with batch, token, character, and embedding dimensions. That dense 4D tensor can be converted back into a 4D RaggedTensor, which removes any padding. This RaggedTensor can then be reduced, via reduce_mean, to create per-token embeddings. At the end, we have a tensor of embeddings, one for each token, built from characters without any extraneous padding.

For more information, you can take a look at the tutorials available here. Please try them out and give your feedback on GitHub. Thank you.

[MUSIC PLAYING]
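[Below is a rough sketch of the character-to-token embedding pipeline described above, using the TF 2.x Python API. The vocabulary size, embedding width, sample tokens, and the variable name char_embeddings are illustrative assumptions, and the convolution step is only noted in a comment (it would run on a padded, token-major dense copy) so the ragged-specific steps stay in focus.]

```python
import tensorflow as tf

# Hypothetical sizes for this sketch; the sample data below is ASCII-only,
# so a small code-point table is enough here.
NUM_CODEPOINTS = 128
EMBED_DIM = 16

# A batch of tokenized sequences with varying numbers of tokens per sequence.
tokens = tf.ragged.constant([["Hello", "world"],
                             ["RaggedTensors", "are", "neat"]])

# 1. Decode each token into its Unicode code points.
#    Result: a 3D RaggedTensor with (batch, token, character) dimensions.
char_codepoints = tf.strings.unicode_decode(tokens, "UTF-8")

# 2. Use map_flat_values to look up an embedding for every code point.
#    Result: a 4D RaggedTensor with (batch, token, character, embedding) dims.
char_embeddings = tf.Variable(tf.random.normal([NUM_CODEPOINTS, EMBED_DIM]))
char_embedding = tf.ragged.map_flat_values(
    tf.nn.embedding_lookup, char_embeddings, char_codepoints)

# (A character convolution could be inserted here by converting to a dense
#  tensor with .to_tensor(), convolving, and converting back with
#  tf.RaggedTensor.from_tensor, which strips the padding again.)

# 3. Reduce over the character axis to get one embedding per token.
#    Result: a 3D RaggedTensor with (batch, token, embedding) dimensions.
token_embedding = tf.reduce_mean(char_embedding, axis=2)
```

The resulting token_embedding can feed directly into downstream ragged-aware ops, or be padded with .to_tensor() wherever a dense input is required.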