
  • [MUSIC PLAYING]

  • MARK OMERNICK: Good morning.

  • My name is Mark Omernick, and I'm a software engineer

  • with Google AI.

  • Today I'll be talking about two projects.

  • The first is enhanced Unicode support across the TensorFlow

  • code base.

  • And the second is a new tensor type

  • called RaggedTensors, intended to efficiently represent

  • sequence data.

  • First, I'll take a quick look at how we've improved Unicode support

  • in TensorFlow.

  • Unicode is a way of encoding characters

  • from nearly every written language using

  • sequences of bytes.

  • Here, these four characters can be represented

  • as four triplets of bytes.

  • A string containing these four characters

  • would be 12 bytes long in total.
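
For example, in plain Python (the four-character string here is illustrative):

    text = "自然语言"  # four CJK characters, three bytes each in UTF-8
    encoded = text.encode("utf-8")
    print(len(encoded))                            # 12
    print([len(c.encode("utf-8")) for c in text])  # [3, 3, 3, 3]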

  • Previously, TensorFlow assumed that strings

  • were indexed by individual bytes, ASCII style.

  • That led to issues like this, where string_split would split

  • Unicode characters below the character boundary,

  • and substr would index by bytes instead of characters.

  • However, now that we've added Unicode support to TensorFlow,

  • we can correctly handle multi-byte characters.

  • unicode_split now splits into proper triplets, and substr,

  • with the UTF8_CHAR tag, indexes by UTF-8 characters.
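
A minimal sketch of these two ops (the example string is illustrative):

    import tensorflow as tf

    text = tf.constant("自然语言")  # four characters, 12 bytes in UTF-8

    # Splits into one byte string per character, not per byte.
    chars = tf.strings.unicode_split(text, "UTF-8")
    # -> [b'\xe8\x87\xaa', b'\xe7\x84\xb6', b'\xe8\xaf\xad', b'\xe8\xa8\x80']

    # With unit="UTF8_CHAR", pos and len count characters, not bytes.
    first_two = tf.strings.substr(text, pos=0, len=2, unit="UTF8_CHAR")
    # -> b'\xe8\x87\xaa\xe7\x84\xb6' (the first two characters)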

  • In addition to string splitting, TensorFlow now

  • supports many other Unicode-aware string

  • operations, from Unicode encoding and decoding

  • to string length analysis.
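
A few of these ops in action (a sketch; the string is illustrative):

    import tensorflow as tf

    text = tf.constant("自然语言")

    # Character-aware string length (4 characters, not the 12-byte length).
    tf.strings.length(text, unit="UTF8_CHAR")  # -> 4
    tf.strings.length(text)                    # -> 12 (default unit is bytes)

    # Decode to Unicode code points, and re-encode back to a string.
    codepoints = tf.strings.unicode_decode(text, "UTF-8")
    # -> [33258, 28982, 35821, 35328]
    tf.strings.unicode_encode(codepoints, "UTF-8")  # -> the original string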

  • For the second part of this presentation,

  • I'd like to introduce a new tensor type, RaggedTensors,

  • that we designed to handle text and other variable length

  • sequences.

  • RaggedTensors are a native representation

  • for sequences of varying shape.

  • Here you can see a RaggedTensor containing three batch items.

  • The first, a tensor with two strings,

  • the second, a tensor with four strings, and the third,

  • a tensor with one string, all without any additional padding

  • or user-facing logic.
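
A sketch of such a tensor, built with tf.ragged.constant (the strings are illustrative):

    import tensorflow as tf

    # Three batch items with 2, 4, and 1 strings; no padding required.
    rt = tf.ragged.constant([["Hi", "there"],
                             ["What", "time", "is", "it"],
                             ["Goodbye"]])
    print(rt.shape)  # (3, None): the second dimension is ragged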

  • RaggedTensors are different from SparseTensors in one key way.

  • SparseTensors make the assumption

  • that the underlying dense tensor is regularly shaped

  • and unmentioned values are missing.

  • RaggedTensors, on the other hand, make no such assumption.

  • Here, for instance, the SparseTensor

  • interprets the first batch element as John, null, null,

  • while the RaggedTensor interprets it as simply John.
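
A sketch of the contrast (the words are illustrative):

    import tensorflow as tf

    # The same data as a SparseTensor and as a RaggedTensor.
    sp = tf.sparse.SparseTensor(
        indices=[[0, 0], [1, 0], [1, 1], [1, 2]],
        values=["John", "a", "big", "dog"],
        dense_shape=[2, 3])
    tf.sparse.to_dense(sp)
    # -> [[b'John', b'', b''],        # "missing" values are filled in
    #     [b'a', b'big', b'dog']]

    rt = tf.ragged.constant([["John"], ["a", "big", "dog"]])
    # -> [[b'John'], [b'a', b'big', b'dog']]   # the first row is simply "John"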

  • A RaggedTensor can contain any number of irregular dimensions.

  • Here, for instance, we have a three-dimensional RaggedTensor

  • that represents every character in every token

  • in a batch of three sequences.

  • There are variable numbers of tokens

  • per sequence and variable numbers of characters

  • per token.

  • But with RaggedTensors, you don't

  • need to worry about maximum sizes, padding,

  • or anything else.
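
A sketch of a tensor with two ragged dimensions (the token strings are illustrative):

    import tensorflow as tf

    tokens = tf.ragged.constant([["Hi", "there"], ["What", "time"], ["Bye"]])

    # Splitting each token into characters adds a second ragged dimension:
    # a 3-D RaggedTensor with [batch, token, character] dimensions.
    chars = tf.strings.unicode_split(tokens, "UTF-8")
    print(chars.ragged_rank)  # 2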

  • RaggedTensors are a native TensorFlow representation

  • for any varying length sequence of data,

  • from words to images and beyond.

  • You could imagine using RaggedTensors

  • to contain the set of still frames

  • in a batch of videos, where each video is a different length.

  • So how do you use RaggedTensors?

  • Let's start with building them.

  • To create a RaggedTensor, you'll need a flat tensor

  • of values and some specification of how to split

  • those values into batch items.
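
A sketch of the factory methods (the values and split points are illustrative):

    import tensorflow as tf

    values = tf.constant(["Hi", "there", "What", "time", "is", "it", "Bye"])

    # row_splits says where each batch item begins and ends in `values`:
    # item 0 is values[0:2], item 1 is values[2:6], item 2 is values[6:7].
    rt = tf.RaggedTensor.from_row_splits(values=values, row_splits=[0, 2, 6, 7])

    # Equivalent constructions from row lengths or per-value row ids:
    tf.RaggedTensor.from_row_lengths(values, row_lengths=[2, 4, 1])
    tf.RaggedTensor.from_value_rowids(values, value_rowids=[0, 0, 1, 1, 1, 1, 2])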

  • Once you have a RaggedTensor, you

  • can perform standard tensor operations

  • on it, like concatenation and slicing,

  • even within irregular dimensions.
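
For example (illustrative values):

    import tensorflow as tf

    rt1 = tf.ragged.constant([["Hi", "there"], ["Bye"]])
    rt2 = tf.ragged.constant([["What", "time"], ["is", "it", "now"]])

    tf.concat([rt1, rt2], axis=0)  # stack batch items: 4 rows
    tf.concat([rt1, rt2], axis=1)  # join row by row, despite differing lengths
    # -> [[b'Hi', b'there', b'What', b'time'], [b'Bye', b'is', b'it', b'now']]

    rt1[1:]     # slicing works as usual...
    rt1[:, :1]  # ...including within the ragged dimension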

  • RaggedTensors are natively supported

  • by over 100 TensorFlow core ops ranging from math ops

  • through string handling to reductions.
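
A few examples (illustrative values):

    import tensorflow as tf

    rt = tf.ragged.constant([[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]])

    tf.add(rt, 1)               # elementwise math -> [[2, 3], [4, 5, 6], [7]]
    tf.reduce_mean(rt, axis=1)  # per-row reduction -> [1.5, 4.0, 6.0]

    words = tf.ragged.constant([["Hi", "there"], ["Bye"]])
    tf.strings.length(words, unit="UTF8_CHAR")  # string ops -> [[2, 5], [3]]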

  • And if you need to operate on each value in a RaggedTensor,

  • we provide a native map function.

  • You can use this to apply ops or even entire subgraphs

  • to every value in a RaggedTensor.
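
A minimal sketch of map_flat_values:

    import tensorflow as tf

    rt = tf.ragged.constant([[1.0, 2.0], [3.0, 4.0, 5.0]])

    # Applies the op to the underlying flat values and keeps the
    # ragged structure intact.
    tf.ragged.map_flat_values(tf.math.square, rt)
    # -> [[1.0, 4.0], [9.0, 16.0, 25.0]]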

  • To illustrate how to use RaggedTensors in a model,

  • let's consider using a bag of character level embeddings

  • to create a token level embedding.

  • We start by taking a RaggedTensor of tokens

  • separated by batch and applying unicode_decode, a new op that

  • outputs a RaggedTensor of Unicode code points separated

  • by batch and token.

  • We can then use map_flat_values to get an embedding

  • for each of these code points.

  • Now, char_embedding is a four-dimensional RaggedTensor

  • with batch, token, character, and embedding dimensions.

  • We can convert it into a standard four-dimensional

  • tensor, reshape it so that it is token-major,

  • run a convolution over each character in each token,

  • then reshape it back into a dense 4-D tensor with batch,

  • token, character, and embedding dimensions.

  • That 4-D dense tensor can be converted back

  • into a 4-D RaggedTensor, which removes any padding.

  • This RaggedTensor can be reduced, via reduce_mean,

  • to create per token embeddings.

  • At the end, we have a tensor of embeddings, one for each token,

  • built from characters without any extraneous padding.
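
A simplified end-to-end sketch of this pipeline. The token strings, vocabulary size, modulo bucketing, and the masked mean in the final step are assumptions of the sketch; in particular, the explicit mask stands in for the dense-to-ragged round trip described above:

    import tensorflow as tf

    # Hypothetical batch of tokenized sentences (the strings are illustrative).
    tokens = tf.ragged.constant([["Hi", "there"],
                                 ["What", "time", "is", "it"]])

    # 1. Decode each token into Unicode code points: a 3-D RaggedTensor
    #    with [batch, token, character] dimensions.
    char_ids = tf.strings.unicode_decode(tokens, "UTF-8")

    # 2. Embed each code point with map_flat_values. The table size and
    #    the modulo bucketing are assumptions made for this sketch.
    vocab_size, embed_dim = 256, 8
    table = tf.random.normal([vocab_size, embed_dim])
    char_embedding = tf.ragged.map_flat_values(
        lambda ids: tf.nn.embedding_lookup(table, ids % vocab_size), char_ids)

    # 3. Pad to a dense 4-D tensor [batch, token, character, embedding],
    #    reshape so it is token-major, and convolve over each token's characters.
    dense = char_embedding.to_tensor()
    b, t, c = tf.shape(dense)[0], tf.shape(dense)[1], tf.shape(dense)[2]
    token_major = tf.reshape(dense, [b * t, c, embed_dim])
    convolved = tf.keras.layers.Conv1D(embed_dim, kernel_size=3,
                                       padding="same")(token_major)
    dense_4d = tf.reshape(convolved, [b, t, c, embed_dim])

    # 4. Average over the real (unpadded) characters of each token.
    char_lens = char_ids.row_lengths(axis=2).to_tensor()  # [batch, token]
    mask = tf.sequence_mask(char_lens, maxlen=c, dtype=tf.float32)
    summed = tf.reduce_sum(dense_4d * mask[..., tf.newaxis], axis=2)
    counts = tf.maximum(tf.cast(char_lens, tf.float32), 1.0)[..., tf.newaxis]

    # One embedding per token, with the padded token slots dropped again.
    token_embedding = tf.RaggedTensor.from_tensor(
        summed / counts, lengths=tokens.row_lengths())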

  • For more information, you can take a look at the tutorials

  • available here.

  • Please try them out and give your feedback on GitHub.

  • Thank you.

  • [MUSIC PLAYING]
