[MUSIC PLAYING]
PAIGE BAILEY: Hi, my name is Paige Bailey.
And I am here today to deliver the talk of Karmel
Allison and Yash Katariya, who unfortunately couldn't be here
today.
But for the purposes of this talk,
just imagine that I am Karmel.
So to get started--
my son is in kindergarten.
And he's just starting to learn to read.
And one of my favorite parts of this process
is that even before he could reliably decode words,
he started to produce them.
And, of course, doing what I do, I
can't help but be amazed at how similar his learning
process is to neural nets.
The rule set for spelling and grammar
is so complex in English that we rarely present
a set of instructions that he can follow.
Instead, he starts almost randomly with letters
and sounds.
And then he gets feedback from me
and from other readers in his life which he incorporates.
Some things he memorizes.
Some are lucky guesses.
And somehow over months of learning,
he started to form a consistently interpretable
mental model of written words.
And he produces mostly understandable text.
So here you can see, "I was eating breakfast
with my cousins and my sister."
I've been particularly fascinated by my son's learning
here, because it happens to align with what many of us
have started to call an NLP revolution.
We have enough data and enough tools
that we've started to rapidly push the cutting
edge in natural language understanding and related
tasks.
I've heard it said that, with text today,
we're at a Cambrian explosion, like when ResNet was first published.
We're at the beginning of this faster-than-expected progression of language models.
So we're going to take advantage of all of this research
and tooling.
And we're going to teach a simple neural net to generate
the next word of a phrase.
And we'll use this to tell a robot children's story.
So the first thing we need is data.
And we're going to take advantage
of the children's book test corpus released by Facebook.
This data set is a set of Creative Commons
children's books that have been converted into a series
of passages and fill-in-the-blank style
questions about each passage.
For this model, we just need the raw book text, which looks like what you see here.
As an aside, these books are out of copyright,
which means they're often old.
And that means the corpus is full of literature
that's problematic by today's standards.
I'm going to gloss over that for the sake of this talk.
But if you go and actually use this data set,
please do consider what cultural norms
you might be teaching your machine learning models.
So first, we're going to load the data.
Now that we have a corpus, we can load it into our Python interpreter and start playing with it.
Here, we use the text line data set to load the data.
And we can simply print out a few lines to see what we have.
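Roughly, that loading step might look like the sketch below; the file path is just a placeholder, not a path from the talk.

```python
import tensorflow as tf

# Load the raw children's book text, one line per dataset element.
# ("cbt_train.txt" is a placeholder path for the downloaded corpus.)
lines = tf.data.TextLineDataset("cbt_train.txt")

# Peek at a few raw lines to see what cleaning is needed.
for line in lines.take(5):
    print(line.numpy().decode("utf-8"))
```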
So tell me what you notice here.
It looks like we have some cleaning work to do.
So using our data set much like we would use a NumPy array
or Pandas data frame, we can filter
and map transformations across it.
Here, I'm dropping those pesky book titles
and filtering out punctuation within each row.
And after that transforming, I can print out a few lines
to check and make sure the data looks as expected.
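As a rough sketch of that cleaning step, something like the following would work; the title marker and the punctuation pattern here are assumptions, not the exact rules from the talk.

```python
import tensorflow as tf

lines = tf.data.TextLineDataset("cbt_train.txt")  # as loaded above

def is_not_title(line):
    # Drop the book-title header rows (assumed to start with "_BOOK_TITLE_").
    return tf.logical_not(tf.strings.regex_full_match(line, r"_BOOK_TITLE_.*"))

def strip_punctuation(line):
    # Keep only letters, apostrophes, and spaces.
    return tf.strings.regex_replace(line, r"[^a-zA-Z' ]", "")

clean_lines = lines.filter(is_not_title).map(strip_punctuation)

for line in clean_lines.take(3):
    print(line.numpy().decode("utf-8"))
```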
Now, I have a data set where each row is
a sentence of arbitrary length.
And I've decided I want to train a windowed model to predict
a new word, given some words to start.
So I need to take these data set rows and make them all equal length.
Once again, without leaving TensorFlow data sets,
I can split, flatten, and regroup my rows
so that each row is a set of 11 words.
But I want that last word in each row
to be a separate label.
So I just define a simple row-wise function to pop out my labels.
And voila, we have pairs of examples and labels.
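Continuing the sketch above, the split-flatten-regroup step and the label-popping function might look something like this; the 10-word-plus-label layout comes from the talk, everything else is a placeholder.

```python
import tensorflow as tf

# Split each sentence into words, flatten to one word per element,
# and regroup into fixed-size rows of 11 words.
words = clean_lines.map(tf.strings.split)
flat_words = words.unbatch()
windows = flat_words.batch(11, drop_remainder=True)

def split_example_label(window):
    # First 10 words are the example; the 11th word is the label to predict.
    return tf.strings.reduce_join(window[:-1], separator=" "), window[-1]

examples = windows.map(split_example_label)

for example, label in examples.take(2):
    print(example.numpy(), "->", label.numpy())
```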
But there's still a problem with this data.
Why can't I just feed this into a dense network
or any arbitrary model?
Well, it's words.
Machine learning models don't speak English.
And, of course, the problem with this data set is that we need numbers-- only numbers, not lines of text, just numbers.
So we're going to need to transform our input sentences
into numeric representations.
And the way that we do this is with preprocessing
layers, which are highly exciting
and a new addition to TensorFlow 2.
So preprocessing layers were recently
reviewed as part of the Keras API specification.
And they are a set of layers that take care of data transformations that are outside the training path.
These follow the same APIs as normal layers
and can be called and composed just like layers.
They play nicely with TensorFlow data sets as well.
So you can parallelize preprocessing transformations.
And importantly, just like normal Keras layers, these become part of your model and therefore get serialized in the SavedModel and become part of inference-time prediction, which is critical to minimizing training/serving skew.
This is the set of preprocessing layers
that is already complete.
They are experimental in 2.2, as we ensure that the APIs match
how you use them.
So please check them out for all of your preprocessing needs.
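For this example, the relevant preprocessing layer is TextVectorization. A minimal sketch, assuming TensorFlow 2.2 where it still lives under the experimental namespace, with placeholder vocabulary and sequence sizes:

```python
import tensorflow as tf

# In TF 2.2 the layer lives under the experimental preprocessing namespace.
TextVectorization = tf.keras.layers.experimental.preprocessing.TextVectorization

vectorize = TextVectorization(max_tokens=5000, output_sequence_length=10)

# adapt() builds the vocabulary from the example text (batched for efficiency).
vectorize.adapt(examples.map(lambda text, label: text).batch(64))

# Strings in, integer indices out.
print(vectorize(tf.constant(["i was eating breakfast with my cousins"])))
```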
So now we know how to get our data into the correct format.
And the next step is to build a language model that
can learn a representation of all of these words
so that we can generate text on the fly.
One of the classic models used for text translation
and generation is a sequence-to-sequence model.
A sequence-to-sequence model typically has two parts--
an encoder that uses RNN blocks to encode
input data and a decoder that borrows state from the encoder
in order to correctly predict the target outputs.
Now, sequence-to-sequence models have some complicated
moving parts.
It's not a simple feed-forward network,
and there are a lot of parameters to keep track of.
But luckily, we don't have to go it alone here.
TensorFlow Addons is a community-maintained repository built on top of TensorFlow 2 that provides especially complex or experimental layers, optimizers, and other utilities.
And one such utility is the seq2seq package,
which provides a number of layers and classes
that make building sequence-to-sequence models
much easier.
And you'll see me use these throughout this example.
Because the architecture of the sequence-to-sequence model is fairly complex and requires special state passing, we're going to subclass the Keras base Model class and build our network explicitly.
We start here with the text vectorization layer we've already discussed, since that's what we're going to feed our input data through to convert it to indices.
After we vectorize the inputs, we're going to pass them through the encoder blocks-- first an embedding, then an LSTM.
And you'll note that in our init function here,
we're just configuring the layers.
And we're not actually passing any data through them yet.
In Keras, each layer has two stages--
the construction, when you parametrize
your layer, as seen here, and the calling
of the layer, which will come later
when we pass our data through.
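A rough sketch of that constructor is below. The class name, layer names, and sizes are placeholders chosen for this example, not the speaker's exact code.

```python
import tensorflow as tf

class EncoderDecoder(tf.keras.Model):
    def __init__(self, vectorize_layer, vocab_size=5000,
                 embedding_dim=256, rnn_units=512):
        super().__init__()
        # Preprocessing: strings -> integer indices (already adapted above).
        self.vectorize = vectorize_layer
        # Encoder: an embedding followed by an LSTM that also returns its state.
        self.encoder_embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.encoder_lstm = tf.keras.layers.LSTM(
            rnn_units, return_sequences=True, return_state=True)
        # Note: only configuration happens here; no data flows through yet.
```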
The decoder is somewhat more complicated.
But here, we can leverage the TensorFlow Addons sampler and decoder to set up the decoder LSTM and connect it to a projection layer, which is our final dense layer that maps to the vocabulary we want to predict, plus two tokens for padding and out-of-vocabulary words.
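The Addons pieces in question might be wired up roughly like this, shown standalone with placeholder sizes; in the real model these would just be more attributes configured in the same init.

```python
import tensorflow as tf
import tensorflow_addons as tfa

rnn_units, vocab_size = 512, 5000

decoder_lstm_cell = tf.keras.layers.LSTMCell(rnn_units)
# Final dense layer mapping to the vocabulary, plus two reserved tokens
# (padding and out-of-vocabulary).
projection = tf.keras.layers.Dense(vocab_size + 2)
# The TrainingSampler feeds ground-truth tokens to the decoder during training.
sampler = tfa.seq2seq.TrainingSampler()
decoder = tfa.seq2seq.BasicDecoder(decoder_lstm_cell, sampler,
                                   output_layer=projection)
```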
The final set of layers we will set up
are a pair of attention layers.
There is a large and growing field of attention research in machine learning.
But given the time constraints of this talk,
I'm only going to give you a very hand-wavy explanation and say, attention is a technique that allows the model to track intermediate states as it steps along sequences.
And it will allow the model to give more weight
to certain time steps of a sequence
when predicting the final word.
Here, we use a simple dense attention
layer that comes with tf.keras.
And you'll see how we connect this between our encoder
and decoder in a minute.
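That dot-product attention layer can be exercised on its own like this; the shapes are placeholders, with the decoder state acting as the query and the encoder outputs as the value.

```python
import tensorflow as tf

attention = tf.keras.layers.Attention()  # Luong-style dot-product attention

query = tf.random.normal([4, 1, 512])   # (batch, 1, units): decoder query
value = tf.random.normal([4, 10, 512])  # (batch, timesteps, units): encoder outputs
context = attention([query, value])     # (batch, 1, units): weighted context
print(context.shape)
```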
So we have a lot of state to pass between our encoder
and decoder, which means we can't just use
the standard Keras fit call.
And one of the things I'm most excited about in TensorFlow 2
is that we have refactored and cleaned
up the Keras training loop.
And we've made it much more modular.
So instead of overriding the entire fit loop
or throwing it out altogether, I'm
just going to define a single forward pass
from my model in this special function,
train_step, and that is going to get called
by model.fit with each step.
So we override train_step in our encoder-decoder model.
That train_step is going to get one batch of data at a time,
so we just need to define the forward pass for that one
batch.
And the first thing we do here is unpack our data
and separate the example from the label.
You'll note that we call our own vectorization layer here
to ensure that our input strings get correctly
transformed to indices.
Next, still inside our single training step,
we're going to record our forward pass under a gradient tape. Anything that needs to be backpropagated through should go under here.
So while the vectorization layer, our preprocessing layer, stays outside the tape, our encoder embedding, LSTM, and so forth all belong under the tape.
And here we pass our inputs through the set
of layers we defined on our init to encode them.
We also set up our attention layers
to track the intermediate state coming out
of the encoding layers.
And next, we decode, which is to say
we try to predict our targets using
the state from our encoder.
The decoder here will, over the many epochs it runs for, train its own weights separately from the encoder weights.
And in concert, the encoder and decoder
will learn to predict text based on the outputs.
And now that we've run all of our layers
necessary to form the forward pass, we can compute the loss
and collect the outputs of this step
so that they can be optimized.
The Keras model takes care of collecting variables
and setting the optimizer, so we just
choose how we want to pass things through here.
As the final step, we collect and return
the metrics we set in our model.
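Pulled together, the overridden train_step might look roughly like the sketch below. It reuses the hypothetical layer names from the earlier snippets and condenses the decoder to a single step, so it shows the shape of the forward pass rather than the speaker's exact code.

```python
import tensorflow as tf

class EncoderDecoder(tf.keras.Model):
    # ... __init__ as sketched earlier: self.vectorize, self.encoder_embedding,
    # self.encoder_lstm, self.attention, self.decoder_lstm_cell, self.projection ...

    def train_step(self, data):
        text, target_word = data               # one batch of (example, label)
        # Preprocessing stays outside the tape: strings -> integer indices.
        tokens = self.vectorize(text)
        targets = self.vectorize(target_word)

        with tf.GradientTape() as tape:
            # Encoder forward pass.
            embedded = self.encoder_embedding(tokens)
            enc_out, state_h, state_c = self.encoder_lstm(embedded)
            # Attention over the encoder outputs, queried by the final state.
            context = self.attention([tf.expand_dims(state_h, 1), enc_out])
            # Decoder forward pass (condensed to a single decoding step here).
            dec_out, _ = self.decoder_lstm_cell(
                tf.squeeze(context, 1), states=[state_h, state_c])
            logits = self.projection(dec_out)
            loss = self.compiled_loss(targets[:, 0], logits)

        # Backprop and optimizer step over the variables Keras collected.
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.compiled_metrics.update_state(targets[:, 0], logits)
        return {m.name: m.result() for m in self.metrics}
```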
The next step is to pick an optimizer, loss, and accuracy metric for our model.
And these are going to govern the actual training
and optimization process.
And we select them from a bunch of independently parametrizable
options that are built into tf.keras, and then we train.
It might take a while to converge. So I threw in a ModelCheckpoint callback to make sure I can save my model weights as I go.
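Compiling and training might then look like this sketch; the loss, epoch count, batch size, and checkpoint path are placeholders.

```python
import tensorflow as tf

model = EncoderDecoder(vectorize)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# Save weights after every epoch so long runs can be resumed.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "ckpt/weights_{epoch}", save_weights_only=True)

model.fit(examples.batch(64), epochs=45, callbacks=[checkpoint])
```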
We can monitor progress as we train.
And our goal is to reach some degree of convergence
of the reported accuracy.
And with this model, this happens somewhere around 45 epochs at about 70% accuracy. And 70% accuracy is pretty good.
But we might ask ourselves, can we do better?
And the parameters I chose when training
were thoroughly arbitrary, copied
from somewhere on the internet.
Maybe the models should be bigger or smaller.
Typically, we spend some time tuning these model parameters
to ensure that we have the best results for a given model
architecture.
We call this process hyperparameter tuning.
And notably, easy hyperparameter tuning
is one of the most requested features for Keras.
And good news, we have a package for that.
KerasTuner was released last October.
And it works with TensorFlow, Keras, and even scikit-learn.
It allows you to create hypermodels that encode tunable parameters, distribute the training, and save the model for others to use.
So let's take it for a quick spin in order to tune some of our model's parameters.
For example, let's say we wanted to tune the number of RNN units
in our model.
We can import the pip-installable KerasTuner package and then define a function that takes a hyperparameter object and uses it to build a compiled model.
Inside this function, instead of passing in a fixed int for the RNN units, we can use the magic Int selector object, which will allow us to try any integer in this range.
There are a number of different selectors you can use here,
including floating point numbers and enums.
We can then define a tuner algorithm
for searching our hyperparameter space
and use the tuner to build and fit our model intelligently
across that space.
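A minimal sketch of that tuning setup is below; the import name matches the original keras-tuner package, and the search range, objective, and trial count are placeholders.

```python
import tensorflow as tf
import kerastuner as kt  # pip install keras-tuner

def build_model(hp):
    # Let the tuner pick the RNN width from a range instead of a fixed int.
    units = hp.Int("rnn_units", min_value=256, max_value=1024, step=256)
    model = EncoderDecoder(vectorize, rnn_units=units)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="accuracy", max_trials=4)
tuner.search(examples.batch(64), epochs=5)

best_hps = tuner.get_best_hyperparameters(1)[0]
print(best_hps.get("rnn_units"))
```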
Because we overrode train_step with our custom functionality,
everything works within the Keras ecosystem.
And the tuner will be able to call .fit,
just like we did to get the correct training behavior.
The tuner will run through all of the different hyperparameter
combinations you've configured.
And it even works with Colab and prints out trial information,
as you see here.
And after a few tuning sessions, we
see that, in fact, the best RNN unit count is 1,024.
And rerunning with 1,024 units, we improve our accuracy.
And now we have an even better model
that gets above 90% accuracy.
So now that we have a trained model, of course,
our goal was not just to create it,
but to actually generate text.
So how do we take the model we've built
and use it to write a sentence?
The first step is to use our model to predict one word given an input phrase, which is exactly what we trained the model to do.
Just as we did with train_step, we can override predict_step to define the operations on just one batch of data.
In our predict_step, we run the inputs through the same encoder and decoder we saw during training, but now with fixed weights.
We can also throw in some custom logic here.
And we allow the model to predict from the top N choices instead of always taking the most likely word.
And we also convert back to the actual English word
rather than just returning the numeric indices.
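A condensed sketch of that overridden predict_step, using the same hypothetical layer names and a top-5 sampling choice as a placeholder:

```python
import tensorflow as tf

class EncoderDecoder(tf.keras.Model):
    # ... __init__ and train_step as sketched earlier ...

    def predict_step(self, data):
        tokens = self.vectorize(data)
        embedded = self.encoder_embedding(tokens)
        enc_out, state_h, state_c = self.encoder_lstm(embedded)
        context = self.attention([tf.expand_dims(state_h, 1), enc_out])
        dec_out, _ = self.decoder_lstm_cell(
            tf.squeeze(context, 1), states=[state_h, state_c])
        logits = self.projection(dec_out)

        # Sample from the top N most likely words rather than the argmax.
        top_logits, top_indices = tf.math.top_k(logits, k=5)
        sampled = tf.random.categorical(top_logits, num_samples=1)
        word_ids = tf.gather(top_indices, sampled, batch_dims=1)

        # Map indices back to English words via the vectorization vocabulary
        # (assumes the projection width lines up with the vocabulary here).
        vocab = tf.constant(self.vectorize.get_vocabulary())
        return tf.gather(vocab, tf.squeeze(word_ids, -1))
```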
We can try this out.
And we see that, indeed, we produce the next predicted word
correctly.
But, of course, we don't want single words.
We want a whole sentence. So we can go further and define a custom predict function that takes a single string and then generates one word at a time, continuously appending to that starting string to generate a much longer string.
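One way to write that loop, building on the sketches above (the starting phrase and output length are arbitrary):

```python
import tensorflow as tf

def generate(model, start, num_words=20):
    sentence = start
    for _ in range(num_words):
        # Predict the next word from the last ten words of the running sentence.
        window = " ".join(sentence.split()[-10:])
        next_word = model.predict(tf.constant([window]))[0]
        sentence += " " + next_word.decode("utf-8")
    return sentence

print(generate(model, "once upon a time there was"))
```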
And lo and behold, we can generate
vaguely meaningful statements.
It's not perfect.
And it's entirely unpunctuated.
But it can be a lot of fun.
And unlike Karmel's six-year-old,
it doesn't lie and happens to like telling funny stories
and doesn't get tired.
You can see that there is a little bit of output there that is vaguely human-understandable, even though the model is quite simple and the data is relatively small and constrained.
And indeed, the model we just built
is the very first baby step of text processing.
And it is heavily restricted by having to fit onto slides.
But these same tools and techniques
are used to build some amazing large-scale models that
run at Google scale.
So if you're interested in moving from just learning, to sending emails, to going big with text in TensorFlow 2,
check out some of the code that researchers and engineers
at Google have released, including
the tf.text repository and KerasBert,
as well as keras-transformer, which
is an example of a truly cutting-edge NLP model.
You'll also hear more about TFHub in the next talk,
so stay tuned.
So to summarize, use TensorFlow data sets and preprocessing
layers to transform your inputs.
Check out TensorFlow Addons for special-use layers and utilities. Subclass Keras models for complicated training pipelines.
Tune your hyperparameters with KerasTuner.
And don't forget to check out the entire ecosystem
of NLP tools built on top of TensorFlow.
So thank you so much to the authors of the presentation,
Karmel and Yash, to the illustrator
of the presentation, the artist responsible for all
of these drawings, and to all of you who are listening online.
Very excited to talk about all of these tf.keras and tf.text
improvements.
Thank you.
[MUSIC PLAYING]