  • [MUSIC PLAYING]

  • JIAN LI: Hello, everyone.

  • My name's Jian.

  • I'm a software engineer on the TensorFlow team.

  • Today, my colleague Pulkit and I will

  • be talking about the TensorFlow model optimization toolkit.

  • Model optimization means transforming your machine

  • learning models to make them efficient to execute.

  • That means faster computation as well as lower memory,

  • storage, and battery usage.

  • And it is focused on inference instead of training.

  • And because of the above-mentioned benefits,

  • optimization can unlock use cases

  • that are otherwise impossible.

  • Examples include speech recognition, face unlock,

  • object detection, music recognition, and many more.

  • The model optimization toolkit is a suite

  • of TensorFlow and TensorFlow Lite tools

  • that make it simple to optimize your model.

  • Optimization is an active research area

  • and there are many techniques.

  • Our goal is to prioritize the ones that

  • are general across model architectures

  • and across various hardware accelerators.

  • There are two major techniques in the toolkit, quantization

  • and pruning.

  • Quantization simulates float calculations in lower bits,

  • and pruning forces zeros into the connections.

  • Today we are going to focus on quantization

  • and we'll briefly talk about pruning.

  • Now let's take a closer look at quantization.

  • Quantization is a general term describing technologies

  • that reduce the numerical precision of static parameters

  • and execute the operations in lower precision.

  • Precision reduction makes the model smaller,

  • and a lower precision execution makes the model faster.

  • Now let's dig a bit more into how we perform quantization.

  • As a concrete example, imagine we have

  • a tensor with float values.

  • In most cases, we are wasting most of the representation

  • space in the float number line.

  • If we can find a linear transformation that

  • maps the float value onto int8, we can reduce the model size

  • by a factor of four.

  • Then computations can be carried out between int8 values,

  • and that is where the speed up comes from.
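
To make the linear transformation concrete, here is a minimal NumPy sketch of affine quantization to int8. The tensor values, the derived scale, and the zero point are made up for illustration and are not the exact TensorFlow Lite kernels.

```python
import numpy as np

# Illustrative affine (linear) quantization of a float tensor to int8.
# The values below are made up for this example.
x = np.array([-1.8, -0.5, 0.0, 0.7, 2.3], dtype=np.float32)

qmin, qmax = -128, 127
scale = (x.max() - x.min()) / (qmax - qmin)      # float step per int8 bucket
zero_point = int(round(qmin - x.min() / scale))  # int8 value that represents 0.0

x_int8 = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
x_restored = (x_int8.astype(np.float32) - zero_point) * scale  # approximate reconstruction
```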

  • So there are two main approaches to quantization:

  • post-training and during training.

  • Post-training operates on an already trained model

  • and is built on top of the TensorFlow Lite converter.

  • During-training quantization performs additional weight

  • fine-tuning, and since training is required,

  • it is built on top of the TensorFlow Keras API.

  • Different techniques offer a trade-off

  • between ease of use and model accuracy.

  • The easiest technique to use is dynamic range

  • quantization, which doesn't require any data.

  • There can be some accuracy loss, but we get a two to three times

  • speedup.

  • Because floating point calculation

  • is still needed for the activation,

  • it's only meant to run on CPU.
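
As a minimal sketch of what the speaker describes, dynamic range quantization only needs the optimization flag on the TensorFlow Lite converter; the saved-model path below is a placeholder, and no calibration data is involved.

```python
import tensorflow as tf

# Dynamic range quantization: no calibration data required.
# "my_saved_model" is a placeholder path to an already trained model.
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)
```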

  • If we want extra speed up on CPU or want

  • to run the model on hardware accelerators,

  • we can use integer quantization.

  • It runs a small set of unlabeled calibration data

  • through the model to collect the min-max ranges of the activations.

  • This removes the floating point calculation

  • in the compute graph, so there is a speedup on CPU.

  • But more importantly, it allows the model

  • to run on hardware accelerators such as DSP and TPU,

  • which are faster and more energy efficient than CPU.

  • And if accuracy is a concern, we can

  • use Quantization Aware Training to fine-tune the weights.

  • It has all the benefits of integer quantization,

  • but it requires training.

  • Now let's have an operator-level breakdown of post-training

  • quantization.

  • Dynamic range quantization is fully supported

  • and integer quantization is supported

  • for most of the operators.

  • The missing piece is the recurrent neural network

  • support, and that blocks use cases

  • such as speech and language where context is needed.

  • To unblock those use cases, we have recently

  • added recurrent neural network quantization

  • and built a turnkey solution through the post-training API.

  • RNN models built with Keras 2.0 can be converted and quantized

  • with the post-training API.

  • This slide shows the end-to-end workflow

  • in the post-training setup.

  • We create the TensorFlow Lite converter

  • and load the saved RNN model.

  • We then set the post-training optimization flags

  • and provide calibration data.

  • After that, we are able to call the convert method to convert

  • and quantize the model.

  • This is the exact same API and workflow for models

  • without RNN, so there is no API change for the end users.
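
Here is a minimal sketch of that workflow; the saved-model path, the input shape, and the representative_dataset generator are placeholders, and the same code applies whether or not the model contains RNN layers.

```python
import tensorflow as tf

def representative_dataset():
    # Placeholder calibration data: a small set of unlabeled samples
    # shaped like the model's input, e.g. [batch, time, features].
    for _ in range(100):
        yield [tf.random.normal([1, 28, 80])]

# Create the TensorFlow Lite converter and load the saved RNN model.
converter = tf.lite.TFLiteConverter.from_saved_model("my_rnn_saved_model")

# Set the post-training optimization flag and provide calibration data.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset

# Convert and quantize the model.
tflite_quant_model = converter.convert()
```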

  • Let's take a look at the challenges

  • of the RNN quantization.

  • Quantization is a lossy transformation.

  • An RNN cell has a memory state that persists

  • across multiple timesteps, so quantization errors

  • can accumulate in both the layer direction and the time

  • direction.

  • An RNN cell contains many calculations,

  • and determining the number of bits and the scale

  • is a global optimization problem.

  • Also, quantized operations are restricted

  • by hardware capabilities.

  • Some operations are not allowed on certain hardware platforms.

  • We solved these challenges and created a quantization spec

  • for RNNs.

  • The full spec is quite complicated,

  • and this slide shows this spec by zooming

  • into one of the LSTM gates.

  • As I mentioned, there are many calculations in one cell.

  • To balance performance and accuracy,

  • we keep calculations in 8 bits as much as possible

  • and only go to higher bits when required for accuracy.

  • As you can see from the diagram, matrix-related

  • operations are in 8 bits, and vector-related operations

  • are a mixture of 8 and 16 bits.

  • And please note, the use of higher bits

  • is only internal to the cell.

  • The input and output activations for the RNN cell are all 8 bits.

  • Now that we have seen the details of RNN quantization,

  • let's look at the accuracy and the performance.

  • This table shows some published accuracy numbers

  • on a few data sets.

  • It's a speech recognition model that consists

  • of 10 layers of quantized LSTM.

  • As you can see, the integer quantized model

  • has the same accuracy as the dynamic range quantized model,

  • and the accuracy loss is negligible

  • compared with the float case.

  • Also, this is a pruned model, so RNN quantization

  • works with pruning as well.

  • As expected, there is a four-times model size reduction

  • because static weights are quantized to 8 bits.

  • Performance-wise, there is a two to four times

  • speed up on a CPU and a more than 10 times speed

  • up on DSP and TPU.

  • So those numbers are consistent with the numbers

  • from other operators.

  • So here are the main takeaways.

  • TensorFlow now supports the RNN/LSTM quantization.

  • It is a turnkey solution through the post training API.

  • It enables smaller, faster, and more energy

  • efficient execution that can run on DSP and TPU.

  • There are already production models

  • that use this quantization.

  • And please check the link for more details on the use cases.

  • Looking forward, our next step will

  • be to expand quantization to other recurrent neural

  • networks, such as the GRU and SRU.

  • We also plan to add Quantization Aware Training for RNN.

  • Now I'll hand it over to my colleague Pulkit.

  • Thank you.

  • PULKIT BHUWALKA: Thanks.

  • Thanks Jian.

  • Hi, my name is Pulkit.

  • I work on the model optimization toolkit.

  • And let's talk about--

  • clicker doesn't seem to be working.

  • Sorry, can we go back a slide?

  • Yes.

  • Quantization Aware Training.

  • So Quantization Aware Training is a training time technique

  • for improving the accuracy of quantized models.

  • The way it works is that we introduce

  • some of the errors that actually happen

  • during quantized inference into the training process,

  • and that actually helps the trainer learn around

  • these errors and get a more accurate model.

  • Now let's just try to get a sense of why

  • this is needed in the first place.

  • So we know that quantized models

  • run in lower precision, and because of that,

  • it's a lossy process, and that leads to an accuracy drop.

  • And while quantized models are super fast and we want them,

  • nobody wants an inaccurate model.

  • So the goal is to kind of get the best of both worlds,

  • and that's why we have this system.

  • To get a sense of why these losses get introduced,

  • the first reason is that,

  • once we have quantized models, the parameters

  • are in lower precision.

  • So, in a sense, you have coarser information, fewer

  • buckets of information.

  • So that's where you have information representation

  • loss.

  • The other problem is that, when you're actually

  • doing these computations, you have computation loss,

  • because you're adding two coarse values instead

  • of finer buckets of values.

  • Typically, during matrix multiplication type

  • of operations, even if you're doing it at int8,

  • you accumulate these values to int32,

  • and then you rescale them back to int8,

  • so you have that rescaling loss.
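
To make the accumulate-and-rescale step concrete, here is an illustrative NumPy sketch; the scales are made up, and real quantized kernels also handle zero points and rounding more carefully.

```python
import numpy as np

# int8 matrix multiply with int32 accumulation, then rescaling back to int8.
a = np.random.randint(-128, 128, size=(4, 8), dtype=np.int8)
b = np.random.randint(-128, 128, size=(8, 3), dtype=np.int8)

acc = a.astype(np.int32) @ b.astype(np.int32)  # accumulate in int32 to avoid overflow

# Illustrative scales: output = (input_scale * weight_scale / output_scale) * acc
input_scale, weight_scale, output_scale = 0.02, 0.03, 0.05
multiplier = (input_scale * weight_scale) / output_scale
out_int8 = np.clip(np.round(acc * multiplier), -128, 127).astype(np.int8)
```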

  • The other thing is that, generally,

  • when we run these quantized models during inference,

  • there are various inference optimizations that

  • get applied to the graph, and because of that,

  • the training graph and the inference graph

  • can be subtly different, which also can potentially

  • introduce some of these errors.

  • And how do we recover lost accuracy?

  • Well, for starters, we try to make the training graph as

  • similar as possible to the inference graph

  • to remove these subtle differences.

  • And the other is that we actually

  • introduce the errors that actually happen

  • during inference, so the trainer learns around them

  • and machine learning does its magic.

  • So for example, when it comes to mimicking errors,

  • as you can see in the graph here,

  • you go from weights to lower precision.

  • So let's say if your weights are in floating point,

  • you go down to int8, and then you go back up

  • to floating point.

  • So in that sense, you've actually

  • mimicked what happens during inference when you're

  • executing at lower precision.

  • Then you actually do your computation,

  • and because both your inputs and your weights are at int8

  • and the losses have been introduced,

  • the computation happens correctly.

  • But then after the computation, you

  • add another fake quant to kind of drop

  • back to lower precision.
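
Here is a minimal sketch of that quantize-then-dequantize round trip using TensorFlow's fake-quant op; the weight values and the [-3, 3] range are made up for the example.

```python
import tensorflow as tf

# Fake-quantize a float tensor: round-trip it through 8-bit precision
# while keeping it in float, which mimics the inference-time loss.
w = tf.constant([[0.31, -1.24], [2.05, 0.07]])
w_fake_quant = tf.quantization.fake_quant_with_min_max_args(
    w, min=-3.0, max=3.0, num_bits=8)
```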

  • The other thing is we model the inference part.

  • So for example, if you noticed in the previous slide,

  • the fake quant operation came after the ReLU activation.

  • So this is one of the optimizations

  • that happens during inference, that the ReLU gets folded in.

  • And what we do is that when we're actually constructing

  • your graph, we make sure that these sorts of optimizations

  • get added in.

  • And let's look at the numbers.

  • So the numbers are pretty good.

  • So if you look at the slide, we're

  • almost as accurate as the float baseline on the various

  • models that we've tried.

  • So this is really powerful.

  • You can actually execute a model which

  • gives you nearly as good accuracy and is quantized.

  • So what's the value to users?

  • Well, you have on the one hand a simple, almost one

  • line API that you can use to quantize your model,

  • train it, convert it, and go ahead and execute it.

  • This works great for app developers, ML engineers, et

  • cetera.

  • You might want to go one step ahead,

  • and then we have a slightly more complicated API,

  • where it's like, hey, you can kind of configure

  • your quantization however you want,

  • and this would be something that's

  • quite useful to ML engineers, some researchers.

  • And if you want to go completely out there,

  • you can actually completely configure

  • quantization algorithms, schemes, different bits, et cetera,

  • whatever you want, and this provides very fertile

  • ground for researchers or hardware engineers.

  • So basically, the API philosophy is: easy is easy, hard is possible.

  • So let's look at how we do this.

  • So well, this is your standard Keras model.

  • If you want to, let's say, quantize

  • your entire model, typically you construct the model,

  • import tensorflow as tf, model.compile, model.fit,

  • go ahead, right?

  • Now, let's look at what quantizing the model

  • looks like.

  • Pretty much the same thing, right?

  • Import tensorflow_model_optimization

  • as tfmot.

  • That's the package you put in.

  • You construct your model, quantize the model,

  • and then just go ahead.

  • You do your compile fit, all of that, continue with that.
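
A minimal sketch of that whole-model flow is below; the architecture and the commented-out training data are placeholders standing in for your own model and dataset.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder Keras model standing in for your own architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# Wrap the whole model for Quantization Aware Training.
q_aware_model = tfmot.quantization.keras.quantize_model(model)

# Then continue exactly as before: compile, fit, and later convert.
q_aware_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# q_aware_model.fit(train_images, train_labels, epochs=1)  # placeholder data
```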

  • Now, you might not want to quantize the entire model.

  • Maybe you want to quantize a subset of your model,

  • because some parts of the model are either more sensitive

  • to quantization losses, or you want

  • to get the most performance out of them.

  • So you want to quantize only a part of your model.

  • And in that case, it's still pretty simple, just

  • slightly different.

  • So for example, you use quantize annotate layer.

  • You tell it which layers you want to quantize,

  • and then you apply it at the end,

  • and then you're good to go.
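
A minimal sketch of quantizing only part of a model is below; the two-layer Sequential model is a placeholder, and only the annotated Dense layer gets quantized.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer

# Placeholder model: only the annotated Dense layer will be quantized.
annotated_model = tf.keras.Sequential([
    quantize_annotate_layer(
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,))),
    tf.keras.layers.Dense(10),  # left in float
])

# Apply quantization to the annotated layers, then compile and fit as usual.
q_aware_model = tfmot.quantization.keras.quantize_apply(annotated_model)
```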

  • Beyond that, you might want to control the quantization

  • within a layer.

  • So for example, you have a particular layer,

  • but you want to control which weights you want to quantize,

  • how you want to quantize it.

  • And in that case, also, it's pretty similar API.

  • You use quantize annotate layer, but when you actually

  • pass in the layer, you also pass in a specific config,

  • and this config tells the infrastructure

  • how to actually quantize this layer.

  • And the rest of the API is the same.

  • Let's look at how you define this config.

  • So this config is largely telling us two things.

  • One is what, within that layer,

  • you want to quantize, and the other

  • is how you want to quantize it.

  • So you tell us which weight or which activation

  • you want to quantize.

  • And the other thing is you tell us--

  • you give us-- pass us a quantizer,

  • and this quantizer is basically an object

  • that encapsulates kind of the algorithm about how

  • to quantize this.

  • We give you a bunch of built-in ones,

  • but you can write your own.
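
Here is a sketch of such a config, following the QuantizeConfig interface; the class name and the particular choice of built-in quantizers are illustrative.

```python
import tensorflow_model_optimization as tfmot

LastValueQuantizer = tfmot.quantization.keras.quantizers.LastValueQuantizer
MovingAverageQuantizer = tfmot.quantization.keras.quantizers.MovingAverageQuantizer

class CustomDenseQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    """Illustrative config: says what to quantize in a Dense layer and how."""

    def get_weights_and_quantizers(self, layer):
        # Which weight to quantize, and with which quantizer (8-bit here).
        return [(layer.kernel, LastValueQuantizer(
            num_bits=8, symmetric=True, narrow_range=False, per_axis=False))]

    def get_activations_and_quantizers(self, layer):
        # Which activation to quantize, and with which quantizer.
        return [(layer.activation, MovingAverageQuantizer(
            num_bits=8, symmetric=False, narrow_range=False, per_axis=False))]

    def set_quantize_weights(self, layer, quantize_weights):
        layer.kernel = quantize_weights[0]

    def set_quantize_activations(self, layer, quantize_activations):
        layer.activation = quantize_activations[0]

    def get_output_quantizers(self, layer):
        return []

    def get_config(self):
        return {}
```

Such a config is passed as the quantize_config argument to quantize_annotate_layer, and quantize_apply is then called inside tfmot.quantization.keras.quantize_scope so the custom class can be found when the model is cloned.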

  • You might want to quantize your own layer.

  • So let's say you have a special algorithm,

  • like a fancy convolutional layer that you write,

  • and you want to quantize that as well.

  • Well, you do it almost in exactly the same way.

  • You quantize annotate your layer, you pass in a config,

  • and this config tells us how we should

  • quantize your fancy layer.

  • And again, you tell us what to quantize, how to quantize it.

  • And in this case, what do you look--

  • what you notice is that there is like a histogram quantizer,

  • and this is, let's say, a special quantizer.

  • And a special quantizer is interesting,

  • because that allows you to completely control

  • what sort of strategy you're using to quantize your model.

  • You, in this case, could use a histogram

  • to determine the range and then quantize it.

  • And that's how you would write the algorithm.

  • And it's pretty simple.

  • You just implement two methods.

  • One is build, which is basically for you to construct

  • any variables you need.

  • And then in the call method, we give you a bunch of tensors,

  • you quantize them however you wish.

  • You return us the tensors and we'll take care of the rest.
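
A sketch of such a quantizer is below, following the toolkit's Quantizer interface; the fixed [-1, 1] clipping range is an illustrative stand-in for a real histogram-based strategy.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

class FixedRangeQuantizer(tfmot.quantization.keras.quantizers.Quantizer):
    """Illustrative quantizer that clips tensors to a fixed [-1, 1] range."""

    def build(self, tensor_shape, name, layer):
        # Construct any variables you need; none for this simple strategy.
        return {}

    def __call__(self, inputs, training, weights, **kwargs):
        # Quantize the tensors however you wish and return them.
        return tf.keras.backend.clip(inputs, -1.0, 1.0)

    def get_config(self):
        # No constructor arguments to serialize.
        return {}
```

A quantizer like this would be plugged into a QuantizeConfig, such as the one sketched earlier, in place of the built-in quantizers.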

  • And it doesn't end here.

  • We actually provide you the ability

  • to completely kind of define your own schemes,

  • specify how each layer should be quantized,

  • going so far as to kind of--

  • I mentioned earlier that you can--

  • we fuse the ReLUs for you, for example,

  • so you can actually define your own kind of transforms,

  • which tell what sort of manipulations

  • you want to do on the graph.

  • So in summary, Quantization Aware Training

  • is an API which helps you recover

  • your accuracy while getting the benefits of quantization.

  • It's a pretty simple API for easy tasks,

  • but quite flexible if you want to do more complicated things.

  • And it simulates quantization loss

  • that happens on various different backends and schemes.

  • You can kind of configure that.

  • There are cooler things coming up.

  • We released the sparsity training-time

  • API some time back.

  • But now we're working on sparse kernel execution,

  • and that's coming up.

  • And then you'll have an end-to-end story,

  • where you can train sparse models and execute them on device.

  • And you can also use quantization and sparsity

  • together, and that's quite powerful when they go together.
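
For reference, here is a minimal sketch of the sparsity training-time API mentioned above; the model and the commented-out training data are placeholders.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder Keras model standing in for your own architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# Wrap the model for magnitude-based weight pruning (sparsity).
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model)

pruned_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Pruning requires this callback to update the pruning step during training.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
# pruned_model.fit(train_images, train_labels, callbacks=callbacks)  # placeholder data
```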

  • So that's the model optimization toolkit.

  • It's a suite of tools that make your models faster and smaller.

  • Quantization and sparsity are the main techniques

  • that we have.

  • You can find us on github/model-optimization.

  • Please file any requests, concerns, bugs, or feedback

  • that you have, and we're always working

  • on making those models smaller and faster.

  • Thank you.

  • [MUSIC PLAYING]
