SUHARSH SIVAKUMAR: We'll get started.
So hi, everyone, I'm Suharsh, and I'm here
to talk about the TensorFlow Model Optimization
Toolkit, which has techniques
for quantization and pruning.
And feel free to ask questions or interrupt along the way.
I want this to be super interactive.
So what we're going to talk about today
is the high level of what quantization is, the challenge
it poses, and why it matters.
And then more of the specifics on,
in TensorFlow, what we are doing to work
on quantization and pruning.
So overall, quantization, the idea
is that you have your floating point network
with your inference graph, which is a floating point program.
And we're going to make modifications
to this program in the general sense
that we take these floating point calculations
and make them lower precision.
And the goal is to get as close in accuracy
as possible while providing some performance improvements.
So usually, this involves--
this is very general--
there's some function from the floating point to the integer
value.
There's a process to do the conversion
to make it valid for a particular hardware.
And then there's various algorithms
we have to get these parameters needed
for this function in the most efficient way.
So this is really general, and it may not make sense now,
but we'll make it more specific later.
Mhm?
AUDIENCE: Do the same conversion functions
work for mobile devices as well as specialized hardware?
SUHARSH SIVAKUMAR: No, and that's one of the challenges.
And we'll get to all the challenges.
That's a really good question [INAUDIBLE]..
Mhm?
AUDIENCE: I had another question.
Will you also be [? motivating ?] soon why this
is not as simple as a [? downcast ?] from float
to int [INAUDIBLE]?
SUHARSH SIVAKUMAR: Yes, it'll all make sense, I hope.
AUDIENCE: We're obviously very interested.
SUHARSH SIVAKUMAR: So why does this matter?
So the first thing is that the ML programs have
lots of parameters, and by using lower precision,
we can instantly get these models a lot smaller, which
can help with memory bandwidth and network
costs of downloading models.
Second, if you have all your calculations in integers,
you could have lots of optimizations that
make the execution super fast.
Third, integers are super power efficient.
So on mobile, this is really important.
And then finally, this lets us explore a whole new avenue
of hardware design, where we can make custom chips-- seastar
was the first, then Edge TPU, and the new TPUs
have integer operations.
And this can get us cheap, power efficient, fast hardware.
[INAUDIBLE]
So--
AUDIENCE: I think it helps if instead
of saying integer operations, you
say fixed point [? fraction ?] operations.
[INAUDIBLE]
SUHARSH SIVAKUMAR: So I avoid it because it's
only kind of fixed point.
It's not like-- it is fixed point, but when I--
so I've said fixed point in the past,
and then folks always say it's not truly fixed point
because fixed point applies a rescale every time you combine
the two values, and sometimes I get pushback.
So I'm going to avoid--
because I used to say quantization,
and then people would say there's
a hundred steps of quantization.
So the integers are the key here because that's
what's providing the acceleration that's
specific to what we're doing in the TensorFlow stack.
And the specifics, I guess, will make sense
after we go into the equations.
So why is quantization hard?
And this was your point that we have different chips.
So each chip has its own specific tradeoffs
it chose to make.
Some may only support int8, some may support int16,
some may want power-of-2 rescales.
All these really one-off decisions
make the deployment story-- how
do you take a general TensorFlow program
and put it on one of these chips-- really hard.
For float, we started to get to a world
where we can just say float can run anywhere.
But for these things, there's not
a lot of standardization on how to do this.
The second reason it's hard is it often requires custom tooling
because you need extra metadata that often can only
be gathered by running inferences to know
how to quantize the values.
And we'll get more into that in detail.
So there's often an extra step in the process.
And then finally, for every specific ML problem,
we don't have a good answer for how
quantization will affect it.
You can use the same architecture,
but just do something else for your particular task
with the outputs of that architecture,
and quantization may help or hurt.
And it's pretty empirical right now,
where we just try it and see.
And we're still in the process of gathering a lot of examples.
But one of the goals we need to work on in ML research
is understand these models more to determine how quantization
error will impact things.
So now more into the detail.
So currently, what most hardware implements,
and what the TensorFlow and TensorFlow Lite stack
implement, is affine quantization, which
is like us milking y equals mx
plus b, from seventh grade, for the rest of our lives.
[LAUGHING]
So basically, you uniformly distribute your range
into fewer chunks than you had before,
and then bucketize them.
And this is effectively what all quantization is.
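Concretely, the bucketizing being described is just an affine map. As a rough illustration (written here for an unsigned b-bit range, so the exact constants are illustrative):

$$
\Delta = \frac{x_{\max} - x_{\min}}{2^{b} - 1},
\qquad
q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x - x_{\min}}{\Delta}\right),\; 0,\; 2^{b} - 1\right),
\qquad
\hat{x} = x_{\min} + \Delta \cdot q .
$$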
Currently, we have different ways of gathering statistics
to determine how to quantize.
So going back to this picture for a second,
we need some sort of min and max value to know how to quantize.
So this implies that we need tooling
to get this information.
And we have two types of tooling right now.
There's during training tooling, where
you can incorporate this as part of your training pipeline.
At the end of the day, you have a trained model
that has information on how to quantize it.
And you can also do this post-training,
and we'll talk about the trade-offs later.
AUDIENCE: I have a question.
So why are you [INAUDIBLE] not beyond the boundary
of your possible values?
Why do you choose-- or do you purposefully choose
to leave some values out?
SUHARSH SIVAKUMAR: So, yeah.
The min and max, it's kind of an open question
on what is the optimal min-max given a tensor, if I answered
the question right.
So you could choose to put your min-max much smaller
than your actual value seen, and you'll get some clipping.
And depending on the model and the problem,
we wouldn't really know if it's useful or not.
Because sometimes models don't care
about those extraneous values, and sometimes, they're
the most important thing in the whole model.
AUDIENCE: The tricky thing is that when you set your min
and max, and if you're using int8,
you only have 255 values between the min and max.
Every [? activation ?] has to be cast
into one of those 255 values.
If you [INAUDIBLE] minus infinity [INAUDIBLE]
plus infinity, that's really useless.
But if your min is 0 and your max is 0.01,
you can represent computations with a lot of precision,
so it's the trade-off.
SUHARSH SIVAKUMAR: Yeah.
And we do different types of these depending on the model.
And we've seen weird things where--
and it's always this battle between how much
does the network care about these extreme values versus how
much does it care about the average rounding
error along the way.
So it's always this rounding versus clipping trade-off--
we just play with this a lot.
AUDIENCE: You mentioned min and max being primarily
influenced from training.
But [? I'd ?] like to also do this [? at ?] [? inference-- ?]
there's a constant feedback loop from--
SUHARSH SIVAKUMAR: So it's training or post-training.
So post-training might influence for like model compilation
time, and it doesn't stop.
AUDIENCE: Could I just clarify?
So the point is that it's no-- it wouldn't be considered
quantization if you just reduced float32
to float16, for example-- float8 or whatever.
So you still have a separate exponent
and you have just kind of fewer bits.
That's not considered quantization?
SUHARSH SIVAKUMAR: So technically, it is.
So the textbook term of quantization,
it is quantization.
But the quantization we're talking about here
is this integer quantization where
you have a shared min-max.
AUDIENCE: Where you really don't want it to have that--
SUHARSH SIVAKUMAR: [? We're ?] using that scale.
AUDIENCE: ----[INAUDIBLE] the exponent.
So that's the only thing that's useful for the hardware
[INAUDIBLE].
SUHARSH SIVAKUMAR: Exactly.
So in other like DSP literature, it's
sometimes called "block floating point," where
you have the exponent shared across all values of a tensor
rather than one exponent per element.
So in a way, float is just per element quantization.
Yeah.
So during training, the idea of during training quantization
is that you want to somehow get this network
to be robust to this error that quantization introduces.
So you emulate the effect of quantization
in the forward pass.
So if you ever see these TensorFlow fake_quant
operations, or the contrib quantize rewriter tool,
this is its goal.
It's saying given a graph, we'll rewrite the forward pass
to emulate the error due to quantization,
and then in the backward pass, we'll
do some tricks to skip over those non-differentiable parts
that quantization introduces.
And then the goal is that backprop will magically
make the weights better for quantization.
And this can often get the best accuracy given
a particular schema of quantization,
but it's also really hard to train sometimes.
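As a rough illustration of that forward-pass emulation (the min/max values below are made-up placeholders, not what the rewriter would actually pick), a fake_quant op just round-trips a tensor through 8 bits:

```python
import tensorflow as tf

# Placeholder range; in the real tooling this comes from collected statistics.
act_min, act_max = -6.0, 6.0

def emulate_quantization(x):
    # Quantize and immediately dequantize x in the forward pass, so training
    # sees the same rounding/clipping error that inference will see.
    return tf.quantization.fake_quant_with_min_max_args(
        x, min=act_min, max=act_max, num_bits=8)
```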
And machine learning, as we all know,
is the art of making as few changes as possible
to your training to get it to converge.
And the second you do more, oftentimes, you
won't even converge if you go to too low of a precision.
And you just have to play around a lot
if you try the training route.
Additionally, the error introduced in training
is specific for a particular target.
So if you want the result of your training
to be portable and work across many different chips,
you're kind of in trouble now if they
have different characteristics.
AUDIENCE: So by emulating quantization,
does that mean on the forward pass after every op,
you just apply the quantization?
SUHARSH SIVAKUMAR: Yeah, and it's
a bit trickier than after every op,
because it's after every rescale that the hardware expects.
So a specific example is like, in TensorFlow,
you have conv, bias-add, ReLU. In most
of these inference backends, those are fused into one
fat conv-bias-ReLU.
And your rescales are only at the inputs of the conv
and the outputs of the ReLU. So you
should only emulate quantization there.
So you kind of need knowledge of what the target's expectations
are to decide where to put it.
So it's not just before and after every op.
AUDIENCE: And do you just use the current running max
and min?
SUHARSH SIVAKUMAR: Yeah.
So right now, we do moving average.
For certain models, we played with absolute min
and absolute max.
And it's really-- sometimes, we use schedules to slowly,
manually constrain it.
And this is where the art part comes in, and it's not really
well understood how to do that generally.
So right now, for all the mobile--
like, all the vision models, we do moving average.
And it seems to work pretty well,
but we don't know if that's optimal or not.
It just turns out backprop is kind of magical.
AUDIENCE: And backprop you don't apply this at all.
SUHARSH SIVAKUMAR: Backprop, we use
this thing called "straight-through estimator."
The main problem with this quantization
is that it's a step function, so it's not differentiable.
So we pretend it's an identity, and we just
pass the gradient right through, and this gets [? it to ?]
train.
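A minimal sketch of that straight-through trick (this is the generic idea, not the exact gradient registration TensorFlow uses for fake_quant):

```python
import tensorflow as tf

def quantize_ste(x, scale, zero_point):
    # Forward pass: a real quantize/dequantize round trip (a step function).
    q = tf.clip_by_value(tf.round(x / scale) + zero_point, -128.0, 127.0)
    dequantized = (q - zero_point) * scale
    # Backward pass: treat the round trip as the identity, so the gradient
    # flows straight through the non-differentiable rounding.
    return x + tf.stop_gradient(dequantized - x)
```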
AUDIENCE: It works in practice.
SUHARSH SIVAKUMAR: Mhm?
AUDIENCE: And just to clarify, there's
never a case where quantization is used in training just
to speed up the training.
It's only used in training because of the idea
that it would speed up inference.
Is that correct?
SUHARSH SIVAKUMAR: So there is some work.
I don't know if it's ever used in practice,
but there's been a few papers over the years that
do quantization for speeding up training as well.
But this particular one is always--
everything in this talk, the goal is for inference.
And so this is purely to emulate what's happening at inference.
And oftentimes, it will be slower
to train these models than to actually
just do floating point.
AUDIENCE: And just to be sure, I thought
more than actually speeding up inference,
the goal with quantization [? over ?] training
is to actually reduce errors?
SUHARSH SIVAKUMAR: Yeah, to reduce the accuracy
loss that you get when you eventually go to inference.
But the ultimate goal of this whole tooling
is to enable inference performance
for some particular hardware.
So that being said, we've been trying
to work really hard to avoid the need for this
in most general cases.
During training will always be the most accurate
because you're letting that effort make up for it,
but we think we can get pretty far
with post-training techniques.
With post-training, the trade-offs
are that you can't rely on this magical, huge hammer
of backpropagation to fix all your accuracy issues,
but you can do some things.
And additionally, the main benefit
is that the user doesn't have to retrain,
which is a pain because oftentimes, it won't converge,
you have to mess with hyperparameters,
your portability is gone.
So here, there's a compile step.
Or sometimes, like you were saying, even at runtime,
there's a step to collect these statistics to do that min-max.
So the second technique we have-- so
we'll get back to quantization for the majority of the talk,
but I just want to mention pruning.
So the other technique we have is
pruning, where the goal is to end up with tensors in your model
that have many zeros.
And these-- so if you do arbitrary pruning,
where your resulting model has many zeros,
it's much more compressible.
And additionally, if you have a certain structure
to your pruning, or a certain percentage of sparsity,
you can have optimized kernels that accelerate things.
So the benefit is that you have so many repeated values in them
that you can just zip your file and you're good to go.
And then if you actually have hardware support for sparsity,
you can get faster kernels.
And one more point on pruning, which I think is kind of cool,
is that all the zeros--
since you have so many repeated zeros,
and zeros in quantization represent exactly,
it actually works really, really well with quantization,
and often helps quantization, which is kind of-- they're
like, compressing in two orthogonal ways, which
is kind of neat.
So now we'll talk about all the tools.
So yeah, last year, we released this model optimization toolkit
which is a suite of TensorFlow and TensorFlow Lite tools
that aim to make all these techniques doable,
and let us play around with trying out new things
with quantization and pruning.
So you can check that out here.
So here's my world famous hand.
This went on Twitter, and this is my hand [INAUDIBLE]..
AUDIENCE: You have [? tweeted ?] your hand, I think.
SUHARSH SIVAKUMAR: Yeah, that's true.
[LAUGHING]
We've been reusing these pictures way too much.
So we have quantization and sparsity.
So first, we'll deep dive in all the tools in quantization
in a bit more detail on how we actually do quantization.
So the first thing we've done in TensorFlow Lite
is try to understand for many of the canonical models
all the operations that are in there.
And what are some standard recipes
on how to implement these fixed point quantized kernels?
And the goal here is that we want some sort of endorsement
for a new hardware that comes in.
And we know that this is going to be a work in progress
because new chips are coming all the time.
They have different constraints, and they don't
want to listen to one standard.
But we want to be like some reference point
to where we can compare, oh, this new quantization scheme,
how does it compare to this?
So the goal with this is to have a bunch of CPU reference ops
that have been tried on many models,
and that we understand to some extent.
So this is a bit more detail on how we actually
do the quantization.
So the bottom number line is the floating point scale,
and that histogram is a pretend distribution of values
in a particular tensor.
And the idea of quantization is instead of wasting all our bits
representing this range that we don't even use,
let's figure out only the part that the histogram lies in,
and only represent that with a smaller number of bits.
So the top number line is the integer equivalent
of that, where we took that histogram
and we just use these 255 buckets to represent the number
line.
So this is just that same affine equation.
At inference time, we actually have--
we change this min-max to two different things called
"scale" and "zero point."
And scale is the floating point size of every bucket,
and zero point is an integer value that corresponds exactly
to floating point 0.
And this turns out to be really important.
[? C ?] started to do this, and it
resulted in a lot of bias issues,
where for every multiply accumulate you have,
if you don't represent 0 exactly,
you just push this bias.
And then it also has a convenient side benefit--
oftentimes, in the models, we do padding,
and zero is just a special number
that we have to represent.
But the main thing is the accumulation.
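A minimal sketch of how a scale and zero point can be derived from a min/max, including forcing 0.0 to be exactly representable (the exact clamping details in TF Lite may differ):

```python
import numpy as np

def choose_quant_params(x_min, x_max, qmin=-128, qmax=127):
    # Make sure the representable range contains 0.0 so that zero
    # (e.g. padding) maps to an exact integer.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # avoid divide-by-zero
    zero_point = int(np.clip(round(qmin - x_min / scale), qmin, qmax))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)
```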
So this is just to give some insight
into why these tools actually
need this information.
but here's the summary of our quantization spec.
And we have per-axis symmetric weights, per-layer asymmetric
activations, and then the zero point
is-- all these things are signed integer values.
And I'll explain each of these, actually, because right now it
won't make any sense.
So the first part of the specification is symmetry.
And the idea here is, do you want
to make your scale be able to represent values that are
really not centered around 0?
And this means often that that zero point--
I'll go back to the equation real quick--
that zero point here, do we want to have
the cost of that addition?
And depending on where this happens in your math,
it can be really expensive or not too expensive.
And so for symmetry, we've decided
to make weights symmetric, and the reason
is that since weights are constants,
the zero point is multiplied by the dynamic activations.
So this is a cost that you'd have
to do that's dependent on the input every time.
So having weights be asymmetric, every inference
has a cost that's additional.
And so weights being symmetric avoids this whole zero point
multiplication of activations.
And we can answer more later, but I won't
go too much in depth here.
So it's faster if we make weight symmetric.
And activations, they're only multiplied by a constant value,
so having them have this zero point is not too expensive.
So we leave them asymmetric, and the activations
are often [INAUDIBLE] and stuff, which are super asymmetric.
So we'd be throwing away a bit if we don't do that.
So the second thing we can play around with in quantization
is the granularity in which we decide to have these min-maxes
or scales.
And traditionally, we were doing per layer quantization--
or per tensor quantization.
For a given tensor, you only have one min-max.
But it turns out for convolutions and [INAUDIBLE]
convolutions, often, each channel of the convolution
has a really different distribution.
And when you only have one scale or one
min-max for the entire tensor, you're
doing a really poor job in each of these distributions.
So the idea of per channel quantization
is you have a min-max per channel.
And since this is not in the inner loop of your kernels,
it's really not too expensive, and gets a huge benefit
in accuracy--
effectively like an extra bit.
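For example, per-channel symmetric scales for a conv kernel might be computed like this (the [H, W, in, out] layout is just an assumption for the sketch):

```python
import numpy as np

def per_channel_symmetric_scales(kernel, num_bits=8):
    # kernel: e.g. shape [H, W, in_channels, out_channels].
    # One symmetric scale per output channel instead of one for the whole tensor.
    reduce_axes = tuple(range(kernel.ndim - 1))   # everything but out_channels
    max_abs = np.max(np.abs(kernel), axis=reduce_axes)
    return max_abs / (2 ** (num_bits - 1) - 1)    # so |w| <= 127 * scale
```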
So now to the tools.
So the tool fragmentation is all about
how we get these min-max values
that we need to do the quantization.
And so for weights, it's super easy.
Weights are static, so we can anytime just
look at the weights, read the min-max,
and quantize using those min-max.
So the problem always comes in dynamic values and activations
that you can only get an idea of the distribution
by actually running realistic inputs.
So the first, most naive, simplest idea
on how to do quantization is let's
read the quantization parameters the second we know them,
which is right at inference.
So during runtime, our graph is actually different.
Before our expensive multiplies, or matmuls,
we take the float input value, measure the min-max,
and use those to quantize on the fly.
So this is like an O(n) operation of quantizing on the fly.
Then get the speedup of doing an int8-by-int8 multiply
on your matmul, and then go back
to float at the end.
So the idea here is you get the most realistic
min-max range for your activations
because you're using the one for this particular inference.
The flaws are that you can only really do
this on chips that have float support.
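A toy numpy sketch of that on-the-fly (hybrid) path, using a simple symmetric range for the activations just for illustration:

```python
import numpy as np

def hybrid_matmul(x_float, w_int8, w_scale):
    # Quantize the float activations using this particular input's own range,
    # do the matmul in integer arithmetic, then go back to float at the end.
    x_scale = np.max(np.abs(x_float)) / 127.0
    x_int8 = np.clip(np.round(x_float / x_scale), -127, 127).astype(np.int8)
    acc = x_int8.astype(np.int32) @ w_int8.astype(np.int32)
    return acc.astype(np.float32) * (x_scale * w_scale)
```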
The second time we could do this is--
if we want the whole graph to be integer,
we want to avoid this runtime cost of measuring
the min-max because we don't want any float
on any edge of this graph.
So what we can do is simply move that to compile time.
And so you have your float model,
and we want to do some post-training figuring out
of what the values are for all these dynamic values.
So to do this, we need some representative data
that we can run through the model,
collect ranges then, and then fix those min-max values
for the activations.
And this means that we're not using the perfect min-max,
like we were for hybrid quantization before.
But we are working on getting a representative one,
and we never have to have float in our inference graphs,
so this can go to all those integer accelerators.
AUDIENCE: So wait, I had a question kind
of related to the previous slide.
So the choice of whether to do hybrid or not,
is that multifaceted based on improving accuracy
because now you get better min-maxes,
but also the hybrid needs to support the float biases,
right?
SUHARSH SIVAKUMAR: Yeah.
So it's really problem specific.
So we'll get a little bit into that later as well,
but the short answer is, yes, it's
multifaceted in that it usually is a good choice if you're
going to CPU.
It's a bad choice if you have models
that have large activations.
Like image models don't get a huge benefit from hybrid
because your cost of doing this on the fly quantization
is pretty big.
And then accuracy really improves
for models with small activations
because you're kind of getting a more representative range
for that small tensor.
AUDIENCE: And also if you want truly low latency inference,
maybe it's harder [INAUDIBLE].
SUHARSH SIVAKUMAR: Yeah.
Mhm?
AUDIENCE: I was going to ask, how much [INAUDIBLE] do you
get from the hybrid approach?
And that's pretty expensive if you have to--
SUHARSH SIVAKUMAR: Yeah.
It can be, and it really depends on the model.
So I think we have some specific numbers.
But it really shines in models that
are kind of memory bound, because your main cost is
this n cubed thing.
Your activations may not be too big,
but you're getting this huge benefit of really driving
that matmul.
So then the third tool is integer-only quantization--
or during-training integer-only quantization.
So this results in the same compatible graph
as that post-training integer quantization
in the previous slide, but the difference
is we're introducing the quantization
into the training that we talked about before.
So we're working on keras APIs [INAUDIBLE]..
So the way this looks in--
the way this will look is you build your model as before,
and you just wrap it in this quantize wrapper.
And there'll be-- there's parameters too.
We won't go in too much detail.
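Roughly, that wrapper looks like this with the tensorflow_model_optimization package (the toy model here is a placeholder, and the exact API surface may differ from what existed at the time of the talk):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Build the model as before (a toy model for illustration).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(10),
])

# Wrap it so the forward pass emulates quantization during training.
q_aware_model = tfmot.quantization.keras.quantize_model(model)
q_aware_model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
```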
For hybrid quantization, the way it looks
is you train your normal graph for TensorFlow,
and then you just enable a flag in the TF Lite converter.
So right now, we have hybrid and the post-training
only enabled in TF Lite because we want to make it general,
but right now we only have specifics on the hardware
capabilities of TF Lite, and we need
to know these to be able to do this.
So the way this looks is your normal TF Lite converter
invocation, and you just add this optimizations default
flag.
And under the hood, this is just doing this hybrid quantization
of just quantizing all the weights
and leaving the activations in float.
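That flag looks roughly like this (saved_model_dir is a placeholder path):

```python
import tensorflow as tf

saved_model_dir = '/tmp/my_float_model'  # placeholder

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # weights -> int8, activations stay float
tflite_hybrid_model = converter.convert()
```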
So performance.
First off, all these approaches get similar model size
reduction, in that you're simply taking 32 bits,
going to 8 bits, so you're getting a 4x reduction in size.
For latency, like here, we see the--
we do get a speedup in these image models,
but for a lot of them, we don't see as much of a speedup
as we would expect from quantization.
And it's because the on-the-fly cost is actually pretty high.
AUDIENCE: What hardware is this?
Is this just like a CPU?
SUHARSH SIVAKUMAR: This is all CPU.
So like on accelerators, this will be--
the integer ones will really shine on custom accelerators.
So accuracy, we do see an accuracy drop
in a lot of these models.
And a lot of this, we are working on ways
to nudge weights at different times
during compilation to fix these accuracy issues.
And so all these, this is not like the gold standard
in what quantization can get in these techniques.
It's just a starting point.
So yeah, 4x reduction.
You see a 10% to 50% speedup in convolution models on the CPU.
And then for memory bound models,
you really see a lot more.
And you often get most of the bang
for the buck of quantization from hybrid
in those models versus needing the full integer.
That being said, for accelerators,
you'd still need to go the full integer route.
So post-training integer quantization.
So this is also enabled in TF Lite.
You train in TensorFlow the normal way
you would a float graph, and then you provide one more
option into the converter.
And the way that looks is you do the same flag as before--
[? Optimize ?] default.
But now we need some data to figure out those dynamic ranges
at compile time rather than at runtime.
So this data generator you provide
needs to yield examples that you would
expect to see in practice.
And so for like image models, we just grab a few images
from [INAUDIBLE].
And usually, we see a couple hundred works well enough,
but it's probably very problem specific.
So under the hood, this is doing that post-training quantization
where we measure the absolute min and absolute max we
see for particular activations.
Mhm?
AUDIENCE: Why would a hybrid model be [INAUDIBLE]??
I mean, ultimately, you still have inferences still coming
in, so even if maybe the first one--
like the first 1,000 is slow, after 1,000,
you definitely have those statistics.
Why would you ever not just [INAUDIBLE] at that point?
SUHARSH SIVAKUMAR: That's the question, yeah.
And so--
AUDIENCE: [INAUDIBLE]
SUHARSH SIVAKUMAR: You could do that.
So oftentimes, it turns out these--
for like the RNN models, you actually get
an accuracy benefit from hybrid, which because--
AUDIENCE: Even if you had a bunch of data?
SUHARSH SIVAKUMAR: Even if you had a bunch
because each activation actually is getting a really unique
range.
AUDIENCE: Because it's float.
SUHARSH SIVAKUMAR: Yeah.
And also because you can imagine in RNN,
that same op is actually going to change its distribution
based on which time step you're on.
And so it really ends up being problem specific there.
But you're right, for like image models,
we absolutely could be doing that.
So yeah, the example of representative dataset
is just how you would normally load data.
And you just yield examples of these images.
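Put together, that looks roughly like this (the paths and the calibration array are placeholders):

```python
import numpy as np
import tensorflow as tf

saved_model_dir = '/tmp/my_float_model'          # placeholder
calibration_images = np.load('/tmp/calib.npy')   # placeholder: a few hundred inputs

def representative_dataset():
    # Yield realistic inputs so the converter can measure activation
    # min/max ranges at conversion time instead of at runtime.
    for image in calibration_images[:200]:
        yield [image[np.newaxis, ...].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_int_model = converter.convert()
```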
So now some numbers.
So before we had released this, the contrib
quantize rewriter-- which I'm not talking about in this talk
because it's deprecated for a more friendly 2.0-capable API.
But those are kind of the gold standard
in quantization accuracy numbers for these image classification
problems.
And what we've seen is that with these changes of per
channel into our quantization scheme,
post-training integer quantization, which
is the right column, gets pretty comparable accuracy on all these models
that matter at the moment.
And this is without anything fancy.
So Denali has been looking into a lot of cool tricks
that are figuring out how to get--
where the accuracy is going in post-training.
So these numbers should be improving as well.
But the takeaway here is that most things--
8-bit-- maybe we're good enough with post-training,
and only the experts really need to use
quantization-aware training.
So this is an example of quantization not working well--
that's the first column.
With SSD, it's the same base structure as MobileNet,
but what you're doing with your [? logits ?] is a lot more.
So quantizing actually introduces a lot more error
here, and we see over a percent drop
in post-training versus quantization-aware training.
And this "higher is better" is [? wrong. ?]
[LAUGHING]
So the other two columns are new models,
and no one ever went about doing quantization-aware training
here because it was just too much work,
and because they tried post-training.
These were released after post-training was released,
and post-training did really well accuracy wise,
so they just didn't bother with quantization-aware training.
More models.
Style transfer, we got good results on quantization,
although there's not really a good metric for style transfer.
The metric is like, look at it and it looks good enough.
And then some speech models do really good.
Everything's great.
[INAUDIBLE]
So the benefit of post-training integer quantization
is similar size reduction, and similar speedup on the CPU
for RNNs and convolutions--
even better for convolutions because you
don't have this on-the-fly cost.
But the main thing this enables is all these integer
microcontrollers, all these integer accelerators can now--
we can run on them.
So here's the summary of the three tools.
And the flow should usually look like you try hybrid,
and you see how you do on CPU.
If you want to go to an accelerator
or you want more in CPU, you do the post-training
where you just add some representative data set.
And then only as a last resort, once you see post-training not
getting good accuracy for you, try
quantization-aware training.
So similarly, we have tools for connection pruning, which
are during training techniques.
And so they have a similar API to the quantization-aware
training API.
And so the flow usually is
you build your keras model, you apply pruning
with the API, and you train.
And often, these pruning APIs are doing a lot less--
they're very localized to your weights.
So they're not really tearing apart your graph
like quantization is.
And [INAUDIBLE] can attest to this, where the pruning was
a lot simpler implementation wise than quantization-aware
training, because for training-- for quantization,
you have to understand all the fusions of your backend,
whereas pruning is local to the weights.
And so the flow here is you train like normal,
and the resulting graph has many tensors
that have lots of zeros.
And right now, the flow is that you can compress your file
and it's smaller.
And in the future, we're working on TensorFlow Lite runtime
support for these sparse tensors and kernel support.
So additionally, you'll get out of the box size reduction
instead of having to do this manual compression,
and you'll get speed up--
AUDIENCE: For the sparsity, in the future,
when you say that it might be faster,
is it in the case of structured sparsity where you force
[INAUDIBLE] sparsity, or is it for arbitrary sparsity that it
might--
SUHARSH SIVAKUMAR: So it really--
yeah, so this is something where we're trying
to figure out two things.
Those particular questions for a given hardware, what do we
want, and how do we expose this in a way that
makes sense when all this-- there's
so much fragmentation for hardware and problems.
So for certain problems, if you do arbitrary sparsity,
you probably need like 99.9% sparsity
to get a speed up on a particular hardware.
And for CPUs, and particularly speech models,
we've already been doing structured sparsity
with certain block sizes like you're saying.
And so this training tool has the ability
to set your block size.
And right now, we're working on--
where we need to work on in the future, for a given
hardware, what is the standard block size you need for that.
And so yeah, you're absolutely right.
There's fragmentation [? too. ?] It's like,
will the problem will allow this level of sparsity
that you desire, and is the hardware you
target going to support that?
So yeah, for CPU, usually [INAUDIBLE]..
So the API here is similar to the quantization API.
You provide parameters on your schedule
for how you want to prune.
And here, that final sparsity is an important number.
It's basically saying at the end of training,
how many values in all your weights do you want to be 0?
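A minimal sketch of that schedule with the tensorflow_model_optimization pruning API (the toy model and step numbers are placeholders):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(10),
])

# Ramp sparsity up during training until 75% of the weight values are zero.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.75,
    begin_step=1000, end_step=10000)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=schedule)

# The pruning masks get updated during training via this callback.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
```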
So yeah, [INAUDIBLE].
The coverage we found is very--
it works on a lot of models.
It seems to be a very general technique.
And as I said before, it works really well with quantization
as well.
So here's a graph that's kind of a confusing graph.
But it's how your accuracy is affected on MobileNet.
This is an example based on how much pruning you do.
And what we noticed is that there is often a lot of pruning
you get for free, and then there's a sudden cliff.
So the goal here is to, for your problem,
to play with the parameters and figure out
where is that sudden cliff, or where do you
want to lie on this curve?
So here, we see around like 75-ish percent.
You're doing pretty good until then.
AUDIENCE: What technique was used
to actually do the pruning?
SUHARSH SIVAKUMAR: So here, we do pruning based
on the low magnitude value.
So there's a mask, and then you update that occasionally,
and the mask is updated based on which values of your tensor
are closer to 0.
So more numbers-- works great.
Skip.
And then so in summary, quantization is hard
because it is problem specific, hardware specific,
and the tools have lots of trade-offs
depending on which problem with which hardware.
And then pruning, we're starting to get
into the space of accelerating pruning.
And right now, it's a great technique
for reducing model size, and we need
to explore how that's going to look for various hardware and
how we're going to expose this in a general way.
So otherwise, any questions I can answer?
AUDIENCE: So how does a CPU actually do the--
make [? its ?] multiplication given two inputs
with a min and max for each?
SUHARSH SIVAKUMAR: So the way it looks--
I don't know if I have anything to look at.
[INAUDIBLE]
Yeah, I'll try that.
AUDIENCE: It's like, the min and max are--
SUHARSH SIVAKUMAR: So I'll say it in words first.
If it doesn't make sense, I can try to find something.
So the way it actually works at inference--
So let's ignore zero point for a moment,
because it just gets in the way.
So say we're just doing a matrix multiplication.
So your input has a certain range
which corresponds to a particular scale.
Your second input, your weight has a certain range which
corresponds to a certain scale.
So you have one scale, another scale, and then
your output has a third independent range
on the third scale.
So what we do is your int8 matrix multiplication actually
gets accumulated into int32 values.
So if you imagine that all those int32 values
in the accumulator, they have an implicit scale--
because you just multiply it--
of these two scales multiplied.
If you wanted to recover the float
value from these int32 values, you just
multiply by these two scales.
So that's not how it actually works, but I'm just
explaining the math.
And so then our goal is to eventually output int8 values
that lie on the output scale.
So what we do in practice is we want
to get from this int32 value that
has implicit scale of this scale and this scale--
s1 and s2-- and go to s3.
So we just multiply by s3 and divide by s1 and s2.
So we make a new scale that's those three values--
that fraction.
So that's how the-- so in practice, the inference just
looks like int8 times int8, int32,
do this one rescale, which is s1, s2 over s3.
And then you're [INAUDIBLE] value, if that makes sense.
I could-- yeah.
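In symbols (zero points ignored, as above), the rescale being described is:

$$
\text{acc} = \sum_i q_1^{(i)}\, q_2^{(i)} \ \ (\text{int32}),
\qquad
q_3 \approx \operatorname{round}\!\left(\frac{s_1 s_2}{s_3}\,\text{acc}\right) + z_3 .
$$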
AUDIENCE: So you don't have to do an integer division?
SUHARSH SIVAKUMAR: Yeah.
And so that rescale is a floating point value,
and we don't want to do that.
So we do decompose that into two integers, and sometimes a shift
depending if you're like-- sometimes,
your target only supports power of 2 scales
because it just wants to implement that as a shift.
So there's lots of-- that's a whole other thing, where
there's lots of ways to implement that
rescale [INAUDIBLE] trade-offs.
So what TF Lite does by default is we decompose it into two
integers, and do like a--
we almost emulate float.
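A rough sketch of that decomposition of the rescale M = s1*s2/s3 into an integer multiplier plus a shift (gemmlowp-style; the exact rounding TF Lite uses may differ):

```python
import numpy as np

def quantize_multiplier(real_multiplier):
    # Decompose 0 < M < 1 into an int32 fixed-point multiplier and a right
    # shift, so the rescale needs no floating point at inference time.
    shift = 0
    while real_multiplier < 0.5:
        real_multiplier *= 2.0
        shift += 1
    return int(round(real_multiplier * (1 << 31))), shift

def requantize(acc_int32, quantized_multiplier, shift):
    # Rounding high multiply: (acc * multiplier) >> (31 + shift).
    total = np.int64(acc_int32) * np.int64(quantized_multiplier)
    rounding = np.int64(1) << (31 + shift - 1)
    return np.int32((total + rounding) >> (31 + shift))
```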
AUDIENCE: [INAUDIBLE] being used in training.
Is that consistent with what--
I mean, [INAUDIBLE] in training--
could you describe it with something orthogonal?
SUHARSH SIVAKUMAR: Yeah.
There's a lot of techniques that we need to start including.
And so right now, these techniques
have been these kind of end-to-end
get-something-working type techniques.
Where first, there was no training,
so we went to quantization during training.
But more and more, like [? wesde ?]
I think you're talking about.
Like there's [? no such wesde, ?] where
the idea is if you're given a particular min-max,
what is the perfect range--
perfect distribution of values such to decrease quantization
error?
And the answer is the uniform distribution.
So [? wesde ?] tries to do this by introducing loss
into your training.
And we-- these things all are compatible with when
you train on the float model, but they're not
offered out of the box.
Because we have noticed things-- in some of my experiments,
I noticed that [? wesde ?] only works
well for a particular model after you've trained for a bit.
We still don't have general knowledge
on when exactly to use it.
So we should be offering all of these, and we plan to
in this toolkit as choices for users.
But yeah, that's a great technique.
AUDIENCE: I'm sorry, [INAUDIBLE] asking [INAUDIBLE] question.
You mentioned in-training and post-training techniques,
and then also you can do it hybrid or you can do pure int.
So in this [? quadrant, ?] one kind of option was missing.
So you didn't-- you showed three examples,
but you implied that you wouldn't be doing in-training
quantization combined with the hybrid.
Why is that?
SUHARSH SIVAKUMAR: You're absolutely right.
And it's just that the tooling is not doing that right now.
But that's exactly the direction we want to go.
That use-- get some metrics on what error
the quantization is using, and use that to drive things like,
should we be doing one or the other?
Should we be doing 8 bits, or should we--
for this one tensor, does it make sense
to leave it in float, does it makes sense
to bump it up to 16?
But that's absolutely right, where--
AUDIENCE: So there's nothing inherently wrong with it.
It's just another option [INAUDIBLE]??
SUHARSH SIVAKUMAR: Absolutely.
And for context, like what we've added
now is that if you have ops in your graph that
don't support quantization, we just leave them in float.
So we're already starting to get in the direction
of partial quantization, but that's exactly the direction.
And the piece that's kind of missing
is these two information hooks, where
one is what is quantization doing to your problem task--
like, your error for your actual problem.
We can get things, like signal-to-noise ratio,
but oftentimes that's not too representative of what
it's doing to your task problem.
So one thing we need is for this op,
what is it doing to the problem?
And then we can make decisions like this.
And the other thing is some pluggable specification
of hardware that says for this hardware,
does it even support hybrid, because then it's
not an option.
But yeah, that's exactly what we need to be working on.
AUDIENCE: Thank you.
[MUSIC PLAYING]