Inside TensorFlow: TF Model Optimization Toolkit (Quantization and Pruning)

SUHARSH SIVAKUMAR: We'll get started. So hi, everyone, I'm Suharsh, and I'm here to talk about the TensorFlow Model Optimization Toolkit, where we have techniques for quantization and pruning. And feel free to ask questions or interrupt along the way. I want this to be super interactive. So what we're going to talk about today is the high level of what quantization is, the challenges it poses, and why it matters. And then more of the specifics on what we are doing in TensorFlow to work on quantization and pruning.

So overall, quantization, the idea is that you have your floating point network with your inference graph, which is a floating point program. And we're going to make modifications to this program, in the general sense that we take these floating point calculations and make them lower precision. And the goal is to get as close in accuracy as possible while providing some performance improvements. So usually, this involves-- this is very general-- there's some function from the floating point to the integer value. There's a process to do the conversion to make it valid for a particular hardware. And then there are various algorithms we have to get the parameters needed for this function in the most efficient way. So this is really general, and it may not make sense now, but we'll make it more specific later. Mhm?

AUDIENCE: Do the same conversion functions work for mobile devices as well as specialized hardware?

SUHARSH SIVAKUMAR: No, and that's one of the challenges. And we'll get to all the challenges. That's a really good question [INAUDIBLE]. Mhm?

AUDIENCE: I had another question. Will you also be motivating soon why this is not as simple as a downcast from float to int [INAUDIBLE]?

SUHARSH SIVAKUMAR: Yes, it'll all make sense, I hope.

AUDIENCE: We're obviously very interested.

SUHARSH SIVAKUMAR: So why does this matter? So the first thing is that ML programs have lots of parameters, and by using lower precision, we can instantly get these models a lot smaller, which can help with memory bandwidth and network costs of downloading models. Second, if you have all your calculations in integers, you can have lots of optimizations that make the execution super fast. Third, integers are super power efficient. So on mobile, this is really important. And then finally, this lets us explore a whole new avenue of hardware design, where we can make custom chips-- like Seastar was the first, then Edge TPU, and the new TPUs are getting integer operations. And this can get us cheap, power efficient, fast hardware. [INAUDIBLE] So--

AUDIENCE: I think it helps if instead of saying integer operations, you say fixed point [? fraction ?] operations. [INAUDIBLE]

SUHARSH SIVAKUMAR: So I avoid it because it's only kind of fixed point. It's not like-- it is fixed point, but when I-- so I've said fixed point in the past, and then folks always say it's not truly fixed point, because fixed point applies a rescale every time you combine two values, and sometimes I get pushback. So I'm going to avoid it-- because I used to say quantization, and then people would say there's a hundred steps of quantization. So the integers are the key here, because that's what's providing the acceleration that's specific to what we're doing in the TensorFlow stack. And the specifics, I guess, will make sense after we go into the equations.

So why is quantization hard? And this was your point that we have different chips.
So each chip has its own specific tradeoffs it chose to make. Some may only support int8, some may support int16, some may want power of 2 rescales. All these really one-off decisions make the deployment story-- how do you take a general TensorFlow program and put it on one of these chips-- really hard. For float, we've started to get to a world where we can just say float can run anywhere. But for these things, there's not a lot of standardization on how to do this. The second reason it's hard is it often requires custom tooling, because you need extra metadata that often can only be gathered by running inferences to know how to quantize the values. And we'll get more into that in detail. So there's often an extra step in the process. And then finally, for every specific ML problem, we don't have a good answer for how quantization will affect it. You can use the same architecture, but just do something else for your particular task with the outputs of that architecture, and quantization may help or hurt. And it's pretty empirical right now, where we just try it and see. And we're still in the process of gathering a lot of examples. But one of the goals we need to work on in ML research is to understand these models more to determine how quantization error will impact things.

So now more into the detail. So currently, what most hardware implements and what the TensorFlow and TensorFlow Lite stack implement is affine quantization, which is like us milking y equals mx plus b from seventh grade for the rest of our lives. [LAUGHING] So basically, you uniformly distribute your range into fewer chunks than you had before, and then bucketize them. And this is effectively what all quantization is. Currently, we have different ways of gathering statistics to determine how to quantize. So going back to this picture for a second, we need some sort of min and max value to know how to quantize. So this implies that we need tooling to get this information. And we have two types of tooling right now. There's during-training tooling, where you can incorporate this as part of your training pipeline. At the end of the day, you have a trained model that has information on how to quantize it. And you can also do this post-training, and we'll talk about the trade-offs later.

AUDIENCE: I have a question. So why are you [INAUDIBLE] not beyond the boundary of your possible values? Why do you choose-- or do you purposefully choose to leave some values out?

SUHARSH SIVAKUMAR: So, yeah. The min and max, it's kind of an open question what the optimal min-max is given a tensor, if I answered the question right. So you could choose to put your min-max much smaller than the actual values seen, and you'll get some clipping. And depending on the model and the problem, we wouldn't really know if it's useful or not. Because sometimes models don't care about those extraneous values, and sometimes they're the most important thing in the whole model.

AUDIENCE: The tricky thing is that when you set your min and max, and if you're using int8, you only have 255 values between the min and max. Every activation has to be cast into one of those 255 values. If you [INAUDIBLE] minus infinity [INAUDIBLE] plus infinity, that's really useless. But if your min is 0 and your max is 0.01, you can represent computations with a lot of precision, so it's the trade-off.

SUHARSH SIVAKUMAR: Yeah. And we do different types of these depending on the model.
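To make that affine mapping concrete, here is a minimal numpy sketch of the quantize/dequantize round trip, with made-up min/max values and a plain uint8 grid; the real TensorFlow Lite kernels differ in details like rounding modes and per-channel scales.

```python
import numpy as np

def affine_quantize(x, x_min, x_max, num_bits=8):
    """Map floats in [x_min, x_max] onto the integer grid [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2**num_bits - 1
    scale = (x_max - x_min) / (qmax - qmin)        # float size of one bucket
    zero_point = int(round(qmin - x_min / scale))  # integer that maps to float 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-0.8, 0.0, 0.3, 2.5], dtype=np.float32)
q, scale, zp = affine_quantize(x, x_min=-1.0, x_max=1.0)
print(q, dequantize(q, scale, zp))  # 2.5 falls outside the chosen range and clips to ~1.0
```

Values inside the chosen min-max only suffer the rounding error of one bucket, while values outside get clipped, which is exactly the rounding-versus-clipping trade-off being discussed here.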
And we've seen weird things where-- and it's always this battle between how much the network cares about these extreme values versus how much it cares about the average rounding error along the way. So it's always this rounding versus clipping trade-off-- we just play with this a lot.

AUDIENCE: You mentioned min and max being primarily influenced from training. But I'd like to also do this at [? inference-- ?] there's a constant feedback loop from--

SUHARSH SIVAKUMAR: So it's training or post-training. So post-training might influence it at, like, model compilation time, and it doesn't stop.

AUDIENCE: Could I just clarify? So the point is that it's no-- it wouldn't be considered quantization if you just reduced float32 to float16, for example-- float8 or whatever. So you still have a separate exponent and you just have kind of fewer bits. That's not considered quantization?

SUHARSH SIVAKUMAR: So technically, it is. By the textbook term of quantization, it is quantization. But the quantization we're talking about here is this integer quantization where you have a shared min-max.

AUDIENCE: Where you really don't want it to have that--

SUHARSH SIVAKUMAR: We're using that scale.

AUDIENCE: --[INAUDIBLE] the exponent. So that's the only thing that's useful for the hardware [INAUDIBLE].

SUHARSH SIVAKUMAR: Exactly. So in other, like, DSP literature, it's sometimes called "block floating point," where you have the exponent shared across all values of a tensor rather than one exponent per element. So in a way, float is just per-element quantization. Yeah.

So during training, the idea of during-training quantization is that you want to somehow get this network to be robust to the error that quantization introduces. So you emulate the effect of quantization in the forward pass. So if you ever see these TensorFlow fake_quant operations, or the contrib quantize rewriter tool, this is its goal. It's saying, given a graph, we'll rewrite the forward pass to emulate the error due to quantization, and then in the backward pass, we'll do some tricks to skip over those non-differentiable parts that quantization introduces. And then the goal is that backprop will magically make the weights better for quantization. And this can often get the best accuracy given a particular schema of quantization, but it's also really hard to train sometimes. And machine learning, as we all know, is the art of making as few changes to your training as possible to get it to converge. And the second you do more, oftentimes you won't even converge if you go to too low of a precision. And you just have to play around a lot if you try the training route. Additionally, the error introduced in training is specific to a particular target. So if you want the result of your training to be portable and work across many different chips, you're kind of in trouble now if they have different characteristics.

AUDIENCE: So by emulating quantization, does that mean on the forward pass after every op, you just apply the quantization?

SUHARSH SIVAKUMAR: Yeah, and it's a bit trickier than after every op, because it's after every rescale that the hardware expects. So a specific example is, like, in TensorFlow, you have conv, bias add, ReLU. In most of these inference backends, those are fused into one fat conv-bias-ReLU. And your rescales are only at the inputs of the conv and the outputs of the ReLU. So you should only emulate quantization there.
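As a rough illustration of that placement, here is a hedged sketch using TensorFlow's fake_quant op on a conv/bias/ReLU pattern; the function name and the hard-coded min/max values are placeholders for illustration, not what the rewriter tooling actually emits.

```python
import tensorflow as tf

# Sketch: emulate quantization only at the boundaries where the backend rescales
# (the input of the fused conv + bias + ReLU and its output), not after every op.
def fused_conv_bias_relu_with_fake_quant(x, kernel, bias):
    x = tf.quantization.fake_quant_with_min_max_args(x, min=-6.0, max=6.0, num_bits=8)
    y = tf.nn.conv2d(x, kernel, strides=1, padding="SAME")
    y = tf.nn.relu(tf.nn.bias_add(y, bias))  # fused into one kernel at inference
    # One fake-quant at the fused op's output; in the backward pass its gradient
    # behaves like a straight-through estimator, treating the rounding as identity.
    return tf.quantization.fake_quant_with_min_max_args(y, min=0.0, max=6.0, num_bits=8)
```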
So you kind of need knowledge of what the target's expectations are to decide where to put it. So it's not just before and after every op.

AUDIENCE: And do you just use the current running max and min?

SUHARSH SIVAKUMAR: Yeah. So right now, we do a moving average. For certain models, we played with absolute min and absolute max. And it's really-- sometimes, we use schedules to slowly, manually constrain it. And this is where the art part comes in, and it's not really well understood how to do that generally. So right now, for all the mobile-- like, all the vision models, we do moving average. And it seems to work pretty well, but we don't know if that's optimal or not. It just turns out backprop is kind of magical.

AUDIENCE: And in backprop, you don't apply this at all?

SUHARSH SIVAKUMAR: In backprop, we use this thing called a "straight-through estimator." The main problem with this quantization is it's a step function, so it's not differentiable. So we pretend it's an identity, and we just pass the gradient right through, and this gets it to train.

AUDIENCE: It works in practice.

SUHARSH SIVAKUMAR: Mhm?

AUDIENCE: And just to clarify, there's never a case where quantization is used in training just to speed up the training. It's only used in training because of the idea that it would speed up inference. Is that correct?

SUHARSH SIVAKUMAR: So there is some work. I don't know if it's ever used in practice, but there have been a few papers over the years that do do quantization for speeding up training as well. But this particular one is always-- everything in this talk, the goal is for inference. And so this is purely to emulate what's happening at inference. And oftentimes, it will be slower to train these models than to actually just train in floating point.

AUDIENCE: And just to be sure, I thought more than actually speeding up inference, the goal with quantization during training is to actually reduce errors?

SUHARSH SIVAKUMAR: Yeah, to reduce the accuracy that you lose when you eventually go to inference. But the ultimate goal of this whole tooling is to enable inference performance for some particular hardware. So that being said, we've been trying to work really hard to avoid the need for this in most general cases. During training will always be the most accurate, because you're letting that effort make up for it, but we think we can get pretty far with post-training techniques. With post-training, the trade-off is that you can't rely on this magical, huge hammer of backpropagation to fix all your accuracies, but you can do some things. And additionally, the main benefit is that the user doesn't have to retrain, which is a pain because oftentimes it won't converge, you have to mess with hyperparameters, and your portability is gone. So here, there's a compile step. Or sometimes, like you were saying, even at runtime, there's a step to collect these statistics to do that min-max.

So the second technique we have-- so we'll get back to quantization for the majority of the talk, but I just want to mention pruning. So the other technique we have is pruning, where the goal is to end up with tensors in your model that have many zeros. So if you do arbitrary pruning, where your resulting model has many zeros, it's much more compressible. And additionally, if you have a certain structure to your pruning, or a certain percentage of sparsity, you can have optimized kernels that accelerate things.
So the benefit is that you have so many repeated values in them that you can just zip your file and you're good to go. And then if you actually have hardware support for sparsity, you can get faster kernels. And one more point on pruning, which I think is kind of cool, is that all the zeros-- since you have so many repeated zeros, and zero is represented exactly in quantization, it actually works really, really well with quantization, and often helps quantization. They're compressing in two orthogonal ways, which is kind of neat.

So now we'll talk about all the tools. So yeah, last year, we released this Model Optimization Toolkit, which is a suite of TensorFlow and TensorFlow Lite tools that aim to make all these techniques doable, and let us play around with trying out new things with quantization and pruning. So you can check that out here. So here's my world famous hand. This went on Twitter, and this is my hand [INAUDIBLE].

AUDIENCE: You have tweeted your hand, I think.

SUHARSH SIVAKUMAR: Yeah, that's true. [LAUGHING] We've been reusing these pictures way too much. So we have quantization and sparsity. So first, we'll deep dive into all the tools in quantization, in a bit more detail on how we actually do quantization. So the first thing we've done in TensorFlow Lite is try to understand, for many of the canonical models, all the operations that are in there, and what some standard recipes are for implementing these fixed point quantized kernels. And the goal here is that we want some sort of endorsement for a new hardware that comes in. And we know that this is going to be a work in progress, because new chips are coming all the time. They have different constraints, and they don't want to listen to one standard. But we want to be like some reference point where we can compare-- oh, this new quantization scheme, how does it compare to this? So the goal with this is to have a bunch of CPU reference ops that have been tried on many models, and we understand them to some extent.

So this is a bit more detail on how we actually do the quantization. So the bottom number line is the floating point scale, and that histogram is a pretend distribution of values in a particular tensor. And the idea of quantization is, instead of wasting all our bits representing this range that we don't even use, let's figure out only the part that the histogram lies in, and only represent that with a smaller number of bits. So the top number line is the integer equivalent of that, where we took that histogram and we just use these 255 buckets to represent the number line. So this is just that same affine equation. At inference time, we actually change this min-max to two different things called "scale" and "zero point." Scale is the floating point size of every bucket, and zero point is an integer value that corresponds exactly to floating point 0. And this turns out to be really important. [? C ?] started to do this, and it resulted in a lot of bias issues, where for every multiply accumulate you have, if you don't represent 0 exactly, you just push this bias. And then it also has a convenient thing of-- oftentimes in the models, we do padding, and zero is just a special number that we have to represent. But the main thing is the accumulation thing. So this is just to give some insight into why these tools actually need this information. So we won't go too much into depth here, but here's the summary of our quantization spec.
And we have per-axis symmetric weights, per-layer asymmetric activations, and then the zero point is-- all these things are in a signed integer value. And I'll explain each of these, actually, because right now it won't make any sense. So the first part of the specification is symmetry. And the idea here is, do you want to make your scale able to represent values that are really not centered around 0? And this often means that that zero point-- I'll go back to the equation real quick-- that zero point here, do we want to have the cost of that addition? And depending on where this happens in your math, it can be really expensive or not too expensive. And so for symmetry, we've decided to make weights symmetric, and the reason is that since weights are constants, a weight zero point would get multiplied by the dynamic activations. So this is a cost that you'd have to pay that's dependent on the input every time. So with weights being asymmetric, every inference has an additional cost. And so weights being symmetric avoids this whole zero point multiplication with activations. And we can answer more later, but I won't go too much in depth here. So it's faster if we make weights symmetric. And activations, they're only multiplied by a constant value, so having them have this zero point is not too expensive. So we leave them asymmetric, and the activations are often [INAUDIBLE] and stuff, which are super asymmetric. So we'd be throwing away a bit if we don't do that.

So the second thing we can play around with in quantization is the granularity at which we decide to have these min-maxes or scales. And traditionally, we were doing per-layer quantization-- or per-tensor quantization. For a given tensor, you only have one min-max. But it turns out that for convolutions and [INAUDIBLE] convolutions, often each channel of the convolution has a really different distribution. And when you only have one scale or one min-max for the entire tensor, you're doing a really poor job on each of these distributions. So the idea of per-channel quantization is you have a min-max per channel. And since this is not in the inner loop of your kernels, it's really not too expensive, and gets a huge benefit in accuracy-- effectively like an extra bit.

So now to the tools. So the tool fragmentation is all about how we get these min-max values that we need to do the quantization. And so for weights, it's super easy. Weights are static, so we can just look at the weights at any time, read the min-max, and quantize using those min-maxes. So the problem always comes in dynamic values and activations, where you can only get an idea of the distribution by actually running realistic inputs. So the first, most naive, simplest idea on how to do quantization is, let's read the ranges the second we know them, which is right at inference. So during runtime, our graph is actually different. Before our expensive multiplies, or matmuls, we take the float input value, measure the min-max, and use those to quantize on the fly. So this is like an O(n) operation of quantizing on the fly. Then you get the speedup of doing an int8 by int8 multiply on your matmul, and then go back to float at the end. So the idea here is you get the most realistic min-max range for your activations, because you're using the one for this particular inference. The flaws are that you can only really do this on chips that have float support.
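Here is a toy numpy sketch of that on-the-fly ("hybrid") flow, under the simplifying assumption of symmetric scales and no zero points; the function and shapes are made up, and the real TF Lite kernels are considerably more involved.

```python
import numpy as np

def hybrid_matmul(x_float, w_int8, w_scale):
    """Weights are already int8; activations are quantized on the fly per inference."""
    a_scale = np.max(np.abs(x_float)) / 127.0           # measure this input's range (O(n) pass)
    x_int8 = np.round(x_float / a_scale).astype(np.int8)
    acc = x_int8.astype(np.int32) @ w_int8.astype(np.int32)  # integer matmul, int32 accumulator
    return acc.astype(np.float32) * a_scale * w_scale    # back to float at the end

x = np.random.randn(1, 64).astype(np.float32)
w = np.random.randn(64, 32).astype(np.float32)
w_scale = np.max(np.abs(w)) / 127.0
w_int8 = np.round(w / w_scale).astype(np.int8)
print(np.max(np.abs(hybrid_matmul(x, w_int8, w_scale) - x @ w)))  # error is small relative to the float result
```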
The second time we could do this is-- if we want the whole graph to be integer, we want to avoid this runtime cost of measuring the min-max, because we don't want any float on any edge of this graph. So what we can do is simply move that to compile time. And so you have your float model, and we want to do some post-training figuring out of what the values are for all these dynamic values. So to do this, we need some representative data that we can run through the model, collect ranges then, and then fix those min-max values for the activations. And this means that we're not using the perfect min-max, like we were for hybrid quantization before. But we are working on getting a representative one, and we never have to have float in our inference graphs, so this can go to all those integer accelerators.

AUDIENCE: So wait, I had a question kind of related to the previous slide. So the choice of whether to do hybrid or not, is that multifaceted, based on improving accuracy because now you get better min-maxes, but also the hybrid needs to support the float biases, right?

SUHARSH SIVAKUMAR: Yeah. So it's really problem specific. So we'll get a little bit into that later as well, but the short answer is, yes, it's multifaceted in that it usually is a good choice if you're going to CPU. It's a bad choice if you have models that have large activations. Like, image models don't get a huge benefit from hybrid, because your cost of doing this on-the-fly quantization is pretty big. And then accuracy really improves for models with small activations, because you're kind of getting a more representative range for that small tensor.

AUDIENCE: And also if you want truly low latency inference, maybe it's harder [INAUDIBLE].

SUHARSH SIVAKUMAR: Yeah. Mhm?

AUDIENCE: I was going to ask, how much [INAUDIBLE] do you get from the hybrid approach? And that's pretty expensive if you have to--

SUHARSH SIVAKUMAR: Yeah. It can be, and it really depends on the model. So I think we have some specific numbers. But it really shines in models that are kind of memory bound, because your main cost is this n cubed thing. Your activations may not be too big, but you're getting this huge benefit of really driving that matmul.

So then the third tool is integer-only quantization-- or during-training integer-only quantization. So this results in the same compatible graph as the post-training integer quantization in the previous slide, but the difference is we're introducing the quantization into the training, like we talked about before. So we're working on Keras APIs [INAUDIBLE]. So the way this will look is you build your model as before, and you just wrap it in this quantize wrapper. And there are parameters too. We won't go into too much detail. For hybrid quantization, the way it looks is you train your normal TensorFlow graph, and then you just enable a flag in the TF Lite converter. So right now, we have hybrid and the post-training only enabled in TF Lite, because we want to make it general, but right now we only have specifics on the hardware capabilities of TF Lite, and we need to know these to be able to do this. So the way this looks is your normal TF Lite converter invocation, and you just add this optimizations default flag. And under the hood, this is just doing this hybrid quantization of quantizing all the weights and leaving the activations in float.
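In code, that hybrid path is roughly the following sketch, assuming the TF 2.x converter API; "saved_model_dir" is a placeholder path.

```python
import tensorflow as tf

# Hybrid / dynamic-range quantization: weights become int8 at conversion time,
# activations stay float and are quantized on the fly at runtime.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open("model_hybrid.tflite", "wb") as f:
    f.write(tflite_model)
```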
So, performance. First off, all these approaches get a similar model size reduction, in that you're simply taking 32 bits down to 8 bits, so you're getting a 4x reduction in size. For latency, like here, we do get a speedup in these image models, but for a lot of them, we don't see as much of a speedup as we would expect from quantization. And it's because the on-the-fly cost is actually pretty high.

AUDIENCE: What hardware is this? Is this just like a CPU?

SUHARSH SIVAKUMAR: This is all CPU. So on accelerators, this will be-- the integer ones will really shine on custom accelerators. So accuracy, we do see an accuracy drop in a lot of these models. And a lot of this, we are working on ways to nudge weights at different times during compilation to fix these accuracy issues. And so for all of these, this is not the gold standard of what quantization can get with these techniques. It's just a starting point. So yeah, 4x reduction. You see a 10% to 50% speedup in convolution models on the CPU. And then for memory bound models, you really see a lot more. And you often get most of the bang for the buck of quantization from hybrid in those models, versus needing the full integer. That being said, for accelerators, you'd still need to go the full integer route.

So, post-training integer quantization. So this is also enabled in TF Lite. You train in TensorFlow the normal way you would a float graph, and then you provide one more option to the converter. And the way that looks is you use the same flag as before-- the optimizations default flag. But now we need some data to figure out those dynamic ranges at compile time rather than at runtime. So this data generator you provide needs to yield examples that you would expect to see in practice. And so for, like, image models, we just grab a few images from [INAUDIBLE]. And usually, we see a couple hundred works well enough, but it's probably very problem specific. So under the hood, this is doing that post-training quantization where we measure the absolute min and absolute max we see for particular activations. Mhm?

AUDIENCE: Why would a hybrid model be [INAUDIBLE]? I mean, ultimately, you still have inferences coming in, so even if maybe the first one-- like, the first 1,000 is slow, after 1,000, you definitely have those statistics. Why would you ever not just [INAUDIBLE] at that point?

SUHARSH SIVAKUMAR: That's the question, yeah. And so--

AUDIENCE: [INAUDIBLE]

SUHARSH SIVAKUMAR: You could do that. So oftentimes, it turns out, for like the RNN models, you actually get an accuracy benefit from hybrid, because--

AUDIENCE: Even if you had a bunch of data?

SUHARSH SIVAKUMAR: Even if you had a bunch, because each activation actually is getting a really unique range.

AUDIENCE: Because it's float.

SUHARSH SIVAKUMAR: Yeah. And also because you can imagine in an RNN, that same op is actually going to change its distribution based on which time step you're on. And so it really ends up being problem specific there. But you're right, for like image models, we absolutely could be doing that. So yeah, the example of representative dataset is just how you would normally load data. And you just yield examples of these images.
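Concretely, the post-training integer path looks roughly like this sketch; "saved_model_dir" is a placeholder and "calibration_images" stands in for whatever preprocessed numpy arrays you would normally feed the model.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # A few hundred realistic inputs is usually enough for calibration.
    for image in calibration_images[:200]:
        yield [np.expand_dims(image.astype(np.float32), 0)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset  # fixes activation ranges at compile time
tflite_model = converter.convert()
```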
So now some numbers. So before we had released this, the contrib quantize rewriter-- which I'm not talking about in this talk, because it's deprecated in favor of a more friendly 2.0-capable API-- those were kind of the gold standard in quantization accuracy numbers for these image classification problems. And what we've seen is that with these changes of per-channel in our quantization scheme, post-training integer quantization, which is the right column, gets pretty comparable on all these models that matter at the moment. And this is without anything fancy. So Denali has been looking into a lot of cool tricks for figuring out where the accuracy is going in post-training. So these numbers should be improving as well. But the takeaway here is that for most things-- 8-bit-- maybe we're good enough with post-training, and only the experts really need to use quantization-aware training.

So this is an example of quantization not working well-- the first column. With SSD, it's the same base structure as MobileNet, but what you're doing with your logits is a lot more. So quantizing actually introduces a lot more error here, and we see over a percent drop in post-training versus quantization-aware training. And this "higher is better" is wrong. [LAUGHING] The other two columns are new models, and no one ever went about doing quantization-aware training there, because it was just too much work, and because they tried post-training. These were released after post-training was released, and post-training did really well accuracy-wise, so they just didn't bother with quantization-aware training. More models. Style transfer, we got good results on quantization, although there's not really a good metric for style transfer. The metric is like, look at it and it looks good enough. And then some speech models do really well. Everything's great. [INAUDIBLE]

So the benefit of post-training integer quantization is a similar size reduction, and a similar speedup on the CPU for RNNs and convolutions-- even better for convolutions, because you don't have the on-the-fly cost. But the main thing this enables is all these integer microcontrollers, all these integer accelerators-- we can now run on them. So here's the summary of the three tools. And the flow should usually look like: you try hybrid, you see how you do on CPU. If you want to go to an accelerator or you want more on CPU, you do the post-training, where you just add some representative dataset. And then only as a last resort, once you see post-training not getting good accuracy for you, try quantization-aware training.

So similarly, we have tools for connection pruning, which are during-training techniques. And so they have a similar API to the quantization-aware training API. And so the flow usually is you build your Keras model, you apply the pruning API, you train. And often, these pruning APIs are doing a lot less-- they're very localized to your weights. So they're not really tearing apart your graph like quantization is. And [INAUDIBLE] can attest to this, where the pruning was a lot simpler implementation-wise than quantization-aware training, because for quantization, you have to understand all the fusions of your backend, whereas pruning is local to the weights. And so the flow here is you train like normal, and the resulting graph has many tensors that have lots of zeros. And right now, the flow is that you can compress your file and it's smaller.
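Here is a hedged sketch of what that Keras pruning flow can look like with the tensorflow_model_optimization package; the model, schedule numbers, and commented-out training call are illustrative only (the analogous quantization-aware-training entry point is tfmot.quantization.keras.quantize_model).

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Prune the smallest-magnitude weights, ramping up to 75% sparsity over training.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.75, begin_step=0, end_step=10000)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)
pruned_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# The UpdatePruningStep callback advances the pruning masks during training:
# pruned_model.fit(x_train, y_train, epochs=2,
#                  callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before export; the weights now contain many exact zeros.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```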
And in the future, we're working on TensorFlow Lite runtime support for these sparse tensors and kernel support. So additionally, you'll get out-of-the-box size reduction instead of having to do this manual compression, and you'll get speedup--

AUDIENCE: For the sparsity, in the future, when you say that it might be faster, is it in the case of structured sparsity where you force [INAUDIBLE] sparsity, or is it for arbitrary sparsity that it might--

SUHARSH SIVAKUMAR: So it really-- yeah, so this is something where we're trying to figure out two things: for a given hardware, what do we want, and how do we expose this in a way that makes sense when there's so much fragmentation across hardware and problems. So for certain problems, if you do arbitrary sparsity, you probably need like 99.9% sparsity to get a speedup on a particular hardware. And for CPUs, and particularly speech models, we've already been doing structured sparsity with certain block sizes, like you're saying. And so this training tool has the ability to set your block size. And what we need to work on in the future is, for a given hardware, what is the standard block size you need for that. And so yeah, you're absolutely right. There's fragmentation too. It's like, will the problem allow the level of sparsity that you desire, and is the hardware you target going to support that? So yeah, for CPU, usually [INAUDIBLE].

So the API here is similar to the quantization API. You provide parameters on your schedule for how you want to prune. And here, that final sparsity is an important number. It's basically saying, at the end of training, how many of the values in all your weights do you want to be 0? So yeah, [INAUDIBLE]. The coverage we found is very-- it works on a lot of models. It seems to be a very general technique. And as I said before, it works really well with quantization as well.

So here's a graph that's kind of a confusing graph. But it's how your accuracy is affected-- this is an example on MobileNet-- based on how much pruning you do. And what we noticed is that there is often a lot of pruning you get for free, and then there's a sudden cliff. So the goal here is, for your problem, to play with the parameters and figure out where that sudden cliff is, or where you want to lie on this curve. So here, we see it around, like, 75-ish percent. You're doing pretty well until then.

AUDIENCE: What technique was used to actually do the pruning?

SUHARSH SIVAKUMAR: So here, we do pruning based on low magnitude values. So there's a mask, and then you update that occasionally, and the mask is updated based on which values of your tensor are closer to 0. So more numbers-- works great. Skip.

And then, so in summary, quantization is hard because it is problem specific, hardware specific, and the tools have lots of trade-offs depending on which problem and which hardware. And then pruning, we're starting to get into the space of accelerating pruning. And right now, it's a great technique for reducing model size, and we need to explore how that's going to look for various hardware, and how we're going to expose this in a general way. So otherwise, any questions I can answer?

AUDIENCE: So how does a CPU actually do the-- make this multiplication given two inputs with a min and max for each?

SUHARSH SIVAKUMAR: So the way it looks-- I don't know if I have anything to look at. [INAUDIBLE] Yeah, I'll try that.

AUDIENCE: It's like, the min and max are--

SUHARSH SIVAKUMAR: So I'll say it in words first. If it doesn't make sense, I can try to find something.
So the way it actually works at inference-- so let's ignore zero point for a moment, because it just gets in the way. So say we're just doing a matrix multiplication. So your input has a certain range, which corresponds to a particular scale. Your second input, your weight, has a certain range, which corresponds to a certain scale. So you have one scale, another scale, and then your output has a third independent range on a third scale. So what we do is, your int8 matrix multiplication actually gets accumulated into int32 values. So if you imagine all those int32 values in the accumulator, they have an implicit scale-- because you just multiplied-- of these two scales multiplied together. If you wanted to recover the float value from these int32 values, you'd just multiply by these two scales. So that's not how it actually works, but I'm just explaining the math. And so then our goal is to eventually output int8 values that lie on the output scale. So what we do in practice is, we want to get from this int32 value that has an implicit scale of this scale and this scale-- s1 and s2-- and go to s3. So we just multiply by s1 and s2 and divide by s3. So we make a new scale that's those three values-- that fraction. So in practice, the inference just looks like int8 times int8 into int32, do this one rescale, which is s1 times s2 over s3, and then you're at your [INAUDIBLE] value, if that makes sense. I could-- yeah.

AUDIENCE: So you don't have to do an integer division?

SUHARSH SIVAKUMAR: Yeah. And so that rescale is a floating point value, and we don't want to do that. So we do decompose that into two integers, and sometimes a shift, depending on-- sometimes, your target only supports power of 2 scales, because it just wants to implement that as a shift. So there's lots of-- that's a whole other thing, where there are lots of ways to implement that rescale, [INAUDIBLE] trade-offs. So what TF Lite does by default is we decompose it into two integers, and do like a-- we almost emulate float.

AUDIENCE: [INAUDIBLE] being used in training. Is that consistent with what-- I mean, [INAUDIBLE] in training-- could you describe it with something orthogonal?

SUHARSH SIVAKUMAR: Yeah. There are a lot of techniques that we need to start including. And so right now, these techniques have been these kind of end-to-end, get-something-working type techniques. Where first, there was no training, so we went to quantization during training. But more and more, like [? wesde, ?] I think, you're talking about. Like there's this [? wesde, ?] where the idea is, if you're given a particular min-max, what is the perfect range-- perfect distribution of values to decrease quantization error? And the answer is the uniform distribution. So [? wesde ?] tries to do this by introducing a loss into your training. And these things are all compatible when you train the float model, but they're not offered out of the box. Because we have noticed things-- in some of my experiments, I noticed that [? wesde ?] only works well for a particular model after you've trained for a bit. We still don't have general knowledge on when exactly to use it. So we should be offering all of these, and we plan to in this toolkit, as choices for users. But yeah, that's a great technique.
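To put numbers on that, here is a small numpy sketch of that rescale; the scales are made up, and following the earlier convention that scale is the float size of a bucket, the single combined multiplier is s1 * s2 / s3 (which TF Lite then decomposes into an integer multiplier plus a shift rather than using a float).

```python
import numpy as np

s1, s2, s3 = 0.05, 0.02, 0.004                 # input, weight, and output scales
x_q = np.array([[10, -3]], dtype=np.int8)      # quantized input
w_q = np.array([[7], [4]], dtype=np.int8)      # quantized (symmetric) weights

acc = x_q.astype(np.int32) @ w_q.astype(np.int32)       # int32 accumulator, implicit scale s1*s2
multiplier = (s1 * s2) / s3                              # one fixed rescale onto the output scale
y_q = np.clip(np.round(acc * multiplier), -128, 127).astype(np.int8)

print(acc, y_q, (x_q * s1) @ (w_q * s2) / s3)  # the float reference lands close to y_q
```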
AUDIENCE: I'm sorry, [INAUDIBLE] asking [INAUDIBLE] question. You mentioned there's training and post-training techniques, and then also you can do it hybrid or you can do pure int. So in this quadrant, one kind of option was missing. So you didn't-- you showed three examples, but you implied that you wouldn't be doing in-training quantization combined with the hybrid. Why is that?

SUHARSH SIVAKUMAR: You're absolutely right. And it's just that the tooling is not doing that right now. But that's exactly the direction we want to go: get some metrics on what error the quantization is introducing, and use that to drive things like, should we be doing one or the other? Should we be doing 8 bits, or should we-- for this one tensor, does it make sense to leave it in float, does it make sense to bump it up to 16? But that's absolutely right, where--

AUDIENCE: So there's nothing inherently wrong with it. It's just another option [INAUDIBLE]?

SUHARSH SIVAKUMAR: Absolutely. And for context, what we've added now is that if you have ops in your graph that don't support quantization, we just leave them in float. So we're already starting to get in the direction of partial quantization, but that's exactly the direction. And the piece that's kind of missing is these two information hooks. One is, what is quantization doing to your problem task-- like, your error for your actual problem. We can get things like signal-to-noise ratio, but oftentimes that's not too representative of what it's doing to your task. So one thing we need is, for this op, what is it doing to the problem? And then we can make decisions like this. And the other thing is some pluggable specification of hardware that says, for this hardware, does it even support hybrid-- because then it's not an option. But yeah, that's exactly what we need to be working on.

AUDIENCE: Thank you.

[MUSIC PLAYING]