TensorFlow model optimization: Quantization and pruning (TF World '19)

RAZIEL ALVEREZ: Hi, my name is Raziel. I lead TensorFlow model optimization. Today I will talk about our toolkit and, in particular, about techniques around quantization and neural connection pruning. First I'll introduce what model optimization is and what our toolkit is, and then the reasons why we think it's important, why we're investing in this area. Then I'll cover the tools that we have available. And at the end, I will give a quick overview of the roadmap for the short term and the longer term. Hopefully at the end of the presentation, we still have some minutes to go over Q&A.

So our toolkit implements techniques that should allow you to optimize machine learning models for deployment and execution. We think this is important because machine learning is everywhere, right. It's a very important field, and we think that there is a lot of room to make it more efficient. And this has some implications, both economic-- you can make all these applications better quality or cheaper to execute-- but also we can enable new models, new deployments, and new products that otherwise are not possible, even if you just tried to execute these machine learning models on servers.

So currently machine learning runs either on the server or on the Edge. On the server, you may think that there is a lot of capacity, that there's a lot of compute and memory. What is the benefit of optimizing these models? Well, applications are still bound by latency. That is still a very important metric for a lot of applications. Or you want to improve the throughput-- how many tasks can run on your server. And these two are also directly correlated to money, right. So everybody wants to save money, and potentially we're talking about a lot of money.

Now on the Edge, it's a little bit more obvious why we need optimization. These are very resource-constrained environments, even if you're just talking about applications in general. We need to deal with reduced memory and compute. Power consumption is typically an issue. Bandwidth matters too, both for downloading models from the Cloud and even within the chips, to be able to transfer parameters from memory to the processor-- this can be a problem if the model is too large. Plus, we have a wide variety of hardware, more than on the server, and we need to make sure that these models run efficiently on all these different types of hardware.

So it follows that if we are optimizing these models, and we have better models, eventually it starts translating into new products that otherwise couldn't exist if we were just running these models on a server. And these opportunities are larger than just smartphones. Machine learning is trickling down into more environments. We have machine learning models, for example, that are used to detect failures in machinery in factories, or we use them in self-driving cars. We use them in the office to scan documents and try to understand them. And just to give you some numbers, the size of the smartphone market is really a fraction of the potential of Edge devices in general. So basically those are the two reasons we want to make machine learning models efficient: it's already very important for servers, and it is pretty crucial for embedded devices.

So we started this toolkit about a year ago. We initially launched post-training quantization with this hybrid type of quantization, and I'll go into more detail later in the presentation.
Then earlier this year, we launched the API for neural connection pruning. Then we created this specification of quantized operations-- integer quantized operations-- for TensorFlow Lite, and we also launched post-training quantization targeting that specification. More recently, we added support for reduced float precision, and hopefully soon we're going to be launching our quantization-aware training API and also adding support to TF Lite for sparse computation.

So now I'll go into these techniques and these tools in a little bit more detail. Let's start with quantization. But first I think it's important that we have at least some basic understanding of what quantization is, why it's hard, and why we are approaching our tools the way we are. So let's start with a simple example. Matrix multiply is a basic operation for machine learning models. You have two matrices, two tensors A and B, then you do some multiplications and accumulations, and you get a third tensor, C. Each tensor is just a bunch of values, and they combine to produce the third tensor. Then, just a little reminder about how matrix multiply works: each one of the results in tensor C is computed as multiplications and accumulations. So if we look at one of them, and if we think about how we're training these models, typically in a higher precision-- let's say float 32-- then it follows that the operands of the multiplications are float 32 in precision. And then the product will also be a float 32, right. And then the accumulation will also be float 32. So this is fairly straightforward. There is some loss in precision, but machine learning is pretty good at dealing with it, at least at this level of precision. So, no problem.

Now what does this have to do with quantization? Well, let's go back to our goals for quantization. We want to be able to address all these restrictions, and we also want to be able to deploy to as much hardware as possible. So a common thing that we do is reduce the precision that we operate in. Let's say, for example, we go from 32-bit floats to 8-bit integers, and then we operate entirely with integer operations. And this is good because we are going from 32 bits to 8 bits, so we reduce the memory-- the models are four times smaller. Then integer operations are typically faster to execute, and they also consume less power. And because the parameters and also the dynamic values, the activations, are smaller, we reduce bandwidth pressure. It means that in the pipes in the chips, there is more room for things to flow around, which can also translate into faster compute and reduced power. And then integer operations seem to be a fairly common denominator across hardware: CPUs, DSPs, different NPUs, they all support integer operations.

So OK, we are going to reduce the precision-- how do we convert the 32-bit floats to 8-bit integers? Well, right now we do something very simple. We have this linear mapping, where we say, OK, we take the values from a tensor, we compute the minimum and the maximum value, and then based on that, we spread them evenly over the 8-bit range. This basically is very simple, right.
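To make that concrete, here is a small NumPy sketch of one flavor of this min/max linear (affine) mapping. The real TensorFlow Lite schemes have more details (per-axis scales, symmetric weights, and so on), so treat this as illustrative only:

```python
import numpy as np

def quantize_affine(x, num_bits=8):
    """Map float values to integers with a simple min/max linear (affine) mapping."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1   # -128..127 for 8 bits
    x_min, x_max = min(x.min(), 0.0), max(x.max(), 0.0)            # keep 0.0 exactly representable
    scale = (x_max - x_min) / (qmax - qmin) or 1.0                 # avoid dividing by zero
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    """Approximately recover the floats; the difference is the quantization error."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_affine(weights)
print("max abs error:", np.abs(weights - dequantize_affine(q, scale, zp)).max())
```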
So is that all that we need to do? Well, I wish. It's not that simple, so let's go back to the example. We have the matrix multiply, and now let's say that we quantized the values, so the operands of the multiplications are already 8-bit integers. Then the products now need 16 bits to represent them, and then you need to accumulate, and you probably want 32 bits to accumulate that, right. So what is the problem? Well, the problem is that now your output tensor C is full of 32-bit values. And that is not great when you want to feed it into another matrix multiply that you really want to execute in 8-bit integers, because, as we already said, they're more resource-efficient, right. So what do you do? You scale them back down. We just sort of quantize them on the fly, back down to 8-bit integers, so then you can feed them to your next 8-bit matrix multiply, and now it's all good.

But what are the implications of all this process? Well, it means that we're changing the static values, the parameters, the weights. We are also changing the dynamic values, the activations, because we're scaling them-- quantizing them on the fly. And it also means that we are changing the computation, right. In this case, it's a very simple example-- we just added a scaling operation-- but it can be a bit more involved.

So you could say, OK, that doesn't seem that hard, right. We just added another scaling operation, and it's easy, right. Well, some math is a little bit more complicated than that. This example actually is from layer normalization of an LSTM. And this one, aside from looking a bit more complex, is an example where, if you just apply these naive rules of operate, rescale, operate, rescale, you actually end up in sort of a numeric black hole where the scales cancel each other out, and things just don't work if you go about it naively. And then it's more complicated because of your decisions about how you're going to represent this computation in integer form. With quantization, we want to be efficient, so lower precision is good. But we also want to be accurate, which means lower precision is bad. So there are a lot of trade-offs that you have to make.

Then, further complicating things, we have heterogeneous hardware. There are all different types of hardware with different capabilities, different operations that each piece of hardware supports, and also different preferences. Some hardware is better at executing certain operations, or prefers floats with different bit widths, or has different restrictions. And we want to account for all of that when we are creating our quantized recipe, our quantized program.

Then there is the fact that machine learning is hard to interpret. We don't understand it-- we don't understand how it works-- not to the level that we can have good proofs to know that the transformation that we're doing to this model, to this program, will actually work and not result in a catastrophic error, right. You don't want to take a model, quantize it, and then have this model suddenly start giving you some weird results. So that makes it much more complicated to define these transformations.

And then finally, I will say that this is a little more complicated because the model is not enough. The program is not enough. Depending on how the quantization is defined, you might also need some extra data. In the matrix multiply example, we needed to compute the minimum and maximum values of the dynamic activations, and that can only be done if we run inferences through the model, which means that you need to provide some representative data.
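Pulling the matrix multiply discussion together, here is a rough NumPy sketch of the whole loop: 8-bit operands, 32-bit accumulation, and a rescale back down to 8 bits so the result can feed the next integer op. It uses a simplified symmetric scheme (no zero points) and an output scale of the kind that representative data would provide; the real TensorFlow Lite kernels do the rescaling with fixed-point arithmetic, so treat this as a sketch of the idea only:

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Symmetric per-tensor quantization: float32 -> int8 plus a scale (zero point fixed at 0)."""
    scale = np.abs(x).max() / (2 ** (num_bits - 1) - 1)    # map the max magnitude to 127
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def int8_matmul_requantized(a_q, a_scale, b_q, b_scale, out_scale):
    """int8 x int8 matmul with int32 accumulation, then requantize the result to int8."""
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)      # products fit in 16 bits, sums in 32
    real = acc.astype(np.float32) * (a_scale * b_scale)    # back to "real" values
    return np.clip(np.round(real / out_scale), -127, 127).astype(np.int8)

a = np.random.randn(8, 16).astype(np.float32)
b = np.random.randn(16, 4).astype(np.float32)
a_q, a_s = quantize_symmetric(a)
b_q, b_s = quantize_symmetric(b)
out_scale = np.abs(a @ b).max() / 127.0                    # in practice, estimated from representative data
c_q = int8_matmul_requantized(a_q, a_s, b_q, b_s, out_scale)
print("mean abs error:", np.abs(a @ b - c_q.astype(np.float32) * out_scale).mean())
```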
So that need for representative data is basically just another hurdle that you have to account for when quantizing these programs. Basically, this means that when we talk about quantization, we're really talking about rewriting, transforming this machine learning program into an approximate representation based on the operations that you have available.

So how are we addressing this in our toolkit? Well, the first thing that we decided to do was to try to scope down the problem and say, OK, we're going to define specifications for common operations-- like, in this case, the diagram for convolution-- that have a well-defined quantization behavior. So now, with this low-level information that is relevant to quantization, hardware can target those specifications, our tools can target that specification, and we can all work at this level. And then we also get the other benefit that, from the user's point of view, you can quantize a model, and this model can run on different hardware without any change.

So right now we support three different quantization types. I'm including reduced float here as a quantization type. It's just a much simpler thing where we typically go from float 32 to float 16 parameters and computations, so that's pretty straightforward. The next one is our hybrid quantization, which basically uses 8 bits for the parameters; biases and activations we leave as 32-bit floats. And then we try to be as smart as possible about how we execute this program, the goal being that, for example, heavy operations like big matrix multiplies stay in the integer domain, and then we use floating point for things like activation functions. So it's a nice trade-off between accuracy and performance. Then the third one is integer quantization. This means everything is integers: all the parameters are integers, and all the operations are integers. This is obviously the most complicated one.

So the benefit of reduced float is-- well, your models are now half the size. And then, depending on the hardware support, you may get some speed-ups, and the accuracy losses tend to be very minimal. It pretty much always works. I haven't seen, myself at least, an actual model trained in float 32 that doesn't work in float 16. Hybrid quantization then pushes it further. You now get a 4x reduction in size, and then, depending on the operations that you're using, you may get different performance improvements. The improvement tends to be larger for fully-connected models or RNNs. And then the third one is the integer-only quantization, so it has the same benefits as hybrid in terms of memory size, but it's faster to execute, and it has the great advantage of more hardware coverage. For example, some NPUs are only integer-based, like our Edge TPU.

Now let's talk about the tools to actually quantize the models based on those quantization types. We have two types of tools: one that works post-training, so it works directly on the trained model, and the other one, which is a work in progress, works during training. So let's talk about post-training first. The process is very simple. You basically start from a trained model-- it doesn't really matter how you trained it, you just have a TensorFlow model. Then currently, via the TensorFlow Lite converter, you just convert this model to TensorFlow Lite and quantize it on the fly.
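A minimal sketch of what that conversion looks like with the TF 2.x converter API; the little Keras model here is just a placeholder for whatever trained model you have, and flag names have shifted a bit between TensorFlow releases:

```python
import tensorflow as tf

# Placeholder for the trained model you already have.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])

# Convert to TensorFlow Lite and quantize on the fly during conversion.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # default path: hybrid (dynamic-range) quantization
tflite_model = converter.convert()

with open("model_hybrid.tflite", "wb") as f:
    f.write(tflite_model)
```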
And then you just have a model that you can execute on whatever hardware is supported for that quantization type. So now let's look at the specific quantization types.

The first one is reduced float. You just add a couple of flags: you use the default optimizations, and the type that you're targeting is float 16. Basically, this will take care of changing all the parameters and the computation, and again, depending on the hardware that you're running this model on, you might get a speed-up right now. For example, GPUs support float 16 natively, so you might get some speed-up there, either because of the computation or even just because the bandwidth in your chip will be reduced. Like I said, the benefits: the size goes down to half, and the accuracy drop is very minimal-- I would say within the noise.

Then the next one is our hybrid quantization. Again, this is very easy. You just set the flag-- this is the default for the TensorFlow Lite converter, you set it to the default optimizations. Again, it will make sure to quantize all the parameters, and operations that don't yet have a specification for their quantized form will be kept in their original precision. And then you will get some speed-ups, and you will be able to execute on whatever hardware complies with the specification. Typically, this one works pretty well for CPUs. And again, the benefits: 4x compression for the models, and you get some speed-ups. These are all convolution-based models, so that's why the speed-up is not as big, and I will say these are one-year-old numbers, so probably right now it's faster. And the same for accuracy-- accuracy is pretty good. Actually, we're working on some changes for convolution models, so it will be even a bit more accurate soon.

Then the third one is the integer quantization. This one is a bit more complex, because now you need to provide some data. So you say, OK, I want to optimize the model, but I want to use the integer quantization, so now you need to provide some data. And by data, I mean unlabeled samples of what your neural network will typically see in reality. So if it's an image processing model, you need to feed it some pre-processed images. And we're not talking about a lot of data-- for the results that I'm going to show next, we're just talking about a hundred samples. That works pretty well. So it is a bit more complicated, but it's not very complicated.
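Here is a rough sketch of those two other post-training paths-- reduced float 16 and full integer with a small representative dataset. The toy model and the random samples in `representative_data()` are stand-ins for your own trained model and preprocessed inputs, and the exact converter attributes can vary by TensorFlow version:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([              # placeholder for your trained model
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])

# Reduced float: float 32 -> float 16 parameters and computation.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
with open("model_fp16.tflite", "wb") as f:
    f.write(converter.convert())

# Integer quantization: needs ~100 unlabeled samples of what the network sees in reality.
def representative_data():
    for _ in range(100):
        yield [np.random.rand(1, 4).astype(np.float32)]   # replace with real preprocessed inputs

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```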
So these are some results from post-training quantization across different models. As you see, for the majority of models the loss is not that big with respect to the full-precision trained baseline. The only one I will call out is the MobileSSD model-- that has a bit more meaningful drop-- but again, a variety of models work pretty well with post-training quantization.

Now I'll talk about during training, because, like I showed in the previous results, there are still some models that will benefit from doing this quantization-aware training. And by quantization-aware training, we mean we try to emulate the quantization operations, the quantization losses, during the forward pass of the neural network, with the hope that the parameters will be tuned to account for that. The process for doing quantization-aware training using our API is a little bit more involved. We are, again, trying to make it very simple, so we built this API in Keras to make it very easy to use.

Basically, we assume that you already have a Keras model, and then you just need to call our API to apply the quantization. This might change a little bit, but it will look something like this. You just have a model that you already built using Keras layers, and the only thing that you need to do is call our API on your model. You then get a model that is rewritten to have all the emulation of quantization, and then you just call your fit function, and that's it-- you just train your model as usual. And then you can go through the TensorFlow Lite converter, and it will take this model that was trained with quantization-- it will have all the data necessary to quantize it-- and it will produce a quantized model that, just like the post-training model, you will be able to execute on different hardware.
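A rough sketch of that flow with the tensorflow_model_optimization package. The API was still settling at the time of the talk, so names like `quantize_model`, the toy model, and the random training data here should be treated as illustrative:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# A model you already built with Keras layers (placeholder architecture).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])

# Rewrite the model so the forward pass emulates quantization losses.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(optimizer="adam", loss="mse")

# Then just train as usual (random data stands in for your training set).
x_train, y_train = tf.random.normal((256, 4)), tf.random.normal((256, 1))
qat_model.fit(x_train, y_train, epochs=1)

# The converter picks up the ranges learned during training and emits a quantized model.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
```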
These are some preliminary numbers from quantization-aware training. As you see, the delta is a little bit better than post-training quantization. It's not a very big difference, except for the MobileSSD: before, it was a 4% drop with post-training quantization, and in this case it's 2.9%. So quantization-aware training is still a useful tool-- that's why we're building it.

Now you may wonder: those were a lot of quantization types and tools, so which one should I use? My recommendation is, if you are just starting, start with reduced float. That's the first one to try. It is very easy to use, it doesn't require any data, the accuracy will probably be the same, and for latency, depending on the hardware, you might get some benefits-- reduced latency. And then compatibility: basically, everywhere you can execute floating point operations, you will be able to use it. The next thing to try would be the hybrid quantization. Again, there are no data requirements. The accuracy will still be good-- probably not as good as float 16 in some cases, but still good-- and it will be faster than the reduced float. Compatibility will basically be everywhere that you have support for float and integer operations. The third one to try is the integer quantization with the post-training tool. This one is a bit more complicated just because you need to provide a little bit of data. The accuracy will be worse than or the same as hybrid, but the latency will be the fastest, and it will also give you more hardware coverage. And then the last thing to try would be the integer quantization with quantization during training. This one will be a little bit more involved, because now you're doing training-- you're supposed to have a training setup, a training script-- but the accuracy will be better than just doing the post-training version, and again, you get the benefits of being the fastest one and the one with the most hardware coverage.

So that was quantization. Again, we're trying to make all these tools very easy to use, so it would be great if you try them out and give us some feedback.

Then, connection pruning. So what is neural connection pruning? Well, the way that we have implemented it so far, it is a training-time technique that, during the training process, starts dropping connections from the neural network. The dropped connections basically just become zeros in the tensors that you're training, which means that you end up with sparse tensors. Now sparse tensors are great, because you can compress them and potentially execute them faster.

So this is an example of a tensor as it starts, randomly initialized. The dark values are values that are non-zero, and white means values that are zero. As the training progresses, it starts becoming sparser and sparser, and as you can see with this tensor, it's basically removing most of the parameters.

The process for the API is very similar to the quantization-aware training API-- again, we're trying to bring some consistency to our APIs. It's built on Keras, so it assumes that you have a model that is trainable in Keras, and then you're going to call our API to apply the pruning logic. Again, we are trying to make this as simple as possible. The only thing that you need to define is a pruning schedule-- basically, when you want to start dropping these connections, until when, and how fast, how aggressive you want the pruning to be. Then you just call our prune function, which again will modify your graph to add all the pruning operations internally, and then you just call your fit function and you train as usual.

So you train as usual, and then once you've trained, you have two options-- or soon you will have two options. You can just take the same model, the TensorFlow saved model, compress it with gzip, and the model will be smaller. And soon, you will be able to convert it via TensorFlow Lite, and you will also get a reduction in size and potentially some speed-ups, depending on what pruning configuration you're using and the hardware that you're targeting. This should be done pretty soon.
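A rough sketch of that pruning flow with the tensorflow_model_optimization package; the schedule values, toy model, and random data are placeholders, and the exact names may differ slightly from what was shown at the time:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([                      # placeholder Keras model
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])

# The pruning schedule: when to start dropping connections, until when, and how aggressively.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=1000)

# Rewrite the graph to add the pruning logic, then train as usual.
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)
pruned.compile(optimizer="adam", loss="mse")

x, y = tf.random.normal((256, 4)), tf.random.normal((256, 1))
pruned.fit(x, y, epochs=1,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])  # keeps the pruning masks updated

# Strip the pruning wrappers and save; the zeroed weights compress well with gzip.
final = tfmot.sparsity.keras.strip_pruning(pruned)
final.save("pruned_model")
```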
Now what are the benefits of pruning? We've tried it on a lot of tasks-- really a lot of tasks: image, speech, audio-- and it has worked pretty well. Unlike a lot of techniques that require hyperparameter tuning and, you know, careful restarting of your models and things like that, pruning has worked pretty well without a lot of babysitting. Then it has potential for speed-ups depending on hardware support. And we also have pretty good results: we can make a lot of the parameters basically go away-- we see 50% to 90% sparsity with negligible accuracy loss. And the other great thing is that it also works well with quantization. A typical setup that we've tried is pruning during training and then post-training quantization, and basically the accuracy is pretty good, and you get the compound benefits of both techniques.

These are some results, older now, from when we launched this. This is InceptionV3, where we can get almost all the way to 90% sparsity with relatively small accuracy losses. And the other is GNMT, neural machine translation, where again we can take it to almost 90% pruning, also with small accuracy losses. And we've done this, for example, for speech recognition. We actually had, recently, the Google Pixel event, where the speech recognition models used pruning and quantization, and we were able to have a model with server-side quality running on a phone, which is pretty good.

OK, so now I'll finally cover, really quickly, our roadmap. Like I mentioned, for quantization we're working on the quantization-aware training API, so that should be ready soon, and we are also working on our specs for quantizing RNNs like LSTMs, which are typically trickier to quantize. Then, I didn't include it there, but we're making some improvements to the hybrid quantization to be more accurate, particularly for convolution layers. And then for sparsity, we're adding support for sparse computation in the TensorFlow Lite runtime.

Longer term-- I don't know if you have heard about MLIR, but it's state-of-the-art compiler infrastructure, and it is particularly interesting to us because it's a better way for us to write these transformations. In the end, like I said at the beginning of the talk, we're taking a model and transforming one program into another representation of that program. One of the things that we want to enable is better hardware targeting. Our specifications are great because users can target one specification and execute on different hardware, but some users just want to [INAUDIBLE] hardware and get the best out of it. So we're hoping that, with the new infrastructure that we're building on top of MLIR, that should be possible.

And finally, I really just want to encourage you to try it out and give us feedback about what techniques you would like to see. In research, there are techniques popping up all over the place, and a lot of the work that we have to do is culling what's useful and what's not, what is general and what is very specific. So we would love to hear your feedback about that and also about the tools that we already have in the toolkit. We're trying to make them as easy as possible to use. We know that we still have a long way to go, but any feedback that you can provide will be really, really appreciated. And I think there is a little bit of time for questions if any of you have questions.

[CLAPPING]

Thanks.

AUDIENCE: Hi. I have a question regarding the [INAUDIBLE]. Hi. Thank you for the presentation. I have a question regarding the training with integer quantization. In the pipeline, is that going to be true quantization during training?

RAZIEL ALVEREZ: No. So, by "true," you mean that you expect all the operations to happen in the integer domain?

AUDIENCE: Yes.

RAZIEL ALVEREZ: Not right now. That's something I really want to enable, because I want to make training faster as well. But right now, the way that we are targeting it is-- I don't know if you're familiar with the TensorFlow APIs, but we have this low-level API, unfortunately called fake quantization, that basically just emulates these losses. Basically, what we do there is we quantize the parameters, then we de-quantize them, and then we do the float operation. So that's what we're using right now. But yeah, longer term, we want to do true integer forward passes.

AUDIENCE: Thank you.

AUDIENCE: Hi. [INAUDIBLE] Oh, I had just one question. So after you do the quantization, is there a way that you can also visualize the finished quantized model? Yeah, that was one question, and I had another question-- let me think about it. Oh, the other question was, what sort of tools are you going to provide to check model correctness-- I mean, at least to evaluate whether this quantized model is sort of functionally correct in a sense?

RAZIEL ALVEREZ: Yes. Visualization-- again, it depends where. But for TensorFlow Lite, you have a visualizer, so you can see the quantized model. I don't know if it will give you a lot of information, depending on what you're looking for.
We also want to make our tooling a bit better, because perhaps, for whatever reason, you want to dig in and start looking at the activations, and how they change, and all that.

AUDIENCE: Sure, yeah. There's, like, inserted ops and so forth.

RAZIEL ALVEREZ: Yeah.

AUDIENCE: [INAUDIBLE]

RAZIEL ALVEREZ: So for sure, with the TF Lite visualizer, you can see how the graph changes. On the second question, about correctness-- correctness is really tricky, because in my experience, the only thing that really works is to actually evaluate on the real data that you care to run your model on.

AUDIENCE: Yeah, that's right.

RAZIEL ALVEREZ: You know, we tried to do things like comparing norms to approximate-- the full-precision one versus the quantized one. And that gives you a sense of maybe some really catastrophic numerical errors, but otherwise, it's really just a guess, right.

AUDIENCE: That's right.

RAZIEL ALVEREZ: In particular, it depends on the output layers. Categories are easier to quantize, because the error is not very meaningful as long as you get the right category. Regressions are much harder, because now you really care about the actual values. Yeah, it's an open problem.

AUDIENCE: Yeah, it's a tough problem. Thank you.
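As an aside, here is a small sketch of that kind of rough numerical comparison: run the same batches through the float and quantized TFLite models and look at the worst-case output difference. As noted above, this only flags catastrophic errors; `eval_batches` is a placeholder for your own data, and it assumes the quantized model still takes float inputs (the converter default):

```python
import numpy as np
import tensorflow as tf

def run_tflite(tflite_bytes, x):
    """Run one batch through a TFLite flatbuffer and return the output tensor."""
    interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], x.astype(inp["dtype"]))
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])

def compare(float_tflite, quant_tflite, eval_batches):
    """Worst-case absolute difference between the float and quantized models' outputs."""
    diffs = [np.abs(run_tflite(float_tflite, x) - run_tflite(quant_tflite, x)).max()
             for x in eval_batches]
    print("worst-case output difference:", max(diffs))
```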
AUDIENCE: I have a question about the results from the GNMT training with induced sparsity. I was wondering if you had any insights on why the training with 80% sparsity would perform better than the original version-- like, if you looked at the results.

RAZIEL ALVEREZ: You know, the hand-waving thing that we always say in these cases is that some regularization happens.

[LAUGHTER]

Yeah. And you know, I've seen the same with some quantized models. I've never had the time to really sit down and try to understand what the reasons are for all this. Sometimes it's just because it's within the noise, right? It all depends on your evaluation set. If it's really not that big or not that meaningful, then these jumps are all possible. I've seen some models where, oh, it looks great after you quantize it, and then you throw in a new data set-- say, for speech recognition, noisier utterances-- and then you clearly see the difference between one and the other. So a lot of it can be just noise.

AUDIENCE: Hi. You mentioned explainability, and a technique could be, like, saliency maps. Do you have any insights on how these techniques affect the ability to calculate the gradients needed to compute the saliency maps, for example?

RAZIEL ALVEREZ: You know, that's something that we want to invest more in, and we haven't had that much time to do it. And I would love for the research community that is trying to understand neural networks to get more excited about understanding neural networks that have been approximated, but so far, I haven't had any luck trying to get the people on that side excited about it. But yeah, I really don't have anything meaningful to say, because I haven't run many experiments on it.

AUDIENCE: Thank you.

RAZIEL ALVEREZ: [INAUDIBLE].

AUDIENCE: Hey. So what is the best way to handle fragmentation of hardware? Like, quantization depends on the target hardware, and more often than not, on mobile phones like Android, you have so much [INAUDIBLE] hardware, so what are the best practices there?

RAZIEL ALVEREZ: So one way that we've tried to do it was, again, with these specifications. And I don't know to what extent that makes our hardware partners happy, because we would like to be able to target their hardware in the most precise and efficient way. But that's one way that we try to address it: with our knowledge of what hardware is out there and what is supported, we tried to create these specifications that try to accommodate everybody, which, again, is good and at the same time is bad.

Then longer term-- again, I don't want to say too much, because I really don't have a very concrete plan to share-- but part of what we're building with the MLIR infrastructure is meant to let us better target that hardware: to better partner with hardware vendors, understand their hardware capabilities, and better create these transformations that target that hardware. We are really trying to make it much better.

AUDIENCE: So for now, does it mean, like, you go with the lowest common denominator, maybe like a [INAUDIBLE]? Like, imagine an Android app that you have to deploy to a lot of things.

RAZIEL ALVEREZ: And that's why we have all these different quantization types-- we have three types, right. And soon, hopefully, we'll be able to just mix and match those different types, because at the end of the day, it's a very arbitrary boundary when we say, oh, this one is all integer quantized, and this one is hybrid. The reality is we should be able to take advantage of mixing and matching precisions to get something better. Thank you.

AUDIENCE: I have a question about pruning. As a general rule, in layers, operations are converted to matrix multiplies because of their efficiency. With pruning, you're now passing in individual multiply operations one by one. There must be some crossover point-- you need to prune by 10%, 15%, 20% before you cross over and actually get an improvement. Any thoughts on where that is?

RAZIEL ALVEREZ: I don't know if this is exactly what you're asking, but, for example, our pruning API supports specifying what the pruning structure is. We know that for CPUs, the [INAUDIBLE] instructions will typically have registers that can accommodate 16 values, so if you want a speed-up on CPU, we expect you to set the setting to say, oh, I want to prune in blocks of, say, 1 by 16. And that's how we can get the speed-ups on CPU, for example. Unfortunately, right now it's probably going to be hardware dependent, but that's one thing that you can do right now.
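For reference, a sketch of how that block structure can be expressed with the pruning API; the (1, 16) block size mirrors the CPU-register example above, and as with the other snippets, the exact parameters are illustrative rather than a guaranteed recipe for speed-ups:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Prune a layer in 1x16 blocks so whole groups of 16 weights are zeroed together,
# matching registers that hold 16 values at a time.
pruned_layer = tfmot.sparsity.keras.prune_low_magnitude(
    tf.keras.layers.Dense(128),   # placeholder layer
    pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(0.5, begin_step=0),
    block_size=(1, 16),
)
```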