[MUSIC PLAYING]

MINGSHENG HONG: I'm Mingsheng Hong, tech lead and manager of the TensorFlow Runtime team. Today, I'm excited to share with you our new project, codenamed TFRT. As you probably guessed, it stands for none other than TensorFlow Runtime. Some people might tell you that, in the world of TensorFlow, the runtime is what keeps tensors flowing. I think that if the runtime does its job, you should never have to think about it. But since we are here talking about the runtime, let's first take a look at where it fits into the TensorFlow stack.

Here's a diagram of the training workflow. The runtime can be driven by eager APIs, and it can also execute graph programs produced by a graph compiler. The runtime is a low-level component that orchestrates all model execution by calling into the relevant kernels that implement machine learning primitives like matrix multiplications. We're building TFRT to replace the existing runtime, and let's first talk about why.

We talked to many of you, our TensorFlow users, and heard your pain points and requests. First, many of you are pushing the envelope in the performance and scalability of model processing, across both eager and graph execution. Second, you are making continuous innovations through the addition of ops, kernels, and devices to TensorFlow, and we need to make such extension work more streamlined and productive. And once you are done with model research and tuning, you'll want to deploy TensorFlow everywhere, across a diverse set of hardware platforms. For those reasons, we're building a new runtime to help you, one that provides the performance, extensibility, and unification that you are all looking for.

So how does TFRT fit into the workflow of an ML model? Here, we see the TF training stack again. Through the TensorFlow APIs, your program can eagerly dispatch ops to the runtime, as the blue arrows on the left side of the diagram show. Or, as the red arrows on the right side show, in the case of graph execution, your program first generates a computational graph, which gets lowered to an optimized, target-specific program and then dispatched to the runtime. The optimization and lowering work uses the MLIR compiler framework, which Jacques just spoke about in his MLIR talk. Finally, in both execution paths, TFRT calls into a set of kernels to complete the model execution, as the purple arrow shows. Again, the term kernel here refers to device-specific operations, like a GPU-based matrix multiplication. TFRT orchestrates efficient kernel execution over a heterogeneous set of hardware.

Now let's dive a little more into the technical design and look at how we realize the vision of building a performant, extensible, and unified runtime. First, to achieve high performance, we built a lock-free graph executor that supports concurrent op execution with low synchronization overhead. We have also made the eager op dispatch stack very, very thin, so that eager API calls go into the relevant kernels with minimal runtime overhead. Second, to talk about extensibility, let's first cover some background. The host runtime is the component that drives host CPU and I/O work, and it also drives locally attached devices through the device runtimes. TFRT keeps device runtimes separate from the host runtime, so that when you add a new device runtime, you don't have to extend the rest of the runtime.
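To make the two execution paths concrete, here is a minimal sketch using the public TensorFlow 2.x Python API. Nothing TFRT-specific appears in the code, since the runtime sits below this layer; the tensor shapes and the choice of matmul are purely illustrative.

```python
import tensorflow as tf

x = tf.random.normal([4, 256])
w = tf.random.normal([256, 128])

# Eager path: each op is dispatched to the runtime as soon as it is called.
y_eager = tf.matmul(x, w)

# Graph path: tf.function traces a computational graph, which the compiler
# stack can lower and optimize before the runtime executes it.
@tf.function
def matmul_fn(a, b):
    return tf.matmul(a, b)

y_graph = matmul_fn(x, w)

# Both paths ultimately dispatch the same device-specific kernel
# (for example, a GPU matrix multiplication) through the runtime.
print(float(tf.reduce_max(tf.abs(y_eager - y_graph))))
```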
The TFRT design also focuses on building common abstractions, such as shape functions and kernels, to be used in both graph and eager execution. This way, we get consistent behavior between eager and graph, and also avoid duplicated engineering effort.

Now, if you felt a bit lost in the last slide, don't worry about it. Let's step back and look at how these key design decisions will benefit the core TensorFlow use cases. For those of you who care about training, you will see improved performance as well as improved error reporting, and that should make it easier to debug your models. If you deploy TensorFlow models in production, you'll be glad to see improved performance and reduced CPU usage, and I will show you a benchmarking study shortly. TFRT will also support deployment across diverse hardware platforms. In the next couple of slides, I will show you some initial results on serving support.

TFRT is integrated into TensorFlow Serving to form a flexible, high-performance serving system for production environments. If you follow the orange arrow, it shows a pre-trained model being loaded into TFRT through the TensorFlow SavedModel API. The blue arrows show that serving clients can send requests to the model and get prediction results back. We expect this TFRT integration to be largely transparent to the end users.

So TFRT works for serving. How does it perform? In this benchmarking study, we used an MLPerf benchmark model, ResNet-50, and measured the performance of GPU inference over TFRT compared to the current stack. We chose FP16 and a batch size of 1 to focus the performance study on the runtime-related op dispatch overhead (see the sketch at the end of this transcript). Let's now look at the numbers. Can I have some drumrolls, please?

[DRUM ROLL]

Thank you. I should first note that the current runtime is already highly optimized for graph execution and serving needs. Over multiple runs, it had a respectable average inference time of 3.3 milliseconds. In comparison, TFRT had an average inference time of 2.4 milliseconds. Bam. There you have it. This is a handsome improvement of 28%, and there are more optimizations underway. Our internal testing also showed that TFRT compares favorably against alternatives to TensorFlow on this model. We are very, very excited about this. The performance improvements are due to the more efficient use of multi-threaded CPUs, the asynchronous runtime design, and a general focus on low-level efficiency. While your actual mileage might vary depending on your workloads, this encouraging result helps validate our initial work and prepares us for the ongoing push to make TFRT production-ready.

I know. You're excited too, right? And you're probably wondering, when can I have it? We will ship TFRT this year, and this is going to be an exciting journey. In addition to maintenance and selected enhancements to the current stack, we plan to build out and integrate TFRT with the TensorFlow stack. As we roll out TFRT, we will initially make it available through an opt-in flag, giving us some time to fix issues and fine-tune the performance. Eventually, it will become the default runtime. We also plan to open-source this project in the near future; we would love to get you all more involved. We will keep you updated on our progress through the developers@tensorflow.org mailing list, so please make sure to join it. And if you would like to learn more about the TFRT design, please join our deep-dive tech talk at the March 19 MLIR open design meeting.
The meeting is open to all members of the community. Thank you all, and we look forward to following up. [MUSIC PLAYING]
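As a supplement to the benchmarking discussion above, here is a hedged sketch of how a batch-1 ResNet-50 latency measurement of this kind might look from the public TensorFlow Python API. It is not the MLPerf harness or the TFRT setup described in the talk: the SavedModel path, warmup count, and run count are illustrative assumptions, the weights are left random so nothing is downloaded, and float32 is used for simplicity even though the talk's study used FP16.

```python
import time
import tensorflow as tf

# Export a ResNet-50 as a SavedModel, then load it back the way a serving
# system would. weights=None keeps the weights random so the sketch needs
# no download; weight values do not affect the timing.
model = tf.keras.applications.ResNet50(weights=None)
tf.saved_model.save(model, "/tmp/resnet50_savedmodel")
loaded = tf.saved_model.load("/tmp/resnet50_savedmodel")
infer = loaded.signatures["serving_default"]

# SavedModel signatures are called with keyword arguments named after their inputs.
input_name = list(infer.structured_input_signature[1].keys())[0]
image = tf.random.normal([1, 224, 224, 3])  # batch size 1, as in the study above

# Warm up so one-time tracing and initialization costs are excluded.
for _ in range(10):
    infer(**{input_name: image})

# Time repeated runs and report the average latency in milliseconds.
runs = 100
start = time.perf_counter()
for _ in range(runs):
    outputs = infer(**{input_name: image})
    list(outputs.values())[0].numpy()  # force a host sync so device work is counted
avg_ms = (time.perf_counter() - start) / runs * 1000.0
print(f"average inference time: {avg_ms:.2f} ms")
```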