
  • [MUSIC PLAYING]

  • MAKOTO UCHIDA: Hello, everyone.

  • My name is Makoto, a software engineer

  • on TensorFlow Enterprise, part of Google Cloud.

  • Now that we have seen the great story about TensorFlow

  • in production at work, and its cool use cases even in space,

  • I'm going to talk about enterprise-grade applications

  • with TensorFlow Enterprise.

  • So what does it mean to be enterprise-grade?

  • What is so different?

  • What is so difficult?

  • Well, after talking to many customers,

  • we have identified a few key challenges

  • when it comes to enterprise-grade ML.

  • First is scale and performance.

  • When it comes to production-grade enterprise applications,

  • oftentimes the size of the data and the scale of the model

  • are beyond what fits on a laptop or workstation.

  • As a result, we need to think about this problem differently.

  • Second is manageability.

  • When developing business applications,

  • it is better not to have to worry

  • about the nitty-gritty details of infrastructure complexity,

  • including managing software environments

  • and managing multiple machines in clusters, and whatnot.

  • Instead, it is desirable to only have

  • to concentrate on the core business logic of your machine

  • learning applications, so that they bring the most

  • benefit to your business.

  • Third is support.

  • If your application is business critical and mission critical,

  • timely resolution of bugs and issues, and a commitment

  • to stable support for your applications,

  • are essential to keep operating those applications.

  • TensorFlow Enterprise brings a solution to those challenges.

  • Let's take a look at cloud-scale performance.

  • In a nutshell, with TensorFlow Enterprise,

  • we compile and ship a special build

  • of TensorFlow, specifically optimized for Google Cloud.

  • It is purely based on the open source TensorFlow,

  • but it also contains specialized optimizations,

  • specifically for Google Cloud machines

  • and services, in the form of patches and add-ons.

  • Let's take a look at what it looks like in practice.

  • This code trains a model with potentially very large training

  • data, maybe terabytes of data,

  • maintained in Google Cloud Storage.

  • As you can see, it is no different

  • from any typical TensorFlow code written with the Dataset API,

  • except that the path to the training files

  • points to Google Cloud Storage.
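
The code on the slide is not reproduced in the captions. Below is a minimal sketch of the kind of pipeline being described, assuming TFRecord files under a made-up gs://my-bucket/train/ path and a toy feature schema:

    import tensorflow as tf

    AUTOTUNE = tf.data.experimental.AUTOTUNE

    # Hypothetical feature spec for the TFRecord files; adapt to your schema.
    FEATURES = {
        "x": tf.io.FixedLenFeature([10], tf.float32),
        "y": tf.io.FixedLenFeature([], tf.float32),
    }

    def parse_example(record):
        parsed = tf.io.parse_single_example(record, FEATURES)
        return parsed["x"], parsed["y"]

    # The only cloud-specific part is the gs:// path (the bucket name is made
    # up); the optimized GCS reader is used transparently for gs:// paths.
    files = tf.data.Dataset.list_files("gs://my-bucket/train/*.tfrecord")
    dataset = (files
               .interleave(tf.data.TFRecordDataset, num_parallel_calls=AUTOTUNE)
               .map(parse_example, num_parallel_calls=AUTOTUNE)
               .shuffle(10_000)
               .batch(128)
               .prefetch(AUTOTUNE))

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(dataset, epochs=10)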

  • Under the hood, an optimized I/O reader made specifically

  • for Google Cloud Storage keeps this performant even

  • with terabytes of training data,

  • making your training very fast.

  • This is another example, which reads training data

  • from a BigQuery table; BigQuery is a data warehouse which

  • may maintain hundreds of millions of rows

  • of business data.

  • This example is a little bit more involved, but still

  • similar to the standard Dataset API that all of you

  • are familiar with, so that

  • your model can still train in the familiar way,

  • but under the hood, the optimized BigQuery I/O

  • can read many millions of rows in parallel

  • in an efficient way.

  • It turns them into tensors so that your training can

  • proceed with full performance.
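
The BigQuery code itself is not captured in the captions either. A sketch of such a pipeline using the BigQuery reader from the tensorflow-io package (all project, dataset, table, and column names are made up; the exact signature may differ across tensorflow-io releases):

    import tensorflow as tf
    from tensorflow_io.bigquery import BigQueryClient

    client = BigQueryClient()
    read_session = client.read_session(
        "projects/my-gcp-project",          # billing project
        "my-gcp-project",                   # project that owns the table
        "transactions",                     # table
        "sales_dataset",                    # dataset
        ["amount", "quantity", "label"],    # columns to read
        [tf.float64, tf.int64, tf.int64],   # column types
        requested_streams=10,               # parallel read streams
    )

    def to_features(row):
        # Each row arrives as a dict of column tensors; split off the label.
        label = row.pop("label")
        features = tf.stack(
            [tf.cast(v, tf.float32) for v in row.values()], axis=-1)
        return features, label

    dataset = (read_session.parallel_read_rows()
               .map(to_features)
               .batch(128)
               .prefetch(tf.data.experimental.AUTOTUNE))
    # `dataset` can now be fed to model.fit() as usual.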

  • This is a little comparison of the throughput when

  • large data is read from Google Cloud Storage,

  • with and without the optimizations that TensorFlow Enterprise

  • brings in.

  • As you can see, there is a nice throughput gain.

  • The better I/O throughput actually

  • translates into better utilization

  • of processors such as CPUs and GPUs,

  • because I/O is no longer the bottleneck

  • of the entire training.

  • What this means is your training finishes faster

  • and your training wall time is shorter.

  • As a result, your cost of training

  • is actually lower, because the compute cost

  • is proportional to the wall time

  • for which you use the compute resources.

  • Now that you have some idea of what kinds of optimizations

  • we were able to make to TensorFlow,

  • specifically for Google Cloud, let's see

  • how you actually get it and how you actually

  • take advantage of it.

  • We do this through managed services.

  • We deliver TensorFlow Enterprise through a managed environment

  • which we call our Deep Learning Virtual Machine images

  • and Container images, where the whole environment is

  • pre-managed and pre-configured, on top of standard Linux

  • distributions.

  • Most important is that they have the TensorFlow Enterprise build

  • pre-installed, together with all the dependencies,

  • including device drivers and Python packages

  • in the correct version combinations,

  • and whatnot, as well as configuration

  • for the other services in Google Cloud.

  • Because these are just normal virtual machine

  • and container images, you can actually

  • deploy them in many different ways in the cloud.

  • Regardless of where you deploy it or how you deploy it,

  • the TensorFlow Enterprise optimizations

  • are just there, so you can take advantage

  • of all that good performance.

  • To get started, you only have to pick the TensorFlow Enterprise

  • image and the desired resources, such as CPUs

  • and RAM, or optionally GPUs,

  • and start the virtual machine with just one command.
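
The captions do not show the command, but on Google Cloud it is a single gcloud invocation along these lines (the instance name, zone, machine type, and image family below are assumptions; check the Deep Learning VM documentation for the current TensorFlow Enterprise image family names):

    gcloud compute instances create my-tf-enterprise-vm \
        --zone=us-central1-a \
        --machine-type=n1-standard-8 \
        --image-family=tf2-ent-2-1-cu101 \
        --image-project=deeplearning-platform-release \
        --accelerator="type=nvidia-tesla-v100,count=1" \
        --metadata="install-nvidia-driver=True" \
        --maintenance-policy=TERMINATE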

  • In the next moment, you can already

  • access a machine that has the TensorFlow Enterprise build

  • pre-installed and pre-configured,

  • ready to use, so that you can immediately

  • start writing your code on the machine.

  • If you prefer a notebook environment,

  • JupyterLab is hosted and already running in the VM.

  • The only thing you have to do is

  • point your browser at the VM,

  • open up JupyterLab, and open up a notebook,

  • so that you can start writing your TensorFlow code,

  • taking advantage of TensorFlow Enterprise.
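
One common way to reach the hosted JupyterLab, which Deep Learning VMs serve on port 8080, is an SSH tunnel (the instance name and zone are the assumed values from the earlier sketch):

    gcloud compute ssh my-tf-enterprise-vm --zone=us-central1-a -- -L 8080:localhost:8080
    # then browse to http://localhost:8080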

  • Once you have a satisfactory model

  • after many iterations of experimentation,

  • it is time to train your model at full scale.

  • It may not fit on one machine,

  • and you may want to take advantage of the distributed

  • training facilities that TensorFlow

  • offers, so that you can handle the large scale of the data

  • and the model.
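
A minimal sketch of multi-worker training with TensorFlow 2.1's distribution strategies (the model and dataset are the toy ones from the earlier sketch):

    import tensorflow as tf

    # Each worker discovers its role from the TF_CONFIG environment variable,
    # which AI Platform Training sets automatically on every node.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

    model.fit(dataset, epochs=10)  # `dataset` as built in the earlier sketch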

  • For this, AI Platform Training is a managed service

  • that takes care of the distributed training

  • clusters and all the other infrastructure

  • complexities on your behalf.

  • More importantly, it runs the same TensorFlow Enterprise

  • container image, which is exactly

  • the same environment you have used

  • to build your actual model, so you can be confident

  • that your model just trains, with the full scale of data,

  • under the managed training service.

  • You simply need to overlay your application code

  • on top of the TF Enterprise container image,

  • then issue one command to start a distributed training cluster.
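
Overlaying your code typically means building a thin image on top of a TensorFlow Enterprise Deep Learning Container. A sketch of such a Dockerfile (the base image tag and the trainer module layout are assumptions):

    FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-1
    COPY trainer/ /trainer/
    ENTRYPOINT ["python", "-m", "trainer.task"]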

  • This example grabs 10 workers, with a large machine

  • for each worker and 8 GPUs attached to each worker,

  • to train on a potentially large data set

  • for your [INAUDIBLE] applications.

  • It brings up a distributed training cluster

  • with all the TensorFlow Enterprise optimizations included,

  • distributed across the 10 workers.
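
The submission command is not shown in the captions. With AI Platform Training and a custom container, it looks roughly like this (job name, region, image URI, machine types, and GPU choice are all assumptions; one master plus nine workers approximates the 10-node cluster described):

    # config.yaml
    trainingInput:
      scaleTier: CUSTOM
      masterType: n1-highmem-96
      masterConfig:
        imageUri: gcr.io/my-project/my-trainer:latest
        acceleratorConfig:
          type: NVIDIA_TESLA_V100
          count: 8
      workerType: n1-highmem-96
      workerCount: 9
      workerConfig:
        imageUri: gcr.io/my-project/my-trainer:latest
        acceleratorConfig:
          type: NVIDIA_TESLA_V100
          count: 8

    gcloud ai-platform jobs submit training my_training_job \
        --region=us-central1 \
        --config=config.yaml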

  • Now that you can train your model at full enterprise

  • scale, it is time to make it an end-to-end pipeline

  • that keeps running in production,

  • taking advantage of AI Platform Pipelines and TensorFlow

  • Extended (TFX).

  • AI Platform Pipelines is actually

  • hosted on Google Kubernetes Engine,

  • and what this means is it can also run exactly

  • the same TensorFlow Enterprise container image,

  • so that all the optimizations are still there,

  • and you can still be confident that your application

  • in the pipeline just runs, because it

  • is all the same environment.
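
A minimal sketch of pointing a TFX pipeline at that same container image, in the 0.21-era TFX API (the pipeline name, paths, and image URI are made up, and TFX APIs have shifted across releases):

    from tfx.components import CsvExampleGen
    from tfx.orchestration import pipeline
    from tfx.orchestration.kubeflow import kubeflow_dag_runner
    from tfx.utils.dsl_utils import external_input

    # A one-component pipeline, just to show the wiring.
    example_gen = CsvExampleGen(input=external_input("gs://my-bucket/data"))

    tfx_pipeline = pipeline.Pipeline(
        pipeline_name="my_pipeline",
        pipeline_root="gs://my-bucket/pipeline_root",
        components=[example_gen],
    )

    # Run every step in the same TF Enterprise-based image used for training.
    runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
        tfx_image="gcr.io/my-project/my-trainer:latest")
    kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(tfx_pipeline)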

  • Once the end-to-end application runs in production,

  • enterprise-grade support becomes

  • essential to mitigate any risk of interruption

  • to the operation, and also to continue

  • operating your application in a business-critical manner.

  • Our way to mitigate this risk is to provide long-term support.

  • With open source TensorFlow, we typically

  • offer a one-year maintenance window.

  • For TensorFlow Enterprise, we provide three years of support.

  • That includes critical bug fixes and security patches.

  • And additionally and optionally, we

  • may backport certain functionality and features

  • from future releases of TensorFlow as we see demand.

  • As of today, we have TensorFlow Enterprise versions

  • 1.15 and 2.1 as our long-term supported versions.

  • If your business is pushing the boundaries of AI,

  • and if your business is sitting at the cutting edge of AI,

  • where novel applications and use cases

  • are critical to your business model,

  • and your business is heavily

  • reliant on being able to continue innovating

  • in this space, we actually want to work with you

  • through the white-glove service program.

  • We, the engineers and creators of both TensorFlow and Google

  • Cloud, are willing to work with your engineers and your data

  • scientists to fix any future bugs and issues that we

  • may not have seen yet, to support your cutting-edge applications,

  • to unblock you, and to advance together your applications, as

  • well as TensorFlow and TensorFlow

  • Enterprise as a whole.

  • Please check out the website

  • for the details of this white-glove service program.

  • Looking ahead, we are really excited to keep

  • working tightly together across the TensorFlow and Google

  • Cloud teams.

  • As the creators, experts, and owners of both

  • products, we will continue to make optimizations

  • to TensorFlow for Google Cloud.

  • That includes better monitoring and debugging capabilities

  • for your TensorFlow code that runs in the cloud,

  • as well as integration of these capabilities

  • into Google Cloud tooling, for the better productivity

  • of your applications.

  • We are also looking at smoother integration

  • between TensorFlow, popular high-level APIs such as Keras

  • and Keras Tuner, and the managed training services,

  • as well as even more managed services,

  • such as a serverless TensorFlow offering,

  • for the purpose of a coherent and joyful developer experience.

  • Please stay tuned.

  • This concludes my talk about TensorFlow Enterprise.

  • For more information and details,

  • please do check out the website.

  • Thank you very much.

  • [MUSIC PLAYING]
