[MUSIC PLAYING]

TRIS WARKENTIN: Hi, everyone. I'm Tris Warkentin, and I'm a product manager on TensorFlow Extended, or TFX.

ZHITAO LI: Hi, my name is Zhitao Li. I'm a tech lead manager on TFX Open Source.

TRIS WARKENTIN: At Google, putting machine learning models into production is one of the most important things our engineers and researchers do. But to achieve this global reach and production readiness, a reliable production platform is critical to Google's success. And that's the goal of TensorFlow Extended: to create a stable platform for production ML at Google, and a stable platform for you to build production-ready ML systems, too.

So how does that work? Our philosophy is to take modern software engineering and combine it with what we've learned about machine learning development at Google. So what's the difference between writing code and doing machine learning engineering? In coding, you might build something that one person can create end to end. You might have untested code, undocumented code, and code that's hard to reuse. In modern software engineering, we have solutions for all of those problems: test-driven development, modular designs, scalable performance optimization, and much more. So how is that different in machine learning development? Well, a lot of the problems from coding still apply to ML, but we also have a variety of new problems. We might have no clear problem statement. We might need continuous optimization. We might need to understand when changes in data will result in different shapes of our models.

We've been doing this at Google for a long time. In 2007, we launched Sibyl, our scalable platform for production ML at Google. Since 2016, we've been working on TFX, and last year we open sourced it to make it even easier for you to build production ML on your own platforms.

What does it look like in practice? TFX as an end-to-end platform spans everything from best practices to full end-to-end pipelines. On the best-practices end, you don't have to use a single line of Google-developed code to get some of the best of TFX; on the other end, end-to-end pipelines let you deploy ML at production scale. This is what a pipeline might look like: on the left side of the screen you see data intake, and then the data runs through the pipeline, with steps like data validation, schema generation, and much more, to make sure you're doing things in a repeatable, testable, consistent way and producing production ML results.

It's hard to believe our end-to-end pipeline offering has only been open source for one year, but we did a lot of interesting things in 2019, including building the foundations of ML Metadata, basic TensorFlow 2.0 support for things like Estimators, and the launches of Fairness Indicators and TFMA. But we're definitely not done. In 2020, we have a wide variety of interesting developments coming, including native Keras on TFX, which you'll hear more about from Zhitao later, as well as a TensorFlow Lite trainer rewrite and warm starting, which can make your machine learning training up to a hundred times faster by using caching.

But we have something really exciting that we're announcing today, which you may have heard about from Megan in the keynote: end-to-end ML pipelines. These are our Cloud AI Platform Pipelines.
We're really excited about these, because they combine a lot of the best of Google AI Platform with TFX to create Cloud AI Platform Pipelines, available today. Please check out our blog for more information; you should be able to find it if you just Google "Cloud AI Platform Pipelines." And now, can we please cut to the demo?

ZHITAO LI: I'll walk you through this demo. This is the Cloud AI Platform Pipelines page. As you see, you can view all your existing Cloud AI Pipelines clusters on this page. We've already created one, and this page can be found under the AI Platform Pipelines tab on the left of the Google Cloud Console. If you don't have a pipelines cluster yet, you can use the New Instance button to create one. This gives you a one-click experience for creating clusters, which used to be one of the more difficult jobs. You can use the Configure button to create a Cloud AI Pipelines instance on Google Cloud. This deploys Cloud AI Pipelines on Kubernetes, running on Google's GKE. You choose the GKE cluster it will run on, the namespace you want it created in, and a name for the instance. Once you're done, simply click Deploy, and that's it.

Since I already have a cluster, I will open up the Pipelines dashboard here. On this page, you can see a list of demo pipelines that you can play with, tutorials about creating pipelines and applying various techniques, and the Pipelines tab on the left, which shows all your existing pipelines. Since this cluster is newly created, there are no TFX pipelines in it yet. We are going to use the newly launched TFX templates to create a Cloud AI pipeline in this cluster.

This is the Cloud AI Notebook. I'm pretty much using it as a Python shell to run some simple Python commands. First, you set up your environment: make sure TFX is properly installed together with its other dependencies, that environment variables like PATH are set up properly, and that the TFX version is up to date. Next, make sure you have a Google Cloud project configured, and configure the Cloud AI Pipelines cluster endpoint by simply copying it from the URL into the notebook. We also create a Google Container Registry image repo so that we can upload our containers to it. Once that is done, we configure the pipeline name and the project directory.

Now we can use template creation to create a new pipeline from a template. Once it's created, I'll show the content generated by the template. As you see, there is pipeline code in the pipeline.py file. This includes our classic taxi pipeline from TFX, with all the components necessary to do production machine learning. There is also configs.py, with some configuration related to Google Cloud as well as some configuration for TFX itself. Once that is done, we enter the template directory and make sure all the template files are there. You can even run some pre-generated unit tests on the features to make sure the configuration looks right. Once that's done, you can use the TFX CLI to create a TFX pipeline on the Cloud AI Pipelines page. This builds a container image with all your code and dependencies, uploads it to GCR, and then creates a pipeline using this container image on the Pipelines page. As we see, the pipeline compiles and creation succeeds. We go back to the Pipelines page, click Refresh, and boom-- we have our new pipeline.
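As a rough illustration of what a template-generated pipeline.py contains, here is a minimal sketch of a TFX pipeline wiring together the standard components mentioned in the demo. It is not the actual template: component constructor arguments and import paths differ between TFX releases, and the paths, module file, step counts, and pipeline name here are placeholders.

```python
# Minimal sketch of a TFX pipeline definition, loosely modeled on the taxi
# template described above. Constructor arguments vary across TFX releases;
# names and paths here are placeholders, not the template's real values.
from tfx.components import (CsvExampleGen, StatisticsGen, SchemaGen,
                            ExampleValidator, Transform, Trainer,
                            Evaluator, Pusher)
from tfx.orchestration import pipeline
from tfx.proto import pusher_pb2, trainer_pb2


def create_pipeline(pipeline_name: str, pipeline_root: str, data_root: str,
                    module_file: str, serving_model_dir: str):
    # Ingest raw CSV data and convert it into TF examples.
    example_gen = CsvExampleGen(input_base=data_root)
    # Compute statistics and infer a schema, used for data validation.
    statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
    example_validator = ExampleValidator(
        statistics=statistics_gen.outputs['statistics'],
        schema=schema_gen.outputs['schema'])
    # Feature engineering with TF Transform; preprocessing_fn lives in module_file.
    transform = Transform(
        examples=example_gen.outputs['examples'],
        schema=schema_gen.outputs['schema'],
        module_file=module_file)
    # Train the model; run_fn in module_file exports a SavedModel.
    trainer = Trainer(
        module_file=module_file,
        examples=transform.outputs['transformed_examples'],
        transform_graph=transform.outputs['transform_graph'],
        schema=schema_gen.outputs['schema'],
        train_args=trainer_pb2.TrainArgs(num_steps=1000),
        eval_args=trainer_pb2.EvalArgs(num_steps=100))
    # Evaluate the trained model before deciding whether to push it.
    evaluator = Evaluator(
        examples=example_gen.outputs['examples'],
        model=trainer.outputs['model'])
    # Push a blessed model to the serving directory.
    pusher = Pusher(
        model=trainer.outputs['model'],
        model_blessing=evaluator.outputs['blessing'],
        push_destination=pusher_pb2.PushDestination(
            filesystem=pusher_pb2.PushDestination.Filesystem(
                base_directory=serving_model_dir)))

    return pipeline.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen, example_validator,
                    transform, trainer, evaluator, pusher],
        enable_cache=True)
```

The TFX CLI steps narrated in the demo (roughly, `tfx pipeline create` followed by `tfx run create`, pointed at the cluster endpoint) then compile a definition like this, build the container image, and register the pipeline with the cluster.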
Now, if we click through the pipeline, you are going to see all the TFX components here, readily available. We can create a test run on this pipeline and click Run. We are going to see each of the components: as they run, they gradually show up on the web page. The first component should be ExampleGen-- yes, there it is. This component has started running. You can click on it and, in its tab, look at the artifacts, the inputs and outputs, which Kubernetes volumes were used for the component, the manifest, and you can even inspect the logs of the component run. After ExampleGen come StatisticsGen and SchemaGen, and then the pipeline enters feature transform and example validation at the same time. Now all the data preparation is finished, and the pipeline enters the training stage, which produces a TensorFlow model. If we click on the Trainer component, we can inspect its logs as well. Once the Trainer is complete, we do model validation and evaluation using TFX components. And once all the model evaluation is done, we use Pusher to push the generated model to an external serving system. So you have a model ready to use in production.

You can also use the tabs on the left to navigate existing experiments, artifacts, and executions. We are going to take a look at the artifacts generated by this pipeline using the Artifacts tab. If you click on the model output artifact from the Trainer, that represents a TensorFlow model. This is the artifact's entry in ML Metadata: we can see it's a model artifact produced by the Trainer. And this is the lineage view of the model. It shows which components produced this model from which input artifacts, how the artifact is further used by downstream components that take it as input, and what outputs those downstream components generate. OK, that's all of the demo.

Now I'm going to talk about another important development in TFX, which is supporting native Keras. For those of you who are not very familiar with TensorFlow 2, let me recap a little of the history. TensorFlow 2 was released in Q3 2019 with a focus on providing a more Pythonic experience. That includes supporting the Keras API, eager execution by default, and Pythonic function execution. Here is a timeline of how TFX Open Source has been working on supporting all of this. We released the first version of TFX Open Source at the last Dev Summit, which only supported Estimator-based TensorFlow training code. At the last TensorFlow World, TensorFlow 2.0 was launched, and we started working on supporting the Keras API. In Q4 2019, we released basic TensorFlow 2.0 support in TFX 0.20: in that version, we supported the TensorFlow 2.0 package end to end, with limited Keras support through the Keras Estimator. And now, in the latest TFX release, we are releasing experimental support for native Keras training end to end.

So what does that mean? Let's take a deeper look. For data ingestion and analysis, everything pretty much remains the same, because TFDV, our data analysis library, is model agnostic. For feature transform, we added a new Keras-compatible layer to the TFT library so that we can transform features in a Keras model. This layer also takes care of asset management and model exporting. For training, we created a new generic Trainer executor which can run any TensorFlow training code that exports a SavedModel.
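To make the generic Trainer concrete, here is a hedged sketch of the kind of module file it might invoke: a `run_fn` that reads transformed examples, trains a small Keras model, and exports a SavedModel. The `fn_args` attribute names (`train_files`, `transform_output`, `serving_model_dir`, and so on) follow the 0.21-era taxi template but can differ between TFX versions, and the label key, model shape, and assumption that all features are scalar and numeric are invented for illustration.

```python
# Sketch of a Trainer module file for the TFX generic trainer executor.
# fn_args attribute names follow the 0.21-era taxi template; they may differ
# in other TFX versions. The label key and model are placeholders.
import tensorflow as tf
import tensorflow_transform as tft

_LABEL_KEY = 'label'  # hypothetical label name for illustration


def _input_fn(file_pattern, tft_output, feature_keys, batch_size=64):
    # Build a tf.data.Dataset of transformed features and labels from the
    # gzipped TFRecords produced upstream in the pipeline.
    spec = tft_output.transformed_feature_spec().copy()
    dataset = tf.data.experimental.make_batched_features_dataset(
        file_pattern=file_pattern,
        batch_size=batch_size,
        features=spec,
        reader=lambda files: tf.data.TFRecordDataset(files, compression_type='GZIP'),
        label_key=_LABEL_KEY)

    def _stack(features, label):
        # Assumes scalar, numeric features; stack them into one dense tensor.
        return tf.stack([tf.cast(features[k], tf.float32) for k in feature_keys],
                        axis=1), label

    return dataset.map(_stack)


def _build_keras_model(num_features):
    # A deliberately small model; a real template would use something richer.
    inputs = tf.keras.Input(shape=(num_features,))
    x = tf.keras.layers.Dense(32, activation='relu')(inputs)
    x = tf.keras.layers.Dense(16, activation='relu')(x)
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model


def run_fn(fn_args):
    """Entry point invoked by the TFX generic Trainer executor."""
    tft_output = tft.TFTransformOutput(fn_args.transform_output)
    feature_keys = [k for k in tft_output.transformed_feature_spec()
                    if k != _LABEL_KEY]

    train_ds = _input_fn(fn_args.train_files, tft_output, feature_keys)
    eval_ds = _input_fn(fn_args.eval_files, tft_output, feature_keys)

    model = _build_keras_model(len(feature_keys))
    model.fit(train_ds,
              steps_per_epoch=fn_args.train_steps,
              validation_data=eval_ds,
              validation_steps=fn_args.eval_steps)

    # Export a SavedModel for the downstream Evaluator, InfraValidator, and Pusher.
    model.save(fn_args.serving_model_dir, save_format='tf')
```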
This generic Trainer also covers training with the native Keras API. For model analysis and validation, we created a new Evaluator component, which combines both the evaluation and model validation capabilities. This new component supports native Keras out of the box. And finally, when it gets to model serving validation, we will release a new component called InfraValidator. This component can be used to verify inference requests against TensorFlow Serving binaries, to make sure any exported TensorFlow model can be used correctly in production, including anything trained with native Keras.

Now, let's take a look at a case from one of our partners, Concur Labs. Concur Labs is a team within SAP Concur that explores new ideas and builds prototypes. They help all the developers in SAP Concur use ML effectively in their solutions. To do this, they need a modern machine learning pipeline that scales securely on top of their data platform. They found that TFX with the native Keras Trainer allows them to do more. One of the success stories here is efficient BERT deployment. With TF and Keras, they can create simple models, using TF Transform for data pre-processing, state-of-the-art models from TensorFlow Hub, and straightforward model deployments with TensorFlow Serving. The applications here cover sentiment analysis and question-and-answer type problems. They also created TFX pipelines that build the pre-processing steps into the exported BERT models. We have just published a blog post on this, so feel free to check out the TFX blog. Another success story is TFX pipelines for TFLite models. They can create a TFX pipeline that produces two models: one in SavedModel format and one in the TensorFlow Lite version. This simplifies their pipeline-building process and reduces manual conversion steps.

One of the most important things for the future of our ecosystem is great partners, like all of you on the livestream. We hope you join us to help make TFX work for your use case. Some of the areas where we would love your help include portability, on-prem and multi-cloud deployment, Spark/Flink/HDFS integration, as well as data and model governance.

One of the great things about TFX is the wide diversity of ways it can be used. We have an exciting guest, Marcel from Airbus. He could not be here physically today, but he recorded a video to talk about one of the more interesting ways to use TFX, which is TFX in space. For more, here is Marcel.

[MUSIC PLAYING]

MARCEL RUMMENS: This is one of the most important moments for everyone involved in manned spaceflight. This is why we at Airbus are working hard and keep on innovating, to ensure everyone on that space station is safe and can return back to Earth, to friends and family. Hello everyone, my name is Marcel Rummens, and I have the honor to tell you how Airbus uses anomaly detection with TFX to ensure everyone's safety onboard the International Space Station. You might ask yourself: Airbus, space-- how does that fit? Well, Airbus actually has many different products, like commercial aircraft, helicopters, and satellites, and we are also involved in manned spaceflight. For example, the Columbus module, which is part of the ISS, was built and designed by Airbus and finally launched in 2008. It is used for experiments in space-- for example, in the fields of biology, chemistry, material science, or medicine. As you can imagine, such a module produces a lot of data.
To give you an idea, we have recorded between 15,000 and 20,000 parameters per second for the last 10 years, and every second we receive another 17,000 parameters. So we are talking about trillions and trillions of data points. But what does this data actually represent? Well, it could be any kind of sensor data, and we want to detect anomalies in it to prevent accidents. Let me give you an example. If you have a power surge in your electrical system, this could cause a fire. If you have a drop in temperature or pressure, this could mean that you have a hole in your module that you need to fix. So it is very important that we detect these anomalies and fix them before something can happen. Because just imagine for a second it is you up there, 400 kilometers above Earth, and you only have these few inches of metal and plastic to protect you from space. It's a cold and dark place without any oxygen. If something were to happen to this little protection layer of yours, you would be in a life-threatening situation. This is why we already have countless autonomous systems on board the spacecraft to prevent these kinds of accidents. But with more and more automation, we can handle more complex data streams, increasing the precision of our predictions. And it is not about safety alone. It is also about plannability and predictive maintenance, because the sooner we know that a certain part needs replacement, the sooner we can schedule and plan a supply mission, decreasing the cost of those launches.

How does this work? Well, right now this is a manual process, and our automation works in parallel with the engineers. Let me run you through it. We have our database on premise, storing all these trillions and trillions of data points. We use a Spark cluster to extract the data and remove the secret parts from it, because some things we are not allowed to upload. Then we use TFX on Kubeflow to train the model. First, we use TF Transform to prepare the data, and then we use a TF Estimator to train a model which tries to represent the normal state of a subsystem-- a state without any anomalies. Once we've done enough hyperparameter tuning and are happy with the model, we deploy it using TF Serving.

Now, here comes the interesting part. The ISS streams data to ground stations on Earth, which stream the data to our Apache NiFi cluster running in our data center. Here we remove the secret parts of the data again and then stream it into the cloud. In the cloud, we have a custom Python application running on Kubernetes which does the actual anomaly detection. It queries the model, and the model tries to predict the current state of a subsystem based on whatever it has seen in the past. Then, using its prediction and the reality coming from the space station, we can calculate a so-called reconstruction error. If this error is above a certain threshold, we can use it as an indicator of an anomaly. Now, if we have an anomaly, we create a report and compare it against a database of previous anomalies, because if something like this has happened in the past, we can reuse the information we have on it. The final step is to hand this over to an engineer, who will then fix the problem. This is very, very important to us, because we are talking about human life on that space station, so we want a human to make the final decision in this process.

But this is TF Dev Summit, so I want to have at least one slide about our model architecture.
We are using an LSTM-based autoencoder with dropout, and we replaced the inner layers-- the layers between encoder and decoder-- with LSTMs instead of dense layers, because our tests have shown that sequences just better represent the kind of information we have, producing fewer false positives.

What kind of impact did this project have? Well, we have been able to reduce our cost by about 44%. Some of this is projected because, as I said earlier, we are running in parallel right now. But the cost benefit mainly comes from the fact that our engineers can dedicate more and more time to more important, less repetitive tasks-- tasks for which you really need human creativity and intuition. Also, our response time has decreased. In the past, it could take hours, days, or sometimes even weeks to find and fix a problem. Now, we are talking about minutes, maybe hours. Another benefit is that we now have a central store of all anomalies that ever happened, plus how they were fixed. This is not just good for our customers, because we have better documentation, but also great for us, because, for example, it simplifies the onboarding process for new colleagues.

Next steps, looking into the future: we want to extend the solution to more subsystems and more products, like Bartolomeo, which is the latest addition to the Columbus module and is scheduled to launch later this month. Overall, these kinds of technologies are very important for future space missions, because as plans to go to the Moon and Mars become more and more concrete, we need new ways and new technologies to tackle problems like latency and the limited amount of computational hardware onboard the spacecraft. Coming back to TFX, we want to integrate more components-- for example, the model validator-- because we now have more labeled data, which allows us to actually do automatic model validation. And finally, the migration to TF 2, which is ongoing but very important to us, because of course we want to keep up and use the latest version of TensorFlow.

If you have any questions or want to learn more about the challenges we faced during this project, have a look at the Google blog. We will publish a blog post in the coming weeks that goes into more detail than I could in 10 minutes. Before I close, I want to especially thank [? Philip ?] [INAUDIBLE], Jonas Hansen, as well as everyone else on the ISS analytics team for their incredible work and support, and anyone else who helped prepare this talk. If you have any questions, feel free to find me on the internet and write me those questions-- I'm happy to answer them. I will also be available in the Q&A section of this livestream. Thank you very much and goodbye.

ZHITAO LI: Thank you, Marcel, for the great video. If you want to learn more about TFX, please check out our website, which has all the blog posts, user guides, tutorials, and API docs. Please also feel free to check out our GitHub repo with all the source code and engage with us on the Google Group.

[MUSIC PLAYING]
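The architecture Marcel describes-- an LSTM-based autoencoder with dropout and LSTM inner layers, with anomalies flagged when the reconstruction error of a telemetry window exceeds a threshold-- might look roughly like the following Keras sketch. The window length, feature count, layer sizes, training data, and threshold rule are invented for illustration and are not Airbus's actual configuration.

```python
# Hypothetical sketch of an LSTM-based sequence autoencoder for telemetry
# anomaly detection, in the spirit of the architecture described above.
# Window length, layer sizes, data, and threshold are illustrative only.
import numpy as np
import tensorflow as tf

WINDOW = 60      # timesteps per telemetry window (placeholder)
N_FEATURES = 8   # parameters per timestep (placeholder)


def build_autoencoder():
    inputs = tf.keras.Input(shape=(WINDOW, N_FEATURES))
    # Encoder: compress the window into a narrower latent sequence.
    x = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)
    x = tf.keras.layers.Dropout(0.2)(x)
    # Inner layers are LSTMs rather than dense layers, as in the talk.
    x = tf.keras.layers.LSTM(16, return_sequences=True)(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    # Decoder: reconstruct the original window.
    x = tf.keras.layers.LSTM(64, return_sequences=True)(x)
    outputs = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(N_FEATURES))(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='mse')
    return model


def reconstruction_errors(model, windows):
    # Mean squared reconstruction error per window.
    reconstructed = model.predict(windows, verbose=0)
    return np.mean(np.square(windows - reconstructed), axis=(1, 2))


if __name__ == '__main__':
    # Placeholder data standing in for windows of nominal telemetry.
    nominal = np.random.rand(256, WINDOW, N_FEATURES).astype('float32')
    model = build_autoencoder()
    model.fit(nominal, nominal, epochs=2, batch_size=32, verbose=0)

    # Flag new windows whose reconstruction error exceeds a threshold
    # derived from the nominal data (here: a simple percentile).
    threshold = np.percentile(reconstruction_errors(model, nominal), 99)
    incoming = np.random.rand(16, WINDOW, N_FEATURES).astype('float32')
    anomalies = reconstruction_errors(model, incoming) > threshold
    print('anomalous windows:', np.where(anomalies)[0])
```

In practice the threshold would be tuned per subsystem on held-out nominal data rather than taken as a fixed percentile, and the streaming application described earlier would apply the check continuously to incoming windows.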