[MUSIC PLAYING]

CLEMENS MEWALD: My name is Clemens. I'm the product lead for TensorFlow Extended, the end-to-end machine learning platform that we built for TensorFlow. We have a lot of exciting announcements, so let's jump right in.

A lot of you may be familiar with this graph. We published it in a paper in 2017. And the main point that I usually make on this graph is that there's more to machine learning than just the training part. In the middle, the trainer piece, that's where you train your machine learning model. But if you want to do machine learning in production reliably and in a robust way, you actually need all of these other components before, after, and in parallel to the training algorithm.

And often I hear, sometimes from researchers, "Well, I really only do research. I only care about training the machine learning model, and I don't really need all of these upstream and downstream things." But what I would argue is that research often leads to production. And what we want to avoid is researchers having to re-implement their hard work, the model that they've built, when they want to put that model into production. That's actually one of the main reasons why we open sourced TensorFlow: we really wanted the research community to build models in a framework that we can then use and actually move into production.

A second comment that I hear often is, "Well, I only have a very small data set that fits on a single machine. All of these tools are built to scale up to hundreds of machines, and I don't really need such heavy tools." But what we've seen time and time again at Google is that small data today becomes large data tomorrow. And there's really no reason why you should have to re-implement your entire stack just because your data set grew. So we really want to make sure that you can use the same tools early on in your journey, so that the tools can grow with you, your product, and your data, and you can scale the exact same code to hundreds of machines.

So we've built TensorFlow Extended as a platform at Google, and it has had a profound impact on how we do machine learning in production and on becoming an AI-first company. TFX really powers some of our most important Alphabet companies. Of course, Google is just one of them. TFX is used at six different Alphabet companies, and within Google it's used by all of the major products, and also by all of the products that don't have billions of users [INAUDIBLE] this slide. And I've said before that we really want to make TFX available to all of you, because we've seen the profound impact it has had on our business. And we're really excited to see what you can do with the same tools in your companies.

So a year ago we talked about the libraries that we had open sourced at that point in time: TensorFlow Transform, the training libraries, Estimators and Keras, TensorFlow Model Analysis, and TensorFlow Serving. And I made the point that, back then as today, all of these are just libraries. They're low-level libraries that you still have to use independently and stitch together to make them work for your own use cases. Later that year, we added TensorFlow Data Validation. So that made the picture a little more complete, but we were still far away from being done.
However, it was extremely valuable to release these libraries at that point in time, because some of our most important external partners have also had a profound impact with some of them. We've just heard from our friends at Airbnb; they use TensorFlow Serving in the case study that they mentioned. Our friends at Twitter just published a fascinating blog post on how they used TensorFlow to rank tweets on the home timeline. They used TensorFlow Model Analysis to analyze that model on different segments of the data, and TensorFlow Hub to share some of the word embeddings that they used for these models.

So coming back to this picture. For those of you who've seen my talk last year, I promised everyone that there would be more. Because, again, this is only a partial platform. It's far away from being an end-to-end platform; it's just a set of libraries. So today, for the very first time, we're actually sharing the horizontal layers that integrate all of these libraries into one end-to-end platform, into one end-to-end product, which is called TensorFlow Extended.

But first, we have to build components out of these libraries. At the top of this slide, you see in orange the libraries that we've shared in the past. And in blue, you see the components that we've built from those libraries. One observation to be made here is that libraries are very low level and very flexible, so with a single library, we can build many different components that are part of a machine learning pipeline. In the example of TensorFlow Data Validation, we used the same library to build three different components. And I will go into detail on each one of these components later.

So what makes a component? A component is no longer just a library. It's a packaged binary or container that can be run as part of a pipeline. It has well-defined inputs and outputs. In the case of Model Validation, those are the last validated model, a new candidate model, and the validation outcome. That's the well-defined interface of each one of these components. It has a well-defined configuration. And, most importantly, it's one configuration model for the entire pipeline, so you configure a TFX pipeline end to end.

And some of you may have noticed: because Model Validation needs the last validated model, it actually needs some context. It needs to know what the last model was that was validated. So we need to add a metadata store that provides this context, that keeps a record of all of the previous runs so that some of these more advanced capabilities can be enabled. So how does this context get created? In this case, the trainer produces new models. The Model Validator knows about the last validated model and the new candidate model. And then downstream from the validator, we take that new candidate model and the validation outcome. If the validation outcome is positive, we push the model to the serving system; if it's negative, we don't, because usually we don't want to push a model that's worse than our previous model into our serving system.

So the Metadata Store is new. Let's discuss why we need it and what the Metadata Store does. First, when most people talk about machine learning workflows and pipelines, they really think about task dependency. They think there's one component, and when that's finished, there's another component that runs.
However, all of you who actually do machine learning in production know that we also need data dependency, because all of these components consume artifacts and create artifacts. And as the example of Model Validation has shown, it's incredibly important to actually know these dependencies. So we need a system that's both task and data aware, so that each component has a history of all of the previous runs and knows about all of the artifacts.

So what's in this Metadata Store? Most importantly, type definitions of artifacts and their properties. In our case, for TFX, it contains the definitions of all of the artifacts that are consumed and produced by our components, and all of their properties. And it's an extensible type system, so you can add new types of artifacts if you add new components, and you can add new properties to these artifacts if you need to track more of them. Secondly, we keep a record of all of the executions of the components. And with each execution, we store all of the input artifacts that went into it, all of the output artifacts that were produced, and all of the runtime configuration of that component. Again, this is extensible. So if you want to track things like the code snapshot that was used to produce that component, you can store it in the Metadata Store as well.

Putting these things together allows us to do something we call lineage tracking across all executions. Because if you think about it, if you know every execution, all of its inputs, and all of its outputs, you can piece together the story of how an artifact was created. So by looking at an artifact, we can say what all of the upstream executions and artifacts were that went into producing it, and what all of the downstream runs and artifacts were that were produced using that artifact as an input.

Now, that's an extremely powerful capability, so let me talk you through some examples of what this enables. The first one is pretty straightforward. Let's say I want to list all of the training runs that I've done in the past. In this case, I'm interested in the trainer, and I want to see all of the training runs that were recorded. Here, I had two training runs, and I see all of the properties of these training runs. This is pretty straightforward; nothing new to see here.

However, I just spoke about lineage. We can visualize that lineage and all of the information that we have. The first comment to make on this slide is that we're working on a better UI; this is really just for demonstration purposes. But if you look at the end of this graph, on the right side, you see the model export path. This is the specific instance of a model that was created. And as you can see, the model was created by the trainer, and the trainer created this model by consuming a Schema, a Transform, and Examples. Again, these are specific instances. So the IDs there are not just numbering; that's the Schema with ID number four and the Transform with ID number five. And for each one of those artifacts, we also see how they were created upstream. This allows us to do lineage tracking, going forward and backward through our artifacts. The narrative I used was walking back from the model but, similarly, you could look at your training data and ask what all of the artifacts were that were produced using that training data.
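All of this bookkeeping lives in the metadata layer. As a rough illustration, not something shown in the talk, here is a minimal sketch of querying such a store with the ML Metadata (MLMD) library that TFX uses for its Metadata Store. It assumes a local SQLite-backed store and a model artifact type registered under the name "Model"; the database path is hypothetical and API details can differ between versions.

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to a local SQLite-backed metadata store (path is hypothetical).
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "/tmp/tfx_metadata.db"
config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(config)

# List every recorded execution (e.g. all past trainer runs) and its properties.
for execution in store.get_executions():
    print(execution.id, execution.properties)

# Walk lineage for the latest model artifact: events link artifacts to
# executions, so inputs and outputs can be pieced together in either direction.
models = store.get_artifacts_by_type("Model")
if models:
    output_events = store.get_events_by_artifact_ids([models[-1].id])
    producer_ids = [e.execution_id for e in output_events
                    if e.type == metadata_store_pb2.Event.OUTPUT]
    upstream_artifact_ids = [e.artifact_id
                             for e in store.get_events_by_execution_ids(producer_ids)
                             if e.type == metadata_store_pb2.Event.INPUT]
    print("Upstream artifacts of the model:", upstream_artifact_ids)
```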
This slide shows a visualization of the data distribution that went into our model. Now, at first glance, this may not be something earth-shattering, because we've done this before: we can compute statistics and we can visualize them. But if we look at the code snippet, we're not referring to data or statistics; we're referring to a model. We say, for this specific model, show me the distribution of the data that the model was trained on. And we can do this because we have a track record of all of the data and the statistics that went into this model.

We can do a similar thing in the other direction, saying, for a specific model, show me the sliced metrics that were produced downstream by TensorFlow Model Analysis, and we get this visualization, again just by looking at a model and not specifically pointing to the output of TensorFlow Model Analysis. Of course, we know all of the models that were trained and where all of the checkpoints lie, so we can start TensorBoard and point it to some of our historic runs. So you can look at the TensorBoard for any of the models that you've trained in the past. And because we have a track record of all of the models that you've trained, we can launch TensorBoard and point it to two different directories, so you can actually compare two models in the same TensorBoard instance. So this is really model tracking and experiment comparison after the fact, and we enable it by keeping a track record of all of this.

And if we have multiple models, you can also look at the data distribution for multiple models. This usually helps with debugging a model. If you train the same model twice, or on different data, and it behaves differently, it can sometimes pay off to look at whether the data distribution has changed between the two. It's hard to see in this graph, but here we're actually overlaying the two distributions of statistics, one for each model, and you would see if there's considerable drift between the two. So all of these are enabled by the lineage tracking that I just mentioned.

Another set of use cases is visualizing previous runs over time. If you train the same model over time, over new data, we can give you a time series graph of all of the evaluation metrics over time, and you can see whether your model improves or gets worse as you retrain it.

Another very powerful use case is carrying over state from previous models. Because we know that you've trained the model in the past, we can do something we call warm starting: we can re-initialize the model with weights from a previous run. Sometimes we want to re-initialize the entire model, or maybe just an embedding. In this way, we can continue training from where we left off with a new data set.

And another very powerful application of this is being able to reuse previously computed outputs. A very common workflow is to iterate on your model architecture. Now, if you have a pipeline that ingests data, applies transformations to your data, and then trains a model, every time you make a small change to your model, you don't want to recompute everything upstream. There's no reason why you would have to re-ingest your data or re-compute the transform just because you changed something in your model. Because we have a track record of all of the previous steps, we can make a decision that says: your data hasn't changed, your transform code hasn't changed, so we will reuse the artifacts that were produced upstream, and you can iterate much, much faster on your model. This improves iteration speed, and it also saves compute, because you're not re-computing things that you've already computed in the past.
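As a hedged illustration of the warm-starting idea mentioned above: this is the plain tf.estimator mechanism rather than TFX's own wiring, and the feature columns, checkpoint path, and variable regex are invented for the example.

```python
import tensorflow as tf

# Stand-in feature columns for a toy wide-and-deep model (not the real taxi features).
pickup_hour = tf.feature_column.categorical_column_with_identity(
    "trip_start_hour", num_buckets=24)
trip_miles = tf.feature_column.numeric_column("trip_miles")
wide_columns = [pickup_hour]
deep_columns = [trip_miles, tf.feature_column.embedding_column(pickup_hour, dimension=8)]

# Warm start: re-initialize matching variables from a previous run's checkpoint
# instead of starting them from random initial values.
warm_start = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from="/tmp/previous_run",  # hypothetical checkpoint directory
    vars_to_warm_start=".*embedding.*",           # e.g. carry over only the embeddings
)

estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 70, 50, 25],
    warm_start_from=warm_start,
)
```

In the pipeline, the metadata store is presumably what makes this convenient, since it records where previous runs and their checkpoints live; the snippet only illustrates the underlying TensorFlow mechanism.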
So now we've talked about components quite a bit. How do we actually orchestrate TFX pipelines?

First, every component has something we call a driver and a publisher. The driver's responsibility is to retrieve state from the Metadata Store to inform what work needs to be done. In the example of Model Validation, the driver looks into the Metadata Store to find the last validated model, because that's the model we need to compare with the new model. The publisher then keeps a record of everything that went into this component, everything that was produced, and all of the runtime configuration, so that we can do the lineage tracking that I mentioned earlier. And in between sits the executor. The executor is blissfully unaware of all of this metadata bookkeeping, because it's extremely important for us to keep that piece relatively simple. If you want to change the code in one of these components, if you want to change the training code, you shouldn't have to worry about drivers and publishers; you should only have to worry about the executor. And it also makes it much, much easier to write new components for the system.

And then we have one shared configuration model that sits on top and configures end-to-end TFX pipelines. Let's take a look at what that looks like. As you can see, this is a Python DSL. From top to bottom, you see that it has an object for each one of these components: ExampleGen, StatisticsGen, and so on. The trainer component, as you can see, receives its configuration, which says that its inputs come from the Transform output and the schema that was inferred.

And let's see what's inside of that trainer. It's really just TensorFlow code. In this case, as you can see, we just use an estimator, and we use the estimator train_and_evaluate method to actually train this model. And it takes an estimator; we just use one of our canned estimators in this case, a wide and deep model that you can just instantiate and return. But what's important to highlight here is that we don't have an opinion on what this code looks like. It's just TensorFlow. Anything that produces a SavedModel as an output is fair game. You can use a Keras model that produces the inference graph or, if you choose to, you can go lower level and use some of the lower-level APIs in TensorFlow. As long as it produces a SavedModel in the right format that can be used by TensorFlow Serving, or the eval graph that can be used by TensorFlow Model Analysis, you can write any type of TensorFlow code you want.
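To make "anything that produces a SavedModel is fair game" concrete, here is a minimal, hedged sketch of the Keras route mentioned above; it is not code from the talk, and the toy architecture and export path are invented. The only point is that the end result is a SavedModel.

```python
import tensorflow as tf

# A toy Keras model; the architecture is purely illustrative.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Export as a SavedModel, the format that TensorFlow Serving (and the rest of
# the TFX pipeline) consumes. The export path is hypothetical.
tf.saved_model.save(model, "/tmp/exported_model/1")
```

The estimator route shown in the talk ends in the same place: an exported SavedModel (plus the eval graph) that the downstream components consume.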
So, if you've noticed, we still haven't talked about orchestration. We now have a configuration system, we have components, and we have a metadata store. And I know what some of you may be thinking right now: is he going to announce a new orchestration system? The good news is no-- at least not today. Instead, we talked to a lot of our users, to a lot of you, and unsurprisingly found out that there's a significant installed base of orchestration systems in your companies. We just heard from Airbnb; of course, they developed Airflow. There are a lot of companies that use Kubeflow, and there are a number of other orchestration systems.

So we made a deliberate choice to support any number of orchestration systems, because we don't want to make you adopt a different orchestration system just to orchestrate TFX pipelines. The installed base was reason number one. Reason number two is that we really want you to extend TFX pipelines. What we publish is really just our opinionated version of what a TFX pipeline looks like and the components that we use at Google. But we want to make it easy for you to add new components before, after, and in parallel, to customize the pipeline to your own use cases. All of these orchestration systems are made to express arbitrary workflows, so if you're already familiar with one of them, you should be able to use it for your use case.

Here we show you two examples of what that looks like, with Airflow and with Kubeflow Pipelines. On the left, you see that same TFX pipeline configured to execute on Airflow. In my example, we use this for a small data set so we can iterate on it fast on a local machine; in the Chicago taxi cab example, we use 10,000 records. And on the right side, you see the exact same pipeline executed on Kubeflow Pipelines on Google Cloud, so that you can take advantage of Cloud Dataflow and Cloud ML Engine and scale it up to the 100 million [INAUDIBLE] in that data set. What's important here is that it's the same configuration and the same components. We run the same components in both environments, and you can choose how you want to orchestrate them in your own favorite orchestration system.

So this is what it looks like put together. TFX goes all the way from your raw data to your deployment environment. We discussed the shared configuration model at the top, the metadata system that keeps track of all the runs no matter how you orchestrate the components, and then the two ways that we published of how to orchestrate them, with Airflow and with Kubeflow Pipelines. But, as mentioned, you can choose to orchestrate a TFX pipeline any way you want.

All of this is available now. You can go to GitHub, at github.com/tensorflow/tfx, to check out our code, and see our new user guide at tensorflow.org/tfx. I also want to point out that tomorrow we have a workshop where you can get hands-on experience with TensorFlow Extended, from 12:00 to 2:00 PM. There are no prerequisites; you don't even have to bring your own laptop.
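For readers who want to see the shape of such a pipeline in code, here is a hedged sketch of an end-to-end definition in the Python DSL. It follows a more recent TFX release than the one shown in this talk (the API has evolved since), it omits the evaluation and validation components for brevity, and all paths and names are illustrative.

```python
from tfx import v1 as tfx


def create_pipeline(data_root, module_file, pipeline_root,
                    serving_dir, metadata_path):
    """Sketch of a TFX pipeline; argument names follow a recent TFX release."""
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs['examples'])
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs['statistics'])
    example_validator = tfx.components.ExampleValidator(
        statistics=statistics_gen.outputs['statistics'],
        schema=schema_gen.outputs['schema'])
    transform = tfx.components.Transform(
        examples=example_gen.outputs['examples'],
        schema=schema_gen.outputs['schema'],
        module_file=module_file)
    trainer = tfx.components.Trainer(
        module_file=module_file,
        examples=example_gen.outputs['examples'],
        transform_graph=transform.outputs['transform_graph'],
        schema=schema_gen.outputs['schema'],
        train_args=tfx.proto.TrainArgs(num_steps=10000),
        eval_args=tfx.proto.EvalArgs(num_steps=5000))
    # Evaluator / model validation omitted for brevity in this sketch.
    pusher = tfx.components.Pusher(
        model=trainer.outputs['model'],
        push_destination=tfx.proto.PushDestination(
            filesystem=tfx.proto.PushDestination.Filesystem(
                base_directory=serving_dir)))

    return tfx.dsl.Pipeline(
        pipeline_name='chicago_taxi_pipeline',
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen, example_validator,
                    transform, trainer, pusher],
        metadata_connection_config=(
            tfx.orchestration.metadata.sqlite_metadata_connection_config(
                metadata_path)))

# The same pipeline object can be handed to whichever orchestrator you run,
# e.g. a local runner for quick iteration:
# tfx.orchestration.LocalDagRunner().run(create_pipeline(...))
```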
So with this, we're going to jump into an end-to-end example of how to actually go through the entire workflow with the Chicago taxi cab data set. Just to set some context: the Chicago taxi data set is a record of cab rides in Chicago over some period of time. And it contains everything that you would expect: when the trips started, when they ended, where they started and where they ended, how much was paid, and how it was paid. Now, some of these features need some transformation. The latitude and longitude features need to be bucketized; it's usually a bad idea to do math on geographic coordinates, so we bucketize them and treat them as categorical features. Vocab features, which are strings, need to be integerized, and some of the dense float features need to be normalized. We feed them into a wide and deep model: the dense features go into the deep part of the model, and all of the others we use in the wide part. And the label that we're trying to predict is a Boolean: whether the tip is larger than 20% of the fare.

So really what we're doing is building a high-tip predictor. Just in case there are any cab drivers in the audience or listening online, come find me later and I can help you set this up. I think it would be really beneficial to you if you could predict whether a cab ride will give a high tip or not.

So let's jump right in, starting with data validation and transformation. The first part of the TFX pipeline is ingesting data, validating that data-- checking whether it's OK-- and then transforming it such that it can be fed into a TensorFlow graph. We start with ExampleGen. The ExampleGen component really just ingests data into a TFX pipeline. It takes as input your raw data. We ship default capabilities for CSV and TFRecords, but that's of course extensible, so we can ingest any type of data into these pipelines. What's important is that, after this step, the data is in a well-defined place where we can find it, in a well-defined format, because all of our downstream components standardize on that format, and it's split between training and eval. You've seen the configuration of all of these components before; it's very minimal configuration in most cases.

Next, we move on to data analysis and validation. I think a lot of you have a good intuition for why that is important. Machine learning is just the process of taking data and learning models that predict some field in your data. And you're also aware that if you feed garbage in, you get garbage out. There's no hope of a good machine learning model if the data are wrong or have errors in them. And this is even reinforced if you have continuous pipelines that train on data produced by a bad model, so that you just keep reinforcing the same problem.

So first, what I would argue is that data understanding is absolutely critical for model understanding. There's no hope of understanding why a model is mis-predicting something if you don't understand what the data looked like and whether the data that was actually fed into the model was OK. A question you might ask as a cab driver is, why are my tip predictions bad in the morning hours? And for all of the questions that I'm highlighting here, I'm going to try to answer them with the tools that we have available in TFX, so I will come back to these questions.

Next, we really would like you to treat your data as you treat code. There's a lot of care taken with code these days: it's peer reviewed, it's checked into shared repositories, it's version controlled, and so on. And data really needs to be a first-class citizen in these systems. The question "what are the expected values for our payment types?" is really a question about the schema of your data. And what we would argue is that the schema needs to be treated with the same care as you treat your code.

And catching errors early is absolutely critical. Because, as all of you know, errors propagate through the system. If your data are not OK, then everything downstream goes wrong as well, and these errors are extremely hard to correct or fix if you catch them late in the process. So catching those problems as early as possible is absolutely critical. In the taxi cab example, you would ask a question like: is this new company that showed up in my data set a typo, or is it a real company, a natural evolution of my data set?
So let's see if we can answer some of these questions with the tools we have available, starting with statistics. The StatisticsGen component takes in your data and computes statistics. The data can be training or eval data, and it can also be serving logs, in which case you can look at the skew between your training and your serving data. The statistics really capture the shape of your data, and the visualization components we have draw your attention to things that need it. For example, if a feature is missing most of the time, it's highlighted in red. The configuration for this component is minimal as well.

Let me zoom into some of these visualizations. One of the questions that I posed earlier was, why are my tip predictions bad in the morning hours? One thing you could do is look at your data set and see that for trip start hour, in the morning hours between 2:00 AM and 6:00 AM, you just don't have much data, because there aren't that many taxi trips at that time. Not having a lot of data in a specific region of your data can mean that your model is not robust, or has higher variance, and this could lead to worse predictions.

Next, we move on to SchemaGen. SchemaGen takes as input the output of StatisticsGen, and it infers a schema for you. In the case of the Chicago taxi cab example, there are very few features, so you could handwrite that schema, although it would be hard to handwrite what you expect the string values to look like. But if you have thousands of features, it's hard to handwrite that expectation. So we infer the schema for you the first time you run. The schema really represents what you expect from your data: what good data looks like, what values your string features can take on, and so on. Again, very minimal configuration. And the question we can answer now is, what are the expected values for payment types? If you look here at the very bottom, you see that the field payment type can take on cash, credit card, dispute, no charge, pcard, and unknown. That's the expectation of my data that's expressed in my schema. And the next time I run this, if this field takes on a different value, I will get an anomaly-- which comes from the ExampleValidator.

The ExampleValidator takes the statistics and the schema as input and produces an anomaly report. That anomaly report basically tells you if your data are missing features, if features have the wrong valency, or if your distributions have shifted for some of these features. And it's important to highlight that the anomaly report is human readable, so you can look at it and understand what's going on. But it's also machine readable, so you can automatically make decisions based on the anomalies, and decide not to train a model if you have anomalies in your data. So the ExampleValidator just takes as input the statistics and the schema.

Let me zoom into one of these anomaly reports. Here you can see that the field company has taken on unexpected string values. That just means that these string values weren't in your schema before, and that can be a natural evolution of your data. The first time you ran this, maybe you just didn't see any trips from those taxi companies. By looking at it, you can say, well, all of these look like normal taxi cab companies, so you can update your schema with this expectation. Or, if you saw a lot of scrambled text in here, you would know that there's a problem in your data that you would have to go and fix.
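These three components (StatisticsGen, SchemaGen, ExampleValidator) are built on the TensorFlow Data Validation library, and the same workflow can be sketched with the library on its own. This is a rough, hedged example rather than code from the talk; the CSV paths and the appended company name are made up.

```python
import tensorflow_data_validation as tfdv

# Compute statistics over the training split (path is hypothetical).
train_stats = tfdv.generate_statistics_from_csv(
    data_location="/tmp/taxi/train/data.csv")
tfdv.visualize_statistics(train_stats)

# Infer a schema from those statistics, then review and curate it.
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema)

# Validate a newer slice of data (e.g. the eval split) against the schema.
eval_stats = tfdv.generate_statistics_from_csv(
    data_location="/tmp/taxi/eval/data.csv")
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)

# If a new value (say, a new taxi company) is legitimate, relax the schema.
tfdv.get_domain(schema, "company").value.append("New Cab Co")  # illustrative value
```

The display calls render in a notebook, and the last line is the "accept the new value by updating the schema" path described above.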
Moving on, we get to Transform. Let me just recap the types of transformations that we want to do. I've listed them here in red-- in blue, sorry. We want to bucketize the longitude and latitude features. We want to convert the strings to ints, which is also called integerizing. And we want to normalize the dense features to a mean of zero and a standard deviation of one. Now, all of these transformations require a full pass over your data to compute some statistics. To bucketize, you need to figure out the boundaries of the buckets. To map strings to integers, you need to see all of the string values that show up in your data. And to scale to a z-score, you need to compute the mean and the standard deviation.

This is exactly what we built TensorFlow Transform for. TensorFlow Transform allows you to express a pre-processing function over your data that contains some of these transformations that require a full pass over the data, and it will then automatically run a data processing graph to compute those statistics. In this figure, the orange boxes are the statistics that we require. For normalization, we require the mean and the standard deviation. TensorFlow Transform has a utility function, scale to z-score, and it will create a data processing graph for you that computes the mean and the standard deviation of your data, returns the results, and injects them as constants into your transformation graph. Now that graph is a hermetic graph that contains all of the information you need to apply your transformations. And that graph can then be used in training and in serving, guaranteeing that there's no drift between them. This basically eliminates the chance of training/serving skew, because the same transformations are applied in both places. At serving time, we just need to feed in the raw data, and all the transformations are done as part of the TensorFlow graph.

So what does that look like in the TFX pipeline? The Transform component takes in data, the schema-- which allows us to parse the data more easily-- and code, in this case the user-provided pre-processing function. And it produces the Transform graph, which I just mentioned: a hermetic graph that applies the transformations and gets attached to your training and your serving graphs. It can optionally materialize the transformed data. That's a performance optimization: when you want to feed hardware accelerators really fast, it can sometimes pay off to materialize some transformations before your training step. The configuration of the component takes in a module file; that's just the file where you define your pre-processing function.

In this code snippet, the actual code is not that important, but what I want to highlight is the last line, which is how we transform our label. The label is a logical expression saying, is the tip greater than 20% of my fare? And the reason I want to highlight this is that you don't need analyze phases for all of your transformations. In cases where you don't need an analyze phase, the transformation is just a regular TensorFlow graph that transforms the features. However, to scale something to a z-score, to integerize strings, or to bucketize a feature, you definitely need analyze phases, and that's what Transform helps you with. So this is the user code that you would write.
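As a hedged sketch of what such a preprocessing_fn can look like: this mirrors the pattern of the Chicago taxi example, but the feature keys are simplified, dense inputs are assumed, and it is not the exact code shown on the slide.

```python
import tensorflow as tf
import tensorflow_transform as tft


def preprocessing_fn(inputs):
    """Illustrative Transform module file entry point (feature names invented)."""
    outputs = {}

    # Bucketize geographic coordinates and treat them as categorical features.
    for key in ["pickup_latitude", "pickup_longitude"]:
        outputs[key + "_bucketized"] = tft.bucketize(inputs[key], num_buckets=10)

    # Integerize a string (vocab) feature via a full pass over the data.
    outputs["payment_type_int"] = tft.compute_and_apply_vocabulary(
        inputs["payment_type"])

    # Scale a dense float feature to zero mean and unit variance.
    outputs["trip_miles_scaled"] = tft.scale_to_z_score(inputs["trip_miles"])

    # The label needs no analyze phase: it is a plain TensorFlow expression.
    outputs["tips_over_20pct"] = tf.cast(
        tf.greater(inputs["tips"], inputs["fare"] * 0.2), tf.int64)

    return outputs
```

Each tft call that needs a full pass (bucketize, compute_and_apply_vocabulary, scale_to_z_score) is an analyze phase; the label line at the end is the plain TensorFlow expression the talk calls out.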
And TF Transform will create a data processing graph, return the results, and produce the transform graph that you need to apply these transformations.

So now that we're done with all of this, we still haven't trained our machine learning model yet, right? But we've made sure that our data is in a place where we can find it. We know it's in a format that we can understand. We know it's split between training and eval. We know that our data are good, because we validated them. And we know that we're applying transforms consistently between training and serving.

Which brings us to the training step. This is where the magic happens, or so they say. But, actually, it's not, because the training step in TFX is really just a TensorFlow graph and the TensorFlow training step. The Trainer component takes in the output of Transform, as mentioned-- the transform graph and, optionally, the materialized data-- a schema, and the training code that you provide. And it creates TensorFlow models as output. Those models are in the SavedModel format, which is the standard serialized model format in TensorFlow, and which you've heard quite a bit about this morning. In this case, we actually produce two of them. One is the inference graph, which is used by TensorFlow Serving, and the other is the eval graph, which contains the metrics and the necessary annotations to perform TensorFlow Model Analysis.

This is the configuration that you've seen earlier. Again, the trainer takes in a module file. And the code that's actually in that module file-- I'm just going to show you the same slide again, just to reiterate the point-- is just TensorFlow. In this case, it's the train_and_evaluate method from Estimators, and a canned estimator that is returned here. But again, just to make sure you're aware of this, any TensorFlow code here that produces a SavedModel in the right format is fair game.

So with this, we've now trained the TensorFlow model. And now I'm going to hand it off to my colleague, Christina, who's going to talk about model evaluation and analysis.

[MUSIC PLAYING]