[MUSIC PLAYING] KARMEL ALLISON: Hi, and welcome to Coding TensorFlow. I'm Karmel Allison, and I'm here to guide you through a scenario using TensorFlow's high-level APIs. This video is the first in a three-part series. In this one, we'll look at data and, in particular, how to prepare and load your data for machine learning. The rest of the series is available on this channel, so don't forget to hit that subscribe button.

Building a machine learning model is a multi-stage process. You have to collect, clean, and process your data, prototype and iterate on your model architecture, train and evaluate results, prepare your model for production serving, and then do it all over again, because a model is a living thing that will have to be updated and improved. TensorFlow's high-level APIs aim to help you at each stage of your model's lifecycle, from the beginning of an idea to training and serving large-scale applications. In this series, I will walk through the key steps in developing a machine learning model and show you what TensorFlow provides for you at each step. I'll also cover some of the new developments we are working on to continue to improve your workflow.

We start with a problem and an associated data set. We will use the covertype data set from the US Forest Service and Colorado State University, which has over 500,000 rows of geophysical data collected from regions in national forest areas. We are going to use the features in this data set to try to predict which wilderness area each region belongs to. There is a mix of features that we'll be working with. Some are real values: elevation, slope, aspect, and so on. Some are real values that have been binned into an 8-bit scale, and some are categorical values that assign integers to soil types and wilderness area names.

If we inspect the first couple of rows of our data, this is what we see: integers and no header, so we have to work from the Info file. OK, so here we can see that we have some of our real values, it looks like some of the categorical values are one-hot encoded, and some are just categories. Some features span multiple columns, so we'll have to handle that.

Where do we start? What's the first thing we should do here? I'm going to suggest that when you're prototyping a new model in TensorFlow, the very first thing you should do is enable eager execution. It's simple: you just add a single line after importing TensorFlow, and you're good to go. Rather than deferring execution of your TensorFlow graph, eager execution runs ops immediately. The result is that you can write your models eagerly while you're experimenting and iterating, but you still get the full benefit of TensorFlow graph execution when it comes time to train and deploy your model at scale.

The first thing we're going to want to do is load our data and process it into columns so that we can feed it into a model. The data is a CSV file with 55 columns of integers. We'll go over each of those in detail in a bit, but first we will use the TensorFlow CSV data set to load our data from disk. This particular data set doesn't have a header, but if it did, we could process that as well with the CSV data set.
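Putting those two steps together, here is a minimal sketch of what this can look like. The file path is a placeholder, and the API names follow the TensorFlow 1.x style used in this series: tf.enable_eager_execution is only needed in 1.x (eager execution is on by default in 2.x), and CsvDataset lives under tf.data.experimental in recent releases and under tf.contrib.data in older 1.x ones.

    import tensorflow as tf

    # One line after the import turns on eager execution
    # (TensorFlow 1.x; in 2.x eager execution is already the default).
    tf.enable_eager_execution()

    # The covertype data is a headerless CSV with 55 integer columns.
    # This path is just a placeholder for wherever the file lives.
    CSV_PATH = 'covtype.csv'

    # One entry per column; giving a dtype tells the data set to parse
    # every field as a required int32 value.
    record_defaults = [tf.int32] * 55

    dataset = tf.data.experimental.CsvDataset(
        CSV_PATH, record_defaults, header=False)

    # Because eager execution is on, we can peek at a row right away.
    for row in dataset.take(1):
        print(len(row))   # 55 scalar integer tensors, one per column
        print(row[0])     # e.g. the first column of that row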
Now, a TensorFlow data set is similar to a NumPy array or a Pandas DataFrame in that it reads and processes data. But instead of being optimized for in-memory analysis, it is designed to take data and run the set of operations needed to process and consume that data for training. Here, we are telling TensorFlow to read our data from disk, parse the CSV, and process the incoming data as a vector of 55 integers. Because we are running with eager execution enabled, our data set already represents our data, and we can even check what each row currently looks like. If we take the first row, we can see that right now each row is a tuple of 55 integer tensors, not yet processed, batched, or even split into features and labels.

So we have tuples of 55 integers, but we want our data to reflect the structure we know is in there. For that, we can write a function to apply to our data set row by row. This function will take in the tuple of 55 integers in each row. A data set is expected to return tuples of features and labels, so our goal for each row is to parse it and return the set of features we care about plus a class label.

So what needs to go in between? This function is going to be applied at runtime to each row of data, but it will be applied efficiently by TensorFlow data sets, so this is a good place to put things like image processing, adding random noise, or other special transformations. In our case, we will handle most of our transformations using feature columns, which I will explain more in a bit, so our main goal in the parsing function is to make sure we correctly separate and group our columns of features.

For example, if you read over the details of the data set, you will see that soil type is a categorical feature that is one-hot encoded and spread out over 40 of our integer columns. We combine those into a single length-40 tensor so that we can learn soil type as a single feature rather than 40 independent features. Then we can combine the soil-type tensor with the other features, which are spread out over the 55 columns of the original data set. We slice the tuple of incoming values to make sure we get everything we need, and then we zip those up with human-readable column names to get a dictionary of features that we can process further later. Finally, we convert our one-hot encoded wilderness area columns into a class label in the range 0 to 3. We could leave the labels one-hot encoded as well, and for some model architectures or loss calculations that might be preferable. That gives us features and a label for each row.

We then map this function over our data row by row, and batch the rows into sets of 64 examples. Using TensorFlow data sets here lets us take advantage of the many built-in performance optimizations that data sets provide for this kind of mapping and batching, which helps remove I/O bottlenecks. There are many other tricks for I/O performance optimization, depending on your system, that we won't cover here, but a guide is linked in the description below.

Because we're using eager execution, we can check what our data looks like after all of this, and you can see that we now have parsed dictionaries of ints with nice human-readable names. Each feature has been batched, so a feature that was a single number is now a length-64 tensor, and our conversion of soil type results in a tensor with shape 64 by 40. We can also see that we have a single tensor for the class labels, which holds the category indices as expected.
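Here is a rough sketch of what that parsing, mapping, and batching might look like, continuing from the data set created above. It assumes the column ordering described in the Info file (ten real-valued columns, then four one-hot wilderness-area columns, then forty one-hot soil-type columns, with the binned cover type last), keeps the cover type column as one more feature, and uses illustrative column and function names rather than the exact ones from the video.

    # Human-readable names for the grouped features; illustrative,
    # based on the Info file's column descriptions.
    FEATURE_NAMES = [
        'elevation', 'aspect', 'slope',
        'horizontal_distance_to_hydrology', 'vertical_distance_to_hydrology',
        'horizontal_distance_to_roadways',
        'hillshade_9am', 'hillshade_noon', 'hillshade_3pm',
        'horizontal_distance_to_fire_points',
        'soil_type', 'cover_type',
    ]

    def parse_row(*row):
        # row is the tuple of 55 scalar integer tensors from CsvDataset.
        # Columns 10-13 are the one-hot wilderness area; argmax collapses
        # them into a single class label in the range 0 to 3.
        class_label = tf.argmax(tf.stack(list(row[10:14])), axis=0)

        # Columns 14-53 are the 40 one-hot soil type columns; stack them
        # into one length-40 tensor so soil type is a single feature.
        soil_type = tf.stack(list(row[14:54]))

        # Slice out the 10 real-valued columns, add the combined soil type
        # and the cover type column, and zip with the names above to build
        # the feature dictionary.
        feature_values = list(row[:10]) + [soil_type, row[54]]
        features = dict(zip(FEATURE_NAMES, feature_values))
        return features, class_label

    # Apply the parsing row by row, then batch into sets of 64 examples.
    dataset = dataset.map(parse_row).batch(64)

    # With eager execution we can inspect one batch directly.
    for features, labels in dataset.take(1):
        print(features['elevation'].shape)  # (64,)    one value per example
        print(features['soil_type'].shape)  # (64, 40) combined one-hot soil type
        print(labels.shape)                 # (64,)    class indices 0-3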
Just to keep our eyes on the big picture, let's see where we are. We've taken our raw data and put it into a TensorFlow data set that generates dictionaries of feature tensors and labels. But something is still wrong with the integers we have as features here. Anyone care to venture a guess? We have lots of feature types: some are continuous, some are categorical, some are one-hot encoded. We need to represent these in a way that is meaningful to an ML model. You'll see how to fix that using feature columns in part two of this series, right here on YouTube. So don't forget to hit that subscribe button, and I'll see you there. [MUSIC PLAYING]