  • [MUSIC PLAYING]

  • KARMEL ALLISON: Hi, and welcome to Coding TensorFlow.

  • I'm Karmel Allison, and I'm here to guide you

  • through a scenario using TensorFlow's high-level APIs.

  • This video is the first in a three-part series.

  • In this one, we'll look at data and, in particular,

  • how to prepare and load your data for machine learning.

  • The rest of the series is available on this channel,

  • so don't forget to hit that subscribe button.

  • Building a machine learning model is a multi-stage process.

  • You have to collect, clean, and process your data, prototype

  • and iterate on your model architecture,

  • train and evaluate results, prepare your model

  • for production serving, and then, you

  • have to do it all over again because the model is

  • a living thing that will have to be updated and improved.

  • TensorFlow's high-level APIs aim to help you at each stage

  • of your model's lifecycle--

  • from the beginning of an idea to training and serving

  • large-scale applications.

  • In this series, I will walk through the key steps

  • in developing a machine learning model

  • and show you what TensorFlow provides for you at each step.

  • And then, I'll also cover some of the new developments

  • that we are working on to continue

  • to improve your workflow.

  • We start with the problem and an associated data set.

  • We will use the covertype data set from the US Forest

  • Service and Colorado State University, which

  • has about 500,000 rows of geophysical data collected

  • from particular regions in National Forest areas.

  • We are going to use the features in this data set

  • to try to predict the wilderness area that each region belongs to.

  • And there's a mix of features that we'll be working with.

  • Some are real values--

  • elevation, slope, aspect, and so on.

  • Some are real values that have been binned

  • into an 8-bit scale, and some are categorical values

  • that assign integers to soil types and wilderness area

  • names.

  • If we inspect the first couple rows of our data,

  • this is what we see--

  • integers, no header, so we have to work from the Info file.

  • OK, so here we can see that we have some of our real values,

  • and it looks like some of the categorical values

  • are one-hot encoded, and some are just categories.

  • Some features span multiple columns,

  • so we'll have to handle that.

  • Where do we start?

  • What's the first thing we should do here?

  • I'm going to suggest to you that when you're prototyping

  • a new model in TensorFlow, the very first thing you should do

  • is enable eager execution.

  • It's simple.

  • You just add a single line after importing TensorFlow,

  • and you're good to go.

  • Rather than deferring execution of your TensorFlow

  • graph, eager execution runs ops immediately.

  • The result is that you can write your models in eager

  • while you're experimenting and iterating,

  • but you still get the full benefit of TensorFlow graph

  • execution when it comes time to train and deploy

  • your model at scale.
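For reference, here is what that looks like in code. This is a minimal sketch assuming TensorFlow 1.x, where eager execution is opt-in (in TensorFlow 2.x it is on by default):

```python
import tensorflow as tf

# In TF 1.x, eager execution is opt-in; enable it once, right after import.
tf.enable_eager_execution()

print(tf.executing_eagerly())  # True: ops now run immediately
```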

  • The first thing we're going to want to do

  • is load our data and process its columns so that we

  • can feed it into a model.

  • The data is a CSV file with 55 columns of integers.

  • We'll go over each of those in detail in a bit,

  • but first we will use the TensorFlow CSV data

  • set to load our data from disk.

  • This particular data set doesn't have a header, but if it did,

  • we could process that as well with the CSV data set.

  • Now, a TensorFlow data set is similar to a NumPy

  • array or a Pandas DataFrame in that it reads and processes

  • data.

  • But instead of being optimized for in-memory analysis,

  • it is designed to take data and run the set of operations

  • that are necessary to process and consume

  • that data for training.

  • Here, we are telling TensorFlow to read our data from disk,

  • parse the CSV, and process the incoming data

  • as a vector of 55 integers.
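A sketch of that loading step, assuming TensorFlow 1.13+ (where the CSV dataset lives under tf.data.experimental) and a hypothetical local file covertype.csv:

```python
# Treat all 55 CSV columns as int32 when parsing.
record_defaults = [tf.int32] * 55

dataset = tf.data.experimental.CsvDataset(
    ["covertype.csv"],               # hypothetical path to the raw data
    record_defaults=record_defaults,
    header=False)                    # this particular file has no header row
```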

  • Because we are running with eager execution enabled,

  • our data set here already represents our data,

  • and we can even check to see what each row currently

  • looks like.

  • If we take the first row, we can see that right now, each row is

  • a tuple of 55 integer tensors--

  • not yet processed, batched, or even split

  • into features and labels.
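With eager execution enabled, the dataset is directly iterable, so that check can be done like this (a sketch, continuing from the snippet above):

```python
# Grab the first record; it is a tuple of 55 scalar int32 tensors.
for row in dataset.take(1):
    print(len(row))  # 55
    print(row[0])    # a scalar int32 tensor holding the first column
```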

  • So we have tuples of 55 integers,

  • but we want our data to reflect the structure of the data we

  • know is in there.

  • For that, we can write a function

  • to apply to our data set row by row.

  • This function will take in the tuple

  • of 55 integers in each row.

  • A data set is expected to return tuples of features and labels.

  • So our goal with each row is to parse the row

  • and return the set of features we care about plus a class

  • label.

  • So what needs to go in between here?

  • This function is going to be applied at runtime

  • to each row of data, but it will be applied efficiently

  • by TensorFlow data sets.

  • So this is a good place to put things like image

  • processing or adding random noise

  • or other special transformations.

  • In our case, we will handle most of our transformations

  • using feature columns, which I'll explain more about in a bit,

  • so our main goal in the parsing function

  • is to make sure we correctly separate and group

  • our columns of features.

  • So for example, if you read over the details of the data set,

  • you will see that the soil type is a categorical feature that

  • is one-hot encoded.

  • It is spread out over 40 of our integers.

  • We combine those here into a single length-40 tensor

  • so that we can learn soil type as a single feature rather than

  • 40 independent features.

  • Then we can combine the soil-type tensor

  • with the other features which are spread out

  • over the set of 55 columns in the original data set.

  • We can slice the tuple of incoming values

  • to make sure we get everything we need.

  • And then we zip those up with human-readable column names

  • to get a dictionary of features that we

  • can process further later.

  • Finally, we convert our one-hot encoded wilderness area

  • class into a class label that is in the range 0 to 3.

  • We could have left it one-hot encoded

  • instead, and for some model architectures or loss

  • calculations that might be preferable.

  • And that gives us features and a label for each row.
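Here is a sketch of such a parsing function, assuming the column layout documented in the data set's Info file: 10 real-valued columns, then 4 one-hot wilderness-area columns, then 40 one-hot soil-type columns, and finally a cover-type column. The column names are illustrative:

```python
# Names for the 12 grouped features we keep (illustrative).
csv_column_names = [
    'elevation', 'aspect', 'slope',
    'horizontal_distance_to_hydrology', 'vertical_distance_to_hydrology',
    'horizontal_distance_to_roadways',
    'hillshade_9am', 'hillshade_noon', 'hillshade_3pm',
    'horizontal_distance_to_fire_points',
    'soil_type', 'cover_type',
]

def _parse_csv_row(*vals):
    # Columns 14-53 are the one-hot soil type; combine them into a
    # single length-40 tensor so soil type is learned as one feature.
    soil_type = tf.convert_to_tensor(vals[14:54])

    # Slice the incoming tuple: the 10 real-valued columns, the
    # combined soil type, and the final cover-type column.
    feature_vals = vals[:10] + (soil_type, vals[54])
    features = dict(zip(csv_column_names, feature_vals))

    # Columns 10-13 one-hot encode the wilderness area; argmax
    # converts that into a class label in the range 0 to 3.
    class_label = tf.argmax(vals[10:14], axis=0)
    return features, class_label
```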

  • We then map this function over our data row-wise,

  • and then we batch the rows into sets of 64 examples.
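In code, those two steps chain directly onto the dataset (a sketch):

```python
# Parse every row into (features, label), then batch 64 examples at a time.
dataset = dataset.map(_parse_csv_row).batch(64)
```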

  • Using TensorFlow data sets here allows

  • us to take advantage of many built-in performance

  • optimizations that data sets provide

  • for this type of mapping and batching,

  • which helps remove I/O bottlenecks.

  • There are many other tricks for I/O performance optimization,

  • depending on your system, that we won't cover here,

  • but a guide is included in the description below.

  • Because we're using eager execution,

  • we can check to see what our data looks like after this,

  • and you can see that now we have parsed dictionaries of ints

  • with nice human-readable names.

  • Each feature has been batched.

  • So a feature that is a single number is a length-64 tensor,

  • and we can see that our conversion of soil type

  • results in a tensor with a shape of 64 by 40.

  • We can also see that we have a single tensor for the class

  • labels, which holds the category indices as expected.
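Because eager execution is on, one batch can be pulled out and inspected directly (a sketch; the feature names come from the parsing function above):

```python
# Take one batch and check the shapes described above.
for features, labels in dataset.take(1):
    print(features['elevation'].shape)  # (64,)   one value per example
    print(features['soil_type'].shape)  # (64, 40) combined one-hot columns
    print(labels.shape)                 # (64,)   class indices in [0, 3]
```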

  • Just to keep our eyes on the big picture here,

  • let's see where we are.

  • We've taken our raw data and put it

  • into a TensorFlow data set that generates dictionaries

  • of feature tensors and labels.

  • But something is still wrong with the integers

  • we have as features here.

  • Anyone care to venture a guess?

  • We have lots of feature types-- some are continuous,

  • some are categorical, some are one-hot encoded.

  • We need to represent these in a way that

  • is meaningful to an ML model.

  • You'll see how to fix that using feature

  • columns in part two of this series, right here on YouTube.

  • So don't forget to hit that subscribe button

  • and I'll see you there.

  • [MUSIC PLAYING]
