
  • ♪ (music) ♪

  • Hi everybody and welcome

  • to this episode of TensorFlow Meets.

  • I'm Laurence Moroney and I'm really delighted

  • to have Ryan Sepassi from the Google AI research team.

  • And Ryan, you work on TensorFlow Datasets, right?

  • That's right.

  • - Could you tell us a bit more about it? - Absolutely.

  • TensorFlow Datasets is a new library,

  • it's on PyPI and GitHub,

  • and what it aims to do is standardize the interface

  • to a lot of public research datasets.

  • So we've actually shipped already with about 30 datasets

  • and we're adding more every day.

  • Nice. Now one of the things that I found,

  • particularly when trying to learn TensorFlow,

  • is that a lot of the code you see in tutorials is all about,

  • "Here's where you go to get the data and then you download it

  • and then you unzip it here and then you put these files here

  • and then you get the features from here,

  • and the labels from there," and that kind of stuff.

  • Now part of TFDS is really to try and simplify all of that, right?

  • That's absolutely right. We noticed the same thing.

  • Researchers and folks who do machine learning,

  • one of the first things they have to do

  • is just clean the input data right off the bat.

  • And input data is in all sorts of formats,

  • which of course makes sense:

  • when you're generating data at rest to share it,

  • it makes sense to have it in a certain format.

  • But we as machine learning practitioners,

  • really want data in a format

  • that's ready to feed into a machine learning pipeline.

  • And so TFDS knows the format of all the source datasets,

  • pre-processes them, puts them in a standard format,

  • and now that standard format

  • is ready to feed into your machine learning pipeline.

  • So it should help both advanced practitioners

  • and people who are just trying to get started,

  • because now it's the same API,

  • it's like a one-liner, to get a ton of datasets.
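
A minimal sketch of that one-liner, using the public tfds.load API; the dataset name and split here are just illustrative:

```python
# pip install tensorflow-datasets
import tensorflow_datasets as tfds

# One line to get a ready-to-use tf.data.Dataset of dict records.
ds = tfds.load("mnist", split="train", shuffle_files=True)

for example in ds.take(1):
    # Each record is a dict of features, e.g. "image" and "label" for MNIST.
    print(example["image"].shape, example["label"])
```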

  • Right, and one of the really nice things

  • that I found about that was,

  • for people who are just trying to get started,

  • all of the tutorials seem to have the same old datasets, right?--

  • It was like MNIST handwriting, it was Fashion-MNIST--

  • because they were relatively easy to use

  • and it was relatively easy to get the data and actually use it.

  • Now we can actually have a little bit more variety.

  • (Ryan) Yes, absolutely!

  • Yeah, break out of the box of MNIST and Fashion-MNIST--

  • Not that there's anything wrong with them--

  • No, no they're fantastic datasets.

  • They spurred the field on a lot, datasets always do,

  • but it's great to get, yeah, some variety.

  • Especially for beginners, you start out with these small datasets

  • but it's nice to be able to graduate

  • to sort of larger problems, larger models,

  • and now at least for the data portion,

  • you're not going to have to change very much.

  • Yeah, and total shameless self-plug here

  • but I've been teaching a course on Coursera,

  • and one of the pieces of feedback we got

  • when we were designing this course was,

  • "Hey, we just don't want to do the same old datasets again.

  • How do we get new datasets?"

  • And there was so much work involved

  • in, say, finding a dataset on Kaggle and then working on that,

  • and it was like, "No, let's see if we can build our own dataset

  • and contribute it back in."

  • So we'd have a learning dataset that somebody can use,

  • we'll contribute it back in,

  • and now the whole community can use that dataset.

  • And the process behind that was pretty seamless and pretty painless.

  • Yeah, that's fantastic.

  • Yeah, super glad that TensorFlow Datasets could be helpful there.

  • Now, I should say, I'm a little spoiled in that I had somebody help me.

  • There was a developer at Google, actually,

  • who helped me get all the metadata

  • that I needed to be able to publish my dataset into it.

  • But how easy or how difficult is it

  • if somebody wants to do it themselves?

  • Yeah, it should be really straightforward.

  • We've tried to make it as easy as possible

  • and we actually have a really detailed guide that you can follow.

  • And lots of people externally have actually now contributed datasets

  • by following this guide.

  • If you go to the tensorflow.org/datasets site

  • we have a link to a guide to add a dataset

  • and if you follow it step by step,

  • it's pretty straightforward to get one in.

  • The essential piece is to iterate through the source data.

  • So whatever the format of the source data is--

  • whether it's NumPy arrays, a binary format,

  • a pickled format, or whatever it is--

  • you just iterate through it and yield records

  • and then document exactly what the features are--

  • I have an image feature;

  • I have a class label feature, it's got these classes;

  • I've got a text feature, I want to use this vocabulary.

  • And so those two things, just specifying the metadata

  • and actually generating records out,

  • are pretty much all you need to add this dataset

  • into TensorFlow Datasets.

  • The rest is handled by the library.
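
A hedged sketch of those two pieces as a dataset builder; the class name, download URL, and read_source helper below are hypothetical, and the authoritative steps are in the guide at tensorflow.org/datasets:

```python
import tensorflow_datasets as tfds

class MyDataset(tfds.core.GeneratorBasedBuilder):
    """Hypothetical example dataset."""

    VERSION = tfds.core.Version("1.0.0")

    def _info(self):
        # Document exactly what the features are.
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                "image": tfds.features.Image(),
                "label": tfds.features.ClassLabel(names=["horse", "human"]),
            }),
        )

    def _split_generators(self, dl_manager):
        # The download URL here is a placeholder.
        path = dl_manager.download_and_extract("https://example.com/data.zip")
        return [
            tfds.core.SplitGenerator(
                name=tfds.Split.TRAIN,
                gen_kwargs={"path": path},
            ),
        ]

    def _generate_examples(self, path):
        # Iterate through the source data, whatever its format,
        # and yield (unique key, feature dict) records.
        for key, image_path, label in read_source(path):  # read_source is hypothetical
            yield key, {"image": image_path, "label": label}
```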

  • And then once it's in there, your data is famous.

  • Anybody who can pip install TensorFlow Datasets

  • can now load whatever dataset you just added.

  • Yep. I'm really excited to see how many people use my datasets.

  • (Ryan) Yes! Yeah, yeah that's right.

  • Horses or Humans, and I did a Rock, Paper, Scissors.

  • Awesome, yeah, so everybody out there

  • make sure to use these datasets

  • for your binary classification tasks.
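
For reference, those datasets are registered in TFDS as horses_or_humans and rock_paper_scissors, so loading either one is a one-liner:

```python
import tensorflow_datasets as tfds

# as_supervised=True yields (image, label) tuples ready for training.
train_ds = tfds.load("horses_or_humans", split="train", as_supervised=True)
test_ds = tfds.load("rock_paper_scissors", split="test", as_supervised=True)
```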

  • Yeah, exactly, and what was really interesting for these was that,

  • I wanted to experiment with...

  • These were image classification datasets, but I created photoreal CGI,

  • and then tried to train a neural network using that photoreal CGI

  • to see if it could classify real photos.

  • - Ah, transfer over! - Yeah.

  • Yeah fantastic. How did it go?

  • - It worked great! - Amazing!

  • So my horses and humans are all CGI

  • and then I train a binary classifier with them,

  • try to put in a picture of a real horse or a real human--

  • Oh, that's awesome. What did you use for the CGI?

  • Like how photorealistic was it,

  • and how did you get them to be so photorealistic?

  • Basically there are some rendering tools

  • that you can get, that are out there--

  • there's a whole suite of different ones

  • and I'm not going to promote any particular ones--

  • and you can use them to design CGI scenes

  • and actually render them.

  • It started as like a little science experiment

  • and then it was like, "Wow, this actually works."

  • But now TFDS, what that will do

  • is actually allow me to share that with everybody else,

  • instead of putting it on a random blog somewhere,

  • or maybe posting it on Kaggle or something like that,

  • it can actually be published through to you.

  • You have a script that says, "Here's how you download the data,

  • here's how you prepare it," and all that, yeah.

  • (Laurence) So I'm really excited to see

  • what people are going to do with this data,

  • and maybe they can train a far better classifier than I did.

  • Now are there any other datasets that are in TFDS

  • that particularly inspire you, any that excite you that--?

  • Yeah, so the ones that I'm actually most excited about right now

  • are ones that we're actively developing.

  • One of the things we're adding is support to generate data

  • in parallel with Apache Beam.

  • So this will allow us to ingest really large datasets.

  • So some of the datasets in the pipeline

  • are like all of Wikipedia in every language,

  • Common Crawl, which is a public crawl of the web,

  • and these are really exciting datasets

  • but are not really possible to use

  • without having parallel data generation.

  • But we're working on that and we'll have them really soon.
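
A hedged sketch of what that parallel-generation hook looks like with tfds.core.BeamBasedBuilder; the class name and _read_records helper are hypothetical, and _info/_split_generators are omitted for brevity:

```python
import apache_beam as beam
import tensorflow_datasets as tfds

class BigCorpus(tfds.core.BeamBasedBuilder):  # hypothetical dataset

    def _build_pcollection(self, pipeline, filepaths):
        # Instead of a plain Python generator, return a Beam PCollection of
        # (key, example) pairs; the Beam runner fans the per-shard work out
        # across many workers in parallel.
        return (
            pipeline
            | beam.Create(filepaths)
            | beam.FlatMap(_read_records)  # hypothetical: yields (key, example) pairs per shard
        )
```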

  • And so those are the things I'm most excited about

  • because enormous amounts of data

  • combined with enormous models

  • seem to be a really promising direction

  • and I'm really looking forward to seeing the sorts of results

  • that we get with giant models on giant datasets.

  • Yeah, so from the very basic entry level,

  • having more datasets that people can learn from,

  • all the way up to the extreme research level,

  • having these kinds of mega datasets that people can work with.

  • It's really interesting to see

  • how much TFDS is going to kind of power research.

  • (Ryan) Yeah. I'm really excited to see that--

  • we've gotten a lot of positive feedback and that's felt great,

  • and yeah, all the way from advanced researchers

  • to beginners just learning machine learning,

  • it seems to be a good utility library for lots of folks.

  • So anybody who wants to start using these

  • or wants to start contributing, where should they go?

  • Absolutely. If you want to contribute,

  • it's really easy to get started contributing.

  • You should show up on GitHub.

  • We have GitHub issues for all the different datasets

  • that have been requested.

  • And you can just comment on one, and we'll assign it to you,

  • and you follow the guide and you can contribute a dataset.

  • And, of course, if you just want to use it,

  • then just pip install tensorflow-datasets,

  • go to the TensorFlow website, find us on GitHub,

  • and yeah, happy datasetting.

  • Happy datasetting, I like that.

  • - So, Ryan, thanks so much - Thank you Laurence.

  • As always, this is inspiring, very informative, great stuff.

  • And thanks, everybody, for watching this episode of TensorFlow Meets.

  • If you have any questions for me or for Ryan,

  • just please leave them in the comments below,

  • and I'll put links to everything that we discussed

  • in the description for this video.

  • So thanks so much.

  • ♪ (music) ♪
