Working with TensorFlow Datasets (TensorFlow Meets)

♪ (music) ♪

(Laurence) Hi, everybody, and welcome to this episode of TensorFlow Meets. I'm Laurence Moroney, and I'm really delighted to have Ryan Sepassi from the Google AI research team. And Ryan, you work on TensorFlow Datasets, right?

(Ryan) That's right.

- Could you tell us a bit more about it?
- Absolutely. TensorFlow Datasets is a new library, it's on PyPI and GitHub, and what it aims to do is standardize the interface to a lot of public research datasets. We've already shipped with about 30 datasets, and we're adding more every day.

(Laurence) Nice. Now, one of the things I found, particularly when trying to learn TensorFlow, is that a lot of the code you see in tutorials is all about, "Here's where you go to get the data, and then you download it, and then you unzip it here, and then you put these files here, and then you get the features from here and the labels from there," and that kind of stuff. Now, part of TFDS is really to try and simplify all of that, right?

(Ryan) That's absolutely right. We noticed the same thing. For researchers and folks who do machine learning, one of the first things they have to do, right off the bat, is clean the input data. And input data comes in all sorts of formats, which of course makes sense: when you're generating data at rest to share it, it makes sense to have it in a certain format. But we as machine learning practitioners really want data in a format that's ready to feed into a machine learning pipeline. So TFDS knows the format of all the source datasets, pre-processes them, and puts them in a standard format that's ready to feed into your machine learning pipeline. It should help both advanced practitioners and people who are just trying to get started, because now it's the same API, it's like a one-liner, to get a ton of datasets.

(Laurence) Right. And one of the really nice things about that, I found, is that for people just trying to get started, all of the tutorials seem to use the same old datasets: it was MNIST handwriting, it was Fashion-MNIST, because they were relatively easy to use and it was relatively easy to get the data and actually use it. Now we can have a little bit more variety.

(Ryan) Yes, absolutely! Yeah, break out of the box of MNIST and Fashion-MNIST--

(Laurence) Not that there's anything wrong with them--

(Ryan) No, no, they're fantastic datasets. They spurred the field on a lot, datasets always do, but it's great to get some variety. Especially for beginners: you start out with these small datasets, but it's nice to be able to graduate to larger problems and larger models, and now, at least for the data portion, you're not going to have to change very much.

(Laurence) Yeah. And total shameless self-plug here, but I've been teaching a course on Coursera, and one piece of feedback when we were designing it was, "Hey, we just don't want to do the same old datasets again. How do we get new datasets?" There was so much work involved: maybe find a dataset on Kaggle and then work on that. And it was like, "No, let's see if we can build our own dataset and contribute it back in." So we have a learning dataset that somebody can use, we contribute it back in, and now the whole community can use that dataset. And the process behind that was pretty seamless and pretty painless.

(Ryan) Yeah, that's fantastic. Super glad that TensorFlow Datasets could be helpful there.
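The "one-liner" Ryan mentions looks roughly like this with the tfds API; "mnist" here is just one example name, and any published dataset works the same way:

```python
import tensorflow_datasets as tfds

# Load the MNIST train split as a tf.data.Dataset of (image, label) pairs.
ds = tfds.load("mnist", split="train", as_supervised=True)

for image, label in ds.take(1):
    print(image.shape, label.numpy())  # e.g. (28, 28, 1) and an integer class id
```

Because the result is a standard tf.data.Dataset, moving from MNIST to a larger dataset later means changing little more than the name string, which is the "graduate to larger problems" point Ryan makes above.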
(Laurence) Now, I can say I'm a little spoiled in that I had somebody help me: there was a developer at Google who helped me get all the metadata I needed to be able to publish my dataset into it. But how easy or how difficult is it if somebody wants to do it themselves?

(Ryan) Yeah, it should be really straightforward. We've tried to make it as easy as possible, and we actually have a really detailed guide that you can follow; lots of people externally have now contributed datasets by following it. If you go to the tensorflow.org/datasets site, we have a link to a guide to add a dataset, and if you follow it step by step, it's pretty straightforward to get one in. The essential piece is to iterate through the source data. So whatever the format of the source data is-- NumPy arrays, a binary format, a pickled format, whatever it is-- you just iterate through it and yield records, and then document exactly what the features are: I have an image feature; I have a class-label feature, and it's got these classes; I've got a text feature, and I want to use this vocabulary. Those two things, specifying the metadata and actually generating the records, are pretty much all you need to add a dataset to TensorFlow Datasets. The rest is handled by the library. And then once it's in there, your data is famous: anybody who can pip install TensorFlow Datasets can now load whatever dataset you just added.

(Laurence) Yep. I'm really excited to see how many people use my datasets.

(Ryan) Yes! Yeah, yeah, that's right.

(Laurence) Horses or Humans, and I did a Rock, Paper, Scissors.

(Ryan) Awesome. Yeah, so everybody out there, make sure to use these datasets-- binary classification tasks.

(Laurence) Yeah, exactly. And what was really interesting with these was what I wanted to experiment with... These were image classification, but I created photoreal CGI, to try and train a neural network on photoreal CGI and then see if it could classify real photos.

(Ryan) Ah, transfer over!

(Laurence) Yeah.

(Ryan) Yeah, fantastic. How did it go?

- It worked great! - Amazing!

(Laurence) So my horses and humans are all CGI, and then I trained a binary classifier with them and tried putting in a picture of a real horse or a real human--

(Ryan) Oh, that's awesome. What did you use for the CGI? Like, how photorealistic was it, and how did you get them to be so photorealistic?

(Laurence) Basically, there are some rendering tools out there-- there's a whole suite of different ones, and I'm not going to promote any particular one-- that you can use to design CGI and actually render it. It started as a little science experiment, and then it was like, "Wow, this actually works." But what TFDS will do now is let me share that with everybody else: instead of a random blog somewhere, or maybe posting them on Kaggle or something like that, it can actually be published through to you.

(Ryan) Have a script that says here's how you download the data, here's how you prepare it, and all that, yeah.

(Laurence) So I'm really excited to see what people are going to do with this data; maybe they can train a far better classifier than I did. Now, are there any other datasets in TFDS that particularly inspire you, any that excite you?

(Ryan) Yeah, the ones I'm actually most excited about right now are ones that we're actively developing. One of the things we're adding is support to generate data in parallel with Apache Beam, which will allow us to ingest really large datasets. Some of the datasets in the pipeline are things like all of Wikipedia in every language, and Common Crawl, which is a public crawl of the web. These are really exciting datasets, but they're not really usable without parallel data generation. We're working on that, and we'll have them really soon. Those are the things I'm most excited about, because enormous amounts of data combined with enormous models seem to be a really promising direction, and I'm really looking forward to seeing the sorts of results we get with giant models on giant datasets.
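The two pieces Ryan describes earlier-- declare the features, then iterate through the source data and yield records-- correspond to the hooks in the add-a-dataset guide. A minimal, hedged sketch of that pattern follows; the exact hook signatures have shifted across TFDS versions, and the URL, folder layout, and class names here are purely illustrative:

```python
import os
import tensorflow as tf
import tensorflow_datasets as tfds

class MyCgiDataset(tfds.core.GeneratorBasedBuilder):
    """Hypothetical image dataset; URL, classes, and layout are illustrative."""

    VERSION = tfds.core.Version("1.0.0")

    def _info(self):
        # Piece one: declare the metadata -- here an image feature and a
        # class-label feature with two named classes.
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                "image": tfds.features.Image(),
                "label": tfds.features.ClassLabel(names=["horse", "human"]),
            }),
            supervised_keys=("image", "label"),
        )

    def _split_generators(self, dl_manager):
        # Download and extract the source archive, then declare the splits.
        path = dl_manager.download_and_extract("https://example.com/my-data.zip")
        return [
            tfds.core.SplitGenerator(
                name=tfds.Split.TRAIN,
                gen_kwargs={"data_dir": path},
            ),
        ]

    def _generate_examples(self, data_dir):
        # Piece two: iterate through the source files and yield (key, record)
        # pairs matching the features declared in _info.
        for fname in tf.io.gfile.glob(os.path.join(data_dir, "*", "*.png")):
            label = os.path.basename(os.path.dirname(fname))  # class from folder
            yield fname, {"image": fname, "label": label}
```

Once a builder like this is merged, anyone can load it by name: Laurence's datasets, for instance, are published as "horses_or_humans" and "rock_paper_scissors", so tfds.load("horses_or_humans", split="train") is all a reader needs to retry his CGI-to-real-photo experiment.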
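For the Apache Beam support Ryan just described, TFDS offers a Beam-based builder variant: instead of a single Python generator, the builder returns a Beam PCollection of records, and Beam shards the generation work across many workers. A hedged sketch, assuming a sharded text corpus at a made-up bucket path; the hook names follow the tfds.core.BeamBasedBuilder pattern, which has also evolved across versions:

```python
import apache_beam as beam
import tensorflow as tf
import tensorflow_datasets as tfds

class BigCorpus(tfds.core.BeamBasedBuilder):
    """Hypothetical web-scale text corpus; the file pattern is illustrative."""

    VERSION = tfds.core.Version("1.0.0")

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({"text": tfds.features.Text()}),
        )

    def _split_generators(self, dl_manager):
        return [
            tfds.core.SplitGenerator(
                name=tfds.Split.TRAIN,
                gen_kwargs={"pattern": "gs://my-bucket/corpus-*.txt"},  # assumed
            ),
        ]

    def _build_pcollection(self, pipeline, pattern):
        # Return a PCollection of (key, example) pairs; each shard is
        # processed by a separate Beam worker, in parallel.
        def _emit(path):
            with tf.io.gfile.GFile(path) as f:
                for i, line in enumerate(f):
                    yield f"{path}-{i}", {"text": line.strip()}

        return (
            pipeline
            | beam.Create(sorted(tf.io.gfile.glob(pattern)))
            | beam.FlatMap(_emit)
        )
```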
(Laurence) Yeah. So from the very basic entry level, having more datasets that people can learn from, all the way up to the extreme research level, having these kinds of mega datasets people can work with-- it's really interesting to see how much TFDS is going to power research.

(Ryan) Yeah. I'm really excited to see that-- we've gotten a lot of positive feedback, and that's felt great. Within research, all the way from advanced researchers to beginners just learning machine learning, it seems to be a good utility library for lots of folks.

(Laurence) So anybody who wants to start using these, or wants to start contributing-- where should they go?

(Ryan) Absolutely. If you want to contribute, it's so easy to get started contributing: you should show up on GitHub. We have GitHub issues for all the different datasets that have been requested; you can just comment on one, we'll assign it to you, and you follow the guide and contribute a dataset. And, of course, if you just want to use it, then just pip install tensorflow-datasets, go to the TensorFlow website, find us on GitHub, and yeah, happy datasetting.

(Laurence) Happy datasetting, I like that. So, Ryan, thanks so much.

(Ryan) Thank you, Laurence.

(Laurence) As always, this was inspiring, very informative, great stuff. And thanks, everybody, for watching this episode of TensorFlow Meets. If you have any questions for me, or any questions for Ryan, please leave them in the comments below, and I'll put links to everything we discussed in the description for this video. So thanks so much.

♪ (music) ♪