[APPLAUSE] ZAK STONE: Thank you very much. I'm delighted to be here today to talk to you about some of the fantastic TensorFlow Research Cloud projects that we've seen around the world and to invite you to start your own. Whether you're here in the room, watching on the livestream, or watching this online afterwards, any of you are welcome to get involved with TFRC. Just very briefly, since I'm sure you've heard this all today, the context is this massive improvement in computational capabilities driven by deep learning. So deep learning, and specifically these deep neural networks, are enabling many new applications that are exciting in all sorts of ways, touching all kinds of different data, ranging from images, to speech, to text, even full scenes. And the challenge that many of you are probably grappling with is that these new capabilities come with profound increases in compute requirements. A while back, OpenAI did a study where they measured the total amount of compute required to train some of these famous machine learning models over the past several years. And the important thing to notice about this plot is that it's actually a log scale on the compute axis. So there are tremendous increases in the total amount of compute required to train these state-of-the-art deep learning models over time. And there's this consistent trend up and to the right, that these new capabilities are being unlocked by the additional compute power, as well as lots of hard work by many researchers all around the world in this open community. So unfortunately, these tremendous demands for compute to meet these new opportunities opened up by deep learning are coming to us just as Moore's law is ending. We've benefited for decades upon decades from consistent increases in single-threaded CPU performance. But all of a sudden, now we're down to maybe 3% per year. Who knows? There could always be a breakthrough. But we're not expecting the extraordinary year-upon-year gains in single-threaded performance that we've enjoyed in the past. So in response to that, we believe that specialized hardware for machine learning is the path forward for major performance wins, cost savings, and new breakthroughs across all these research domains that I mentioned earlier. Now, at Google, we've developed a family of special-purpose machine learning accelerators called Cloud TPUs. We're on our third generation now, and two of these generations are available in the cloud-- the second and the third generation. Just to give you a brief overview of the hardware that I'm going to be talking about for the rest of the session, we have these individual devices here-- Cloud TPU v2 and v3. And as you can see, we've made tremendous progress generation over generation-- 180 teraflops to 420 teraflops. We've also increased the memory from 64 gigabytes of high-bandwidth memory to 128, which matters a lot if you care about these cutting-edge natural language processing models, like BERT, or XLNet, or GPT-2. But the most important thing about Cloud TPUs isn't just these individual devices, which are the boards that you see here with the four TPU chips connected to a CPU host that's not shown. It's the fact that these devices are designed to be connected together into multi-rack machine learning supercomputers that let you scale much further, and program the whole supercomputer across these many racks as if it were a single machine.
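(A minimal sketch of that "program the whole supercomputer as a single machine" model, assuming TensorFlow 2.x and a hypothetical Cloud TPU named "my-tpu"; the same code targets a single device or a pod slice. This is an illustration, not code from the talk.)

```python
import tensorflow as tf

# Resolve and initialize the TPU system; "my-tpu" is a hypothetical resource name.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Anything built under the strategy scope is automatically replicated across all TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```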
Now, on the top here, you can see the Cloud TPU v2 Pod spanning four racks. The TPUs are in those two central columns, and the CPUs are on the outside. That machine got us to 11 and 1/2 petaflops, which you can also subdivide every which way, as you wish. And the TPU chips, in particular, are connected by this 2-D toroidal mesh network that enables ultra-fast communication. It's much faster than standard data center networking. That's a big factor in performance, especially if you care about things like model parallelism or spatial partitioning. But now, with the Cloud TPU v3 Pod, which is actually liquid-cooled, the picture wasn't big enough to hold all the racks. It spans eight racks out to the side, and it gets you up over 100 petaflops if you're using the entire machine simultaneously. On a raw op-by-op basis, that's competitive with the largest supercomputers in the world, although these TPU supercomputers use lower precision, which is appropriate for deep learning. Now, I've mentioned performance. I just wanted to quantify that briefly. In the most recent MLPerf Training version 0.6 competition, Cloud TPUs were able to outperform on-premise infrastructure. What you can see here is the TPU results in blue compared with the largest on-premise cluster results that were submitted to the MLPerf competition. And in three of the five categories that we entered, the Cloud TPUs delivered the best top-line results, including 84% improvements over the next-best entry in machine translation, which is based on the Transformer, and in object detection, which used an SSD architecture. Now, obviously, these numbers are evolving all the time. There's tremendous investment and progress in the field. But I just wanted to assure you that these TPUs can really deliver when it comes to high performance at scale. But today we're here to talk about research and expanding access to this tremendous computing power, to enable researchers all over the world to benefit from it, explore the machine learning frontier, and make their own contributions to expand it. In order to increase access to cutting-edge machine learning compute, we're thrilled to have been able to create the TensorFlow Research Cloud to accelerate open machine learning research, and hopefully to drive this feedback cycle, where more people than ever before have access to state-of-the-art tools. They make new breakthroughs, they publish papers, and blog posts, and open source code, give talks, and share the results with others. That helps even more people gain access to the frontier and benefit from it. So we're trying to drive this positive feedback loop. And so, as part of that, we've actually made well over 1,000 of these Cloud TPU devices available for free to support this open machine learning research. If you're interested in learning more right now, you can go to g.co/tfrc. I'll also have more information at the end of the talk. This pool of compute-- the TFRC cluster-- includes not just the original TPU v2 devices, but also some recently added v3 devices of the latest generation, if you're really pushing the limits. And there's the potential for Cloud TPU Pod access. If you've gone as far as you can with these individual devices, please email us, let us know, and we'll do our best to get you some access to TPU Pods. The underlying motivation of all this is a simple observation, which is that talent is equally distributed throughout the world, but opportunity is not.
And we're trying to change that balance, to make more opportunities available to talented people all around the world, wherever they might be. So we've had tremendous interest in the TFRC program so far. More than 26,000 people have contacted us interested in TFRC, and we're thrilled that we've already been able to onboard more than 1,250 researchers. And we're adding more researchers all the time, so if you haven't heard from us yet, please ping us again. We really want to support you with TFRC. The feedback loop is just starting to turn, but already I'm happy to announce that more than 30 papers in the academic community have been enabled by TFRC, and many of these researchers tell us that without the TFRC compute, they couldn't possibly have afforded to carry out this research. So I feel like, in a small way, we grabbed the lever of progress here, and we have tipped it slightly upward. So the whole field is moving just a little bit faster, and we really thank you all for being part of that. I'm most excited, though, to share some of the stories directly, of the individual researchers and the projects that they've been carrying out on the TFRC Cloud TPUs. Now, these researchers come from all over the world. I only have time to highlight four projects today, but the fantastic thing is that three of these researchers have been able to travel here to be with us in person. So you'll get to hear about their projects in their own words. We'll start with Victor Dibia, here in the upper left. Welcome, Victor. Come on up. [APPLAUSE] VICTOR DIBIA: Hi. Hello, everyone. Really excited to be here. My name is Victor Dibia. I'm originally from Nigeria. And currently, I'm a research engineer with Cloudera Fast Forward Labs in Brooklyn, New York. About a year ago, I got really fascinated by this whole area at the intersection of art and AI. And given my background and my interest in human-computer interaction and applied artificial intelligence, it was something I really wanted to do. Right about that time, I got the opportunity to have access to TFRC, and today, I'm going to talk to you about the results of those experiments, and some of the research results I had working on this. And so, why did I work on this project? As a little kid growing up in eastern Nigeria, my extended family and I would travel to our village once a year. And one of the interesting and captivating parts of those trips was something called the eastern masquerade dances of Africa. What would happen is that there are these dancers with complex, elaborate masks. And as a kid, I was really fascinated, and so this project was a way to bridge my interests in technology and the arts. And as a research engineer, I could also express my identity through a project like this. In addition to this, there's this growing area of AI-inspired art, or AI-generated art. But one thing you'll notice in that space is that most of the data sets used for this sort of exploration are mainly classical European art-- Rembrandt, Picasso. And so, a project like this is a way to diversify the conversations in that area. And then, finally, for researchers working in the generative modeling domain, one of the goals here is also to contribute more complex data sets, compared to things like faces or retail images. And it can be a really interesting way to benchmark some of the new generative models that are being researched today. So what did I do?
So as we all know, most of the effort in a machine learning project goes into the data collection phase. So I started out collecting images. I curated about 20,000 images, and then narrowed that down to about 9,300 high-quality images. And at this point, I was ready to train my model. The beautiful thing is that the TensorFlow team had made available a couple of reference models. And so, I started my experiments using a DCGAN implementation built on the TensorFlow TPUEstimator. The picture you see on the right is just a visualization of the training process for the deep convolutional GAN. It starts out as random noise, and as training progresses, it learns to generate images that are really similar to the input data distribution. And so, starting out with the reference implementation, there were two interesting things that I did. The first was to modify the configuration of the network-- modify the encoder and decoder parameters to let this thing generate larger images-- 640px, 1280px-- and then build a data input pipeline that lets me feed my data set into the model and get it trained. And so, the thing you should watch out for is to ensure that your data input pipeline matches what the reference model implementation is expecting. It took me about 60 experiments to really track down the error and fix it. So it did take a couple of days to fix that. So in all, I ran about 200 experiments. And at this point, this is where the TensorFlow Research Cloud really makes a difference. Something like this would normally take a couple of weeks to get done, but I was able to run most of these experiments, once all the bugs were fixed, within about one or two days. And so, at this point, all the images you see here look like masks. But the interesting thing is that none of them are real. None of them exist in the real world. These are all interesting artistic interpretations of what an African mask could look like. And so, what could I do next? I started to think, at this point, I have a model that does pretty well. But the question is, are the images novel? Are they actually new? Or has the model just memorized some of my input data set and regurgitated it? And so, to answer those questions I took a deep semantic search approach, where I used a pre-trained model, VGG16. I extracted features from all of my data set images and all of my generated images, and I built this interface that allows some sort of algorithmic art inspection, where for each generated image, I can find the top 20 images in the data set that are most similar to that image. So this is one way to actually inspect the results from a model like this. So going forward, the best and most stable model I was able to train could only generate 640px images. But can we do better? It turns out that you can use super-resolution GANs. And this is just one of my favorite results, where I applied super resolution using the Topaz Gigapixel AI model. Here's another interesting result. And what you probably can't see very clearly here is that there is detail in this super-resolved image that really just does not exist in the low-resolution image. So it's like a two-step interpretation using neural networks. And so, if you're a researcher, or you're an artist, or a software engineer interested in this sort of work-- ZAK STONE: Yeah, there we go. VICTOR DIBIA: Yeah, please go ahead.
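(A minimal sketch of the kind of deep semantic search Victor describes-- extracting VGG16 features and retrieving the 20 most similar training images for each generated image-- assuming Keras's pretrained VGG16 and hypothetical lists of image paths; this is not his actual code.)

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# Pretrained VGG16 as a fixed feature extractor (ImageNet weights, no classifier head).
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")

def embed(image_paths):
    """Load images, preprocess them for VGG16, and return one feature vector per image."""
    images = np.stack([
        tf.keras.preprocessing.image.img_to_array(
            tf.keras.preprocessing.image.load_img(p, target_size=(224, 224)))
        for p in image_paths])
    return extractor.predict(preprocess_input(images))

# "dataset_paths" and "generated_paths" are hypothetical lists of image file paths.
dataset_vecs = embed(dataset_paths)
generated_vecs = embed(generated_paths)

# Cosine similarity between each generated image and every training image,
# then take the 20 nearest training images to check for memorization.
dataset_vecs /= np.linalg.norm(dataset_vecs, axis=1, keepdims=True)
generated_vecs /= np.linalg.norm(generated_vecs, axis=1, keepdims=True)
similarity = generated_vecs @ dataset_vecs.T
top20 = np.argsort(-similarity, axis=1)[:, :20]
```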
All of the code I used for this is available online, and there's a blog post that goes along with it. So thank you. ZAK STONE: Thank you very much. [APPLAUSE] Thanks very much, Victor. Next up, we have Wisdom. Come on up, Wisdom. Thank you. There you go. Here you go. WISDOM D'ALMEIDA: Thank you. Hi, everyone. I'm glad to be here. I'm Wisdom. I'm from Togo. I grew up there, and I'm currently a visiting researcher at Mila in Montreal. I'm doing research in grounded language learning and natural language understanding, under the supervision of Yoshua Bengio. So for the past year, I've been interested in medical report generation. Basically, when you go to see a radiologist, you get your chest X-ray taken. And the radiologist tries, in a fraction of a second, to interpret the X-ray and produce a radiology report that mostly has two sections: findings and impressions. The findings are written observations about different regions of the chest, basically saying whether there's an abnormality in that region or not. And the impression section highlights the key clinical findings. So because this happens very fast and radiologists can make mistakes, the AI community has been thinking of ways to augment radiologists with AI capacity to provide a third eye. And we have [INAUDIBLE] classifiers that work very well for that. The problem with this classification setup is that you are going straight from the image to the labels. So where's the step where we generate the radiology report? And this is what I've been interested in. And [INAUDIBLE] like to try something like image captioning to generate the reports. Basically, you condition a language model on the input image and maximize the log-likelihood. But this doesn't work very well on medical reports, because there is nothing in this formulation that ensures clinical accuracy of the reports that are being generated, and this is a big problem. This is what I've been interested in solving, and I found inspiration in grounded language learning. So in this setting, you have a model that receives natural language instructions to achieve a task. And to correctly achieve the task, the model needs good natural language instructions. And I said, whoa, we can do the same thing for medical report generation. So on top of maximizing the log-likelihood, which is what we do in image captioning, we can also reward the language model based on how useful its output was for a medical task-- let's say classification, for instance. So here, the classifier takes the radiology report as input, and we can also add an image for superior accuracy. But what is interesting here is that in the backward pass, we are updating the language model parameters based on how useful its output was for classification. And that is a good starting point for accuracy, because we are forcing the language model to output things that have enough pertinent medical clues to ensure an accurate diagnosis. And I trained this model on the MIMIC-CXR data set, which is the largest to date, with both chest radiographs and free-text radiology reports. And to train this, I needed an extensive amount of compute. I started this project as a master's student in India. And I was training on my laptop. I know, yeah, that was painful. So I applied to TFRC to have access to TPUs, and I got it. So suddenly I had many TPUs for free to run a lot of experiments at the same time. And that was useful for iterating fast in my research, because I needed to reproduce the baselines, as well as optimize my proposed approach.
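(A toy sketch of the idea Wisdom describes-- adding a classification term so that gradients from a diagnosis classifier also update the report generator-- with made-up layer sizes and label counts; this is not his model or the MIMIC-CXR pipeline.)

```python
import tensorflow as tf

VOCAB, MAXLEN, NUM_LABELS = 1000, 40, 14   # toy sizes, not from the paper

# Toy report generator: image features condition an LSTM language model (teacher forcing).
image_in = tf.keras.Input(shape=(64,))                    # stand-in for CNN image features
tokens_in = tf.keras.Input(shape=(MAXLEN,), dtype=tf.int32)
embedded = tf.keras.layers.Embedding(VOCAB, 128)(tokens_in)
state = tf.keras.layers.Dense(128)(image_in)
hidden = tf.keras.layers.LSTM(128, return_sequences=True)(embedded, initial_state=[state, state])
token_logits = tf.keras.layers.Dense(VOCAB)(hidden)
generator = tf.keras.Model([image_in, tokens_in], token_logits)

# Toy classifier that reads the soft token distribution, so gradients can flow back into the LM.
report_in = tf.keras.Input(shape=(MAXLEN, VOCAB))
pooled = tf.keras.layers.GlobalAveragePooling1D()(tf.keras.layers.Dense(128)(report_in))
label_logits = tf.keras.layers.Dense(NUM_LABELS)(pooled)
classifier = tf.keras.Model(report_in, label_logits)

optimizer = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(image_feats, report_tokens, labels, lam=1.0):
    """`labels` is a multi-hot float tensor of shape (batch, NUM_LABELS)."""
    with tf.GradientTape() as tape:
        logits = generator([image_feats, report_tokens])
        # Standard captioning objective: next-token log-likelihood (targets shifted by one).
        nll = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=report_tokens[:, 1:], logits=logits[:, :-1]))
        # Extra term: how useful the generated report is for the diagnosis task.
        cls_logits = classifier(tf.nn.softmax(logits))
        cls_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
            labels=labels, logits=cls_logits))
        loss = nll + lam * cls_loss
    variables = generator.trainable_variables + classifier.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```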
So my estimated TFRC usage was 8,000-plus hours. I used v2 devices, v3 devices, and also got to try the v2 Pod devices. So I couldn't leave you without some results from the model. This is a case of hiatal hernia. And if you talk to radiologists, they will tell you the main evidence for hiatal hernia is the retrocardiac opacity you can observe at that green arrow. And in the red box, you see the ground-truth radiology report for this X-ray. The green box is a baseline I reproduced and optimized as best I could. But you can see that the model completely misses out on the key findings. And that's a consequence of training on the log-likelihood alone. Because you are just maximizing the confidence of the model, it will avoid taking risks, and it will tell you 80% of the time that your X-ray is fine. The blue box is my approach. And you can see that by forcing the language model to output things that are useful for classification, the model will find the right words to justify this case of hiatal hernia. So I'm very excited to have presented this work very recently at Stanford, at the Frontiers of AI-Assisted Care Scientific Symposium, organized by Fei-Fei Li and the Stanford School of Medicine. And I'm excited for what's next with this project, and I'm thankful to the TFRC team for providing the resources that were used for this work. Thanks for having me. ZAK STONE: Thank you very much, Wisdom. [APPLAUSE] That's great. Next up, we have Jade. JADE ABBOTT: Hi, everyone. I'm Jade Abbott. I'm from South Africa, so I've come a very long way to be here. I work for a company called Retro Rabbit, but I'm not actually here today to talk about what I do at work, which these days is a lot of BERT stuff. I'm here to talk about my side project-- I've picked up research as a hobby. And what we're trying to do is work on African languages-- there are a lot of them, which I'll speak about a bit later, and very little research. So what we did here is develop baseline models for at least four or five of the Southern African languages. This work feeds into a greater project which we call Masakhane. Masakhane means, "we build together," in isiZulu. And in this project we're trying to change the NLP footprint on the continent. So the problem: we have over 2,000 languages, which is quite insane. Many of these languages are exceptionally complex, some of the most complex in the world. And in contrast, we've got almost no data. If we do have data, we have to dig to find it. And what's even worse is that there's absolutely no research. So if you're a beginner NLP practitioner, currently learning about machine translation or NLP on the continent, and you do a search trying to find something in your language, there's nothing, right? You can look in some obscure journals, and you'll find maybe some old linguistic publications, and that's the extent of it. And this makes it hard if you're trying to build on models, and you're trying to spur this research. If you look at this graph, you can see the normalized paper count by country at the 2018 NLP conferences. And the more orange it is, the more papers we've got. And you see there's a glaringly empty continent in the middle there. And even at the Widening NLP workshop at ACL, the graph still doesn't look that much different. And that's meant to be inclusive of more people from around the world. So what did we do?
I like to say we took some existing data that we scrounged around and found, and we took the state-of-the-art model, and we smashed them together. They'd never seen each other, this model and this data. And what we did was then decide to actually try to optimize the NMT algorithms, just parameter-wise, to work better on these low-resourced African languages. And our goal for this is to spur that additional research-- because right now there's nothing-- and provide these baselines. And this is where TFRC came in. Like I said, this is my side project. So instead of having lots of money to do this, I had actually tried renting GPUs from a cloud provider, and they were costing me an arm and a leg. I reached out to TFRC, and they were super happy to lend us these TPUs. We basically used the Tensor2Tensor framework to train up these models. And we used government data, parallel corpora that we managed to find there. One of the things we found, which was actually simultaneously presented at ACL on a different language pair, English to German, is that optimizing the byte-pair encoding tokenization allows us to handle the agglutinative nature of these very complex low-resource languages. Now, agglutination is when a language builds up new words by just adding on more pieces, or where switching little bits completely changes the meaning. That's agglutination, and optimizing this parameter can make really significant differences in the BLEU score. And what was also great is we needed to submit something, I think in two or three weeks, to a workshop. And instead of taking days to run these experiments, it would take a couple of hours to actually build these models. So yes, thank you to TFRC for that. So just some results overview of the five languages we had. You can see that for Northern Sotho, Setswana, and Xitsonga we have almost double the BLEU score-- in particular in the case of Setswana. Bigger is better with BLEU. Whereas Afrikaans-- Afrikaans is actually a European-based language, based on Dutch-- preferred the older statistical machine translation architecture, which did better there. That, unfortunately, runs on CPU and actually takes a lot longer than we wanted it to. And isiZulu is an anomaly. It's a complex language, but we had very few sentences, and the sentences were also very messy. So in this case, the statistical translation performed slightly better. If you look at this little visualization, it's English to Afrikaans. But what's really cool is you can see the attention actually captured some of the language structures. Here, we've got a particular instance where "cannot" in Afrikaans becomes two words, "kan" and "nie," and at the end, you have to say "nie" again. It's called a double negative. I'm not sure why. And you can see that it actually does that. Less so on this screen-- for some reason there's a yellow there. You can pick out that "cannot" matches to "kan," "nie," and "nie." And those are the words, which is quite cool. Here, we've got one of our sample translations. So we've got a source sentence, the reference translation, and the Transformer translation. Obviously, very few of you in the audience are likely to speak Setswana, so we got a native speaker to translate what the Transformer generated back into English. And you can see they're talking about sunflowers, fields and lands, and flowering periods, and it's picked up "blossoming period."
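(The subword tokenization Jade mentions is what lets agglutinative words decompose into shared, reusable pieces. Below is a tiny self-contained illustration of how byte-pair-encoding merges are learned-- not the Tensor2Tensor implementation, and the word counts are invented for illustration.)

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair in the corpus."""
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = new_vocab.get(tuple(merged), 0) + count
        vocab = new_vocab
    return merges, vocab

# Invented corpus counts: related word forms share a stem, so a small merge budget
# splits them into common subword units instead of treating each form as unseen.
corpus = {"bona": 5, "bonisa": 3, "bonile": 2}
merges, segmented = learn_bpe_merges(corpus, num_merges=3)
print(merges)      # [('b', 'o'), ('bo', 'n'), ('bon', 'a')]
print(segmented)   # each word shown as its current subword pieces
```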
So you can see that it's actually done really, really, really well, despite having so little data. So yeah, this is my call to action. As I said, I'm a co-leader of a project called Masakhane-- masakhane.io. You can go check it out. And our idea is to basically change this map. This map shows the languages across the African continent that researchers contributing to the project are currently working on. And like I said, the idea is to spur research. So if you know a language from Africa, or even if you don't, and you're willing to contribute time, or resources, or advice-- we've got a lot of very junior teams who don't have supervisors or people who work in machine translation. Or even if you'd just like to come to a webinar, drop us a message. And yeah. Thank you very much to TFRC for hosting us. And, yeah, I look forward to what else we can actually build. ZAK STONE: Thanks so much, Jade. [APPLAUSE] Thank you. Oh, let me get the clicker too. Jade? Thank you so much. Let's have a round of applause for Victor, Wisdom, and Jade for coming to represent their research. [APPLAUSE] Thank you so much. It's really a pleasure to have you here. So there's one more project I want to show. Jonathan wasn't able to be here in person, but this is just fantastic work, and I wanted to showcase it. So Jonathan and his colleagues at MIT won the best paper award at ICLR with a paper called "The Lottery Ticket Hypothesis," where they're looking for these sparse trainable sub-networks within larger neural networks. Now, that had nothing to do with TFRC, but it was this really interesting idea. Many of the neural networks that we're used to have a tremendous number of parameters. They're very large, and one thing that the OpenAI graph earlier didn't show is that these neural networks are getting larger and larger over time. There's generally a correlation between larger model sizes and higher accuracy, as long as you have enough training data. But it takes more and more compute power, again, to train these larger and larger networks. So Jonathan and his colleagues asked this question: what if you could find just the right sub-network in this much larger network, one that could perform the same task as the larger network, ideally to the same accuracy? Those sub-networks are somewhat whimsically called lottery tickets. So at ICLR, Jonathan used small networks, because that's what he could afford, to show some initial encouraging evidence for this hypothesis. But the really interesting part of this research, at least from my perspective, since I'm into Big Compute, is: does it work at scale, right? And so, to find that out, Jonathan got in touch with us here at TFRC to try to scale up this work. And he was kind enough to say, at the bottom here, that for his group, research at this scale would be impossible without TPUs. So let me share a little bit more about his work and then about his findings. So the lottery ticket hypothesis, as I mentioned before, is related to this broader category of techniques called pruning. And to be clear, there are many approaches to pruning neural networks, but most of these approaches take place after the networks have been trained. So you've already spent the compute time and cost to get the network trained, and then you modify the trained model to try to set weights to zero, or reduce the size of the model, or, as another approach, distill it into a smaller model. But it's interesting to ask, could you just train a smaller network from the start?
Could you prune connections early on, maybe at the very beginning, or at least early in the training process, without affecting the learning too much? So like I said, this initial paper showed some very promising results on small networks and small data sets. So with the TFRC Cloud TPUs, Jonathan took this to models we're all familiar with-- ResNet-50 trained on ImageNet-- and he found slightly different behavior, but was able to validate the hypothesis. So you can't go all the way back to the beginning, you can't prune all the weights-- at least with current understanding, there may be further breakthroughs-- but you can go almost back to the first epoch, cut the network down, and then train from there with a much smaller network without any harm to accuracy. In particular, Jonathan found that with ResNet-50, you could remove 80% of the parameters at epoch 4 and not hurt the accuracy at all. And you're training to something like 90 epochs or further. So this is a real compute savings. And there are these plots down below with ResNet-50 and Inception showing this rewind epoch, and showing that the test error stays low as long as the rewind epoch is around 3 or 4 or later. One thing I really appreciated about Jonathan's work is, in addition to carrying out these experiments, and publishing them, and sharing them with the community, and inspiring other research, he built some interesting tools to help manage all the compute power. So what you're seeing here is actually a Google Sheet-- a spreadsheet that Jonathan wired up with scripts to orchestrate all of his experiments. So this was a fully declarative system, at the end of the day. He could add a row to the spreadsheet, and behind the scenes his scripts would kick off a new experiment, monitor the results, bring them back into the spreadsheet, and flag error conditions if anything had gone wrong. And at the end of the day, this spreadsheet had thousands upon thousands of rows, showcasing all the different experiments, which were then searchable, and sharable, and usable in all the ways that a spreadsheet is. So I thought this was a great mix of old technology and new. And this was a serious amount of compute. Jonathan estimates that he used at least 40,000 hours of Cloud TPU compute on TFRC. So I hope that underscores that we're really serious about providing a large amount of compute for you to do things that you couldn't do otherwise, and then share them with the research community. So it's not just about the projects that you've heard today. These are just samples of the thousands of researchers who are working on TFRC. And I'd really like to personally encourage all of you to think about your next research project happening on TFRC. It can be an academic project, or a side project, or an art project-- as long as it's intended to benefit the community, as long as you're going to share your work with others, make it open, and help accelerate progress in the field, we'd love to hear from you. So if you're interested in getting started right now, you can visit the link below-- g.co/tputalk-- and enter code TFWORLD. That'll go straight to us and the organizers of this event. And we're happy to make available, as a starting point, five regular Cloud TPUs and 20 preemptible Cloud TPUs for several months for free. The rest of Google Cloud services still cost money, so this is not completely free, but the TPUs are the overwhelming majority of the compute cost for most of these compute-intensive projects.
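(A toy sketch of the prune-and-rewind recipe described for the lottery ticket work above-- train briefly, snapshot the early-epoch weights, finish training, keep only the largest-magnitude weights, rewind, and retrain the surviving sub-network-- using a generic Keras model and random data as hypothetical stand-ins rather than Jonathan's actual ResNet-50 setup.)

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the real model and data.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10)])
model.compile("adam", tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
train_ds = tf.data.Dataset.from_tensor_slices(
    (np.random.rand(512, 32, 32, 3).astype("float32"),
     np.random.randint(0, 10, 512))).batch(64)

def magnitude_masks(keras_model, sparsity=0.8):
    """Build 0/1 masks that zero out the smallest-magnitude `sparsity` fraction of kernel weights."""
    kernels = [w for w in keras_model.trainable_weights if "kernel" in w.name]
    threshold = np.quantile(np.concatenate([np.abs(k.numpy()).ravel() for k in kernels]), sparsity)
    return {k.name: (np.abs(k.numpy()) >= threshold).astype("float32") for k in kernels}

def apply_masks(keras_model, masks):
    for w in keras_model.trainable_weights:
        if w.name in masks:
            w.assign(w * masks[w.name])

# 1. Train briefly and snapshot the "rewind point" weights (around epoch 3-4).
model.fit(train_ds, epochs=4, verbose=0)
rewind_weights = [w.copy() for w in model.get_weights()]

# 2. Continue training, then score weights by magnitude to pick the sparse sub-network.
model.fit(train_ds, epochs=16, verbose=0)
masks = magnitude_masks(model, sparsity=0.8)

# 3. Rewind to the early-epoch weights, prune, and retrain only the surviving connections
#    (re-zeroing pruned weights each epoch approximates freezing them at zero).
model.set_weights(rewind_weights)
apply_masks(model, masks)
for _ in range(16):
    model.fit(train_ds, epochs=1, verbose=0)
    apply_masks(model, masks)
```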
And so, we really hope that this enables things that you couldn't do otherwise. And if you get to the limits of what you can do with this initial quota, please reach out. Let us know. Tell us about what you're doing. Tell us what you'd like to do. We can't promise anything, but we'll do our best to help with more, maybe even a lot more compute capacity, including the access to Pods that I mentioned earlier. So thanks again to all of you for being here today, and for joining on the livestream or online. Thanks to our speakers for representing their work in person. We'll all be happy to hang out here afterwards and answer any questions those of you here in the room might have. Please rate this session in the O'Reilly events app. And thank you all very much. Hope you're enjoying TF World. [APPLAUSE]