WAHID BHIMJI: OK, so I'm Wahid. I'm not actually part of Google. I'm at Lawrence Berkeley National Lab. And I'm sort of going to tell you how we're using TensorFlow to perform deep learning for the fundamental sciences and also using high-performance computing. OK, so the fundamental sciences-- particle physics, cosmology, and lots of other things I'll explain in a minute-- make heavy use of high-performance computing at the center I work at, which is part of the Department of Energy, initially for simulation and data analysis, or traditionally. But progress in deep learning and tools like TensorFlow has really enabled the use of higher-dimensional data and, through deep learning, enabled the possibility of new discoveries, faster computation, and actually whole new approaches. And I'm going to talk about that here, illustrating with a few examples of stuff we're running at NERSC. So what is NERSC? It's the high-performance computing center for the Department of Energy Office of Science, which means we support the whole breadth of science that the DOE does, which actually includes not just cosmology or what you might think of as energy research, like batteries and so forth, but also materials and climate and genomics and things like this. So we have a huge range of users and a vast range of projects across a whole variety of science. And we have big machines. Cori, our latest machine, was number five, and, in fact, the highest by petaflops in the US when it was installed a couple of years ago. But things drop down those rankings, and now it's number 10 in the TOP500. OK, so we see the use of AI now across the whole science domain. So you probably can't see this slide very well, but this is a take on an industry chart of machine learning, sort of splitting it into supervised learning and unsupervised learning, and classification and regression, and so forth.
And there's kind of examples here that we see across the sciences. But I'll be mostly talking about particle physics and cosmology because actually that's my background. So I'm more comfortable with those examples. So what we're trying to do in these fields is really uncover the secrets of the universe. So this is a sort of evolution from the Big Bang to the present day. And there's planets and galaxies and stuff. And they've all been influenced by, for example, dark matter and dark energy over this evolution. So obviously our understanding of this has come a long way in recent years. But there's still plenty of mysteries and things that we don't know about, like the very things I just mentioned, like dark matter. What is the nature of dark matter? And what is the relationship between particle physics, which explains extremely well the very small, and even everything around us, and yet breaks down at cosmological scales? So in order to answer those kind of questions, we have huge complex instruments, such as the planned LSST telescope on the left, which is going to look at the very big at resolutions that are currently unprecedented. And the ATLAS detector on the right, which is at the Large Hadron Collider, at the Swiss-French border. It's a detector the size of a building. There's little people there. You can sort of see them. And it has hundreds of millions of channels of electronics to record collisions that occur in the middle every 25 nanoseconds, so a huge stream of data. So both of these experiments have vast streams of data. Really the ATLAS experiment has processed exabytes of data over its time. And this has to be filtered through a process of data analysis and so forth.
And so if you get like high-resolution images or very detailed detector outputs, the first stage is to kind of simplify these, maybe build catalogs of objects in the sky, such as stars and galaxies, or in the case of particle physics to kind of combine these into what particles might have been produced, and these lines or tracks and deposits that have occurred in the detector. So this obviously also involves a large amount of computing. So computing fits in here. But computing also fits in because the way that these analyses are done is to compare with simulated data, in this case cosmology simulations that are big HPC simulations done for different types of universes that might have existed depending on different cosmology parameters. And in the particle physics case, you do extremely detailed simulations because of the precision you require in terms of how the detector would have reacted to, for example, a particle coming in here. So here you've got all this kind of showering of what would have happened inside the detector. And then from each of these, you might produce summary statistics and compare one to the other and what are secrets of the universe, I guess, such as the nature of dark matter, or new particles at the LHC. OK, so you might have guessed there's many areas where deep learning can help with this. So one is classification to, for example, find those physics objects that I showed identified in collisions at the LHC. Or, indeed, just to directly find from the raw data what was interesting and what was not interesting. Another way you might use it is regression to, for example, find what kind of energies were deposited inside the detector or what were the physics parameters that were responsible for stuff that you saw in the images at the telescopes. 
Another is sort of clustering and feature detection in a more unsupervised way, where you might want to look for anomalies in the data, either because these are signs of new physics or because they're actually problems with the instruments. And then another last way, perhaps, is to generate data to replace the full physics simulations that I just described, which are, as I mentioned, extremely computationally expensive. So I'll give a few examples across these domains, OK? So the first is classification. So there you're trying to answer the question, is this new physics? For example, supersymmetry, which is also a dark matter candidate. So I could give you the answer later whether that's actually new physics or not. So here the idea-- there's several papers now that are exploiting this kind of idea-- is to take the detector, like the ATLAS detector, and sort of unroll it. So it's a cylindrical detector with many cylindrical layers, and you take that and unroll it into an image where you have phi along here, which is the angle around this direction, and eta this way, which is a physicist's way of describing the forwardness of the deposit. And then just simply put it in an image, and then you get the ability to use all of the kind of developments in image recognition, such as convolutional neural networks. So here's what is now a relatively simple sort of convolutional network, several convolution and pooling layers, and then fully connected layers. And we've exploited this at different scales, either taking 64 by 64 images to represent this, or 224 by 224, sort of closer to the resolution of the detector. And then you can also have multiple channels which correspond to different layers of this cylindrical detector. And we saw that it kind of works. So here you have a ROC curve of the two-class probability, so true positive rate against false positive rate.
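As an illustration of the unrolling described above, the following is a minimal sketch in plain Python of binning detector deposits into an eta-phi image. The function name `deposits_to_image`, the acceptance `eta_max=2.5`, and the example deposits are all assumptions made for illustration; they are not taken from the talk or from ATLAS software.

```python
import math

def deposits_to_image(deposits, n_eta=64, n_phi=64, eta_max=2.5):
    """Bin (eta, phi, energy) deposits into an n_eta x n_phi image.

    phi is the azimuthal angle, wrapped into [-pi, pi); eta is the
    "forwardness" coordinate, kept only for |eta| < eta_max
    (an assumed acceptance, chosen for this sketch).
    """
    image = [[0.0] * n_phi for _ in range(n_eta)]
    for eta, phi, energy in deposits:
        if not -eta_max <= eta < eta_max:
            continue  # outside the assumed acceptance
        # Wrap phi so the cylinder unrolls into a flat axis.
        phi = (phi + math.pi) % (2 * math.pi) - math.pi
        row = int((eta + eta_max) / (2 * eta_max) * n_eta)
        col = int((phi + math.pi) / (2 * math.pi) * n_phi)
        # Guard against floating-point round-up at the upper bin edge.
        row = min(row, n_eta - 1)
        col = min(col, n_phi - 1)
        image[row][col] += energy  # sum energy landing in the same pixel
    return image

# Hypothetical deposits: (eta, phi, energy)
deposits = [(0.0, 0.0, 10.0), (0.0, 0.0, 5.0), (-2.4, 3.0, 2.0), (3.0, 0.0, 99.0)]
image = deposits_to_image(deposits, n_eta=4, n_phi=4)
```

In this sketch, one such image would be built per detector layer and the images stacked as channels before being passed to a convolutional network.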
So the first thing to notice is that you need a very, very high rejection of the physics that you already know about because that vastly dominates. So actually, the data's even been prefiltered before this. So you need a low false positive rate. And generally, of course, in these ROC curves, higher this way is better. So this point represents the physics selections that were traditionally used in this analysis. These curves here incorporate those higher-level physics variables, but in shallow neural net-- in shallow machine learning approaches. And then the blue line shows that you can gain quite a lot, and this is from using convolutional neural networks, which give you access to the full data. It's not just the technique of deep learning, but also being able to use all the data that's available. And then, similarly, there's another boost from using three channels, which correspond to the other detectors. OK, so I'd just like to caveat a few of these results. So even though there's obvious gains to be had here, these methods are still not currently used in the Large Hadron Collider analyses. And part of the reason for that is because this is trained on simulation, and it might pick up on very small aspects of the simulation that differ from the real data. So there are several pieces of work carrying on from this to look at how to incorporate real data into what's done. But also, I think methodological developments in how to interrogate these models, and really discover what they're learning, would be kind of useful for the field. OK, so taking a regression problem now, the kind of question you might ask is, what possible universe would look like this? So this is an image from the Sloan Digital Sky Survey. So here the Earth's in the middle, and increasing redshift takes you out this way. So this is like the older universe over here. And you can kind of see. And the histogram is a structure of galaxies' density.
So there's like more-- you can see that structure sort of appears as the universe evolves. And that kind of evolution of structure tells you something about the cosmology parameters that were involved. So can you actually regress those parameters from looking at these kind of distributions? So the idea here was to take a method that was developed by these people in CMU, which is to use, again, a convolutional neural network. But here, running on a 3D-- this is actually simulated data, so 3D distribution of dark matter in the universe. And these can be large datasets. So part of the work we did here was to kind of scale this up and run on extremely large data and across the kind of machines we have at Cori on 8,000 CPU nodes, which is kind of some of the largest scale that TensorFlow has been run on in a data parallel fashion, and then being able to predict cosmology parameters that went into this simulation in a matter of minutes because of the scale of computation used. So, again, it's a fairly standard network, but this time in 3D. And actually, there was quite a bit of work to get this to work well on CPU, for example. Because that's-- sorry, I should have mentioned, actually, that the machine that we have is primarily composed of Intel Knights Landing CPU. So that's another area that we help develop. OK, so again, it kind of works. And so here they like to plot the true value of the parameter that's being regressed against the predicted value. So you hope that it would match up along the line. And, indeed, the points which come from a run of this regression do actually lie across this line in all the cases. Now, the actual crosses come from the larger-scale run. So the points come from a 2,000 node run of the network, and the crosses come from an 8,000 node run. 
So the sort of caveat here is that the 8,000 node run doesn't actually do as well as the 2,000 node run, which kind of indicates a point, which is another thing we're working on about convergence of these neural networks when running at a large distributed scale. So I'll come back to that a bit later as well. OK, so then the last example I have here is around generation. So, as I mentioned, simulations went into that previous analysis, and they go into all of the kind of analyses that are done at these-- for these kind of purposes. But in order to get some of these simulations, it actually takes two weeks of computational time on a large supercomputer of the scale of Cori, which is kind of a huge amount of resource to be giving over to these. And for different versions of the cosmology that you want to generate, you need to do more of these simulations. And part of the output that you might get from these is, this is a 2D mass map, which corresponds to the actual galaxy distribution that you might observe from data. So this is what you would want to compare with data. So the question is, is it possible to augment your data set and generate different cosmologies in a kind of fast simulation that doesn't require running this full infrastructure? And the way that we tried to do that was to use generative adversarial networks that were mentioned briefly in the last talk, where you have two networks that work in tandem with each other and that are optimized together, but against each other, and one is trying to tell the difference between. So the advantage here is that we have some real examples from the full simulation to compare with. And the discriminator can take those and say, does the generator do any good job of producing fake maps? And we used a pretty standard DCGAN architecture. And there was a few modifications in this paper to make that work. But at the time last year there weren't many people trying to apply this. 
One of the other applications that was being done at the time is actually to the particle physics problem, from other people at Berkeley Lab. And, again, it works. And so the top plot is a validation set of images that weren't used in the training. And the bottom is the generated images. I mean, the first time cosmologists saw this, they were kind of pretty well surprised that they couldn't tell the difference between them, because they weren't really expecting us to be able to do so well that they wouldn't be able to tell by eye. But you certainly can't tell by eye. But one of the advantages of working with this in science, as opposed to celebrity faces, is that we do actually have good metrics for determining whether we've done well enough. So the top right is the power spectrum, which is something often used by cosmologists, but it's just the Fourier transform of a two-point correlation. So it sort of represents the Gaussian fluctuations in the plot. And here you can see the black is the validation, and the pink is the GAN. And so it not only agrees on the mean, this kind of middle line, but it also captures the distribution of this variable. And it's not just two-point correlations. The plot on the right shows something called the Minkowski functional, which is a form of three-point correlation. So even non-Gaussian structures in these maps are reproduced. And the important point, I guess, is that you could just sample from these distributions and reproduce those well. But all this was trained on was to reproduce the images. And it got these kind of structures, the physics that's important, right? OK, so this is very promising. But obviously, the holy grail of this is really to be able to do this for different values of the initial cosmology, sort of parameterized generation, and that's what we're working towards now. OK, so I mentioned throughout that we have these big computers.
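The remark above that the power spectrum is "just the Fourier transform of a two-point correlation" can be checked numerically. The sketch below is a deliberately simplified 1D, periodic version in plain Python (the maps in the talk are 2D); the field values are made up, and the naive `dft` helper is written out only to keep the example self-contained.

```python
import cmath

def dft(values):
    """Naive discrete Fourier transform, O(N^2); fine for a tiny example."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * j / n) for j, v in enumerate(values))
        for k in range(n)
    ]

def two_point_correlation(field):
    """Circular two-point correlation xi(r) = mean over x of f(x) * f(x + r)."""
    n = len(field)
    return [
        sum(field[x] * field[(x + r) % n] for x in range(n)) / n
        for r in range(n)
    ]

def power_spectrum(field):
    """Power spectrum P(k) = |F(k)|^2 / N, computed directly from the field."""
    n = len(field)
    return [abs(fk) ** 2 / n for fk in dft(field)]

# Made-up density fluctuations on a periodic 1D grid.
field = [0.3, -1.2, 0.8, 0.1, -0.4, 1.5, -0.9, -0.2]

direct = power_spectrum(field)
via_correlation = [fk.real for fk in dft(two_point_correlation(field))]

# The two routes agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(direct, via_correlation))
```

Comparing such a spectrum between validation maps and generated maps is the kind of quantitative check the talk describes, in addition to inspecting images by eye.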
So another part of what we're trying to do is use extreme computing scales in a data parallel way to train these things faster, so not just different hyperparameters, which, of course, people also do, training different hyperparameters on different nodes in our computer, but also just to train one model quicker. And we've done that for all of these examples here, so the one on the left is the LHC CNN, which I described first. So I should say also that we have a large Cray machine that has an optimized high-performance network. But that network has been particularly optimized for HPC simulations and stuff and MPI-based architectures. So really it's been a huge gain for this kind of work that these plug-ins now exist that allow you to do MPI-distributed training, such as Horovod from Uber. And also Cray have developed a machine-learning plug-in here. So we used that for this problem. And you can basically see that all the lines kind of follow the ideal scaling up to thousands of nodes. And so you can really process more data faster. So there's quite a bit of engineering work that goes into this, of course. And this is shown in the papers here. For the LHC CNN, there's still a gap from the ideal. And that's partly because it's not such a computationally intensive model. So I/O and stuff becomes more important. For the CosmoGAN on the right, that's the GAN example. And there it is a bit more computationally intensive. So it does follow the scaling curve here. And CosmoFlow I just put in the middle here. I mean, here we show that I/O was important. But we were able to exploit something we have, which is this burst buffer, which is a layer of SSDs that sits on the high-speed network. And we were much more able to scale up to the scale of the full machine, 8,000 nodes, using that than the shared disk-based file system.
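The data-parallel training described above can be sketched as a toy simulation: each worker holds a replica of the model, computes a gradient on its own shard of the data, and an allreduce averages the gradients so that every replica takes the same step. This is plain Python with simulated workers; it does not use Horovod, the Cray plug-in, or MPI, and `allreduce_mean`, `shard_gradient`, and the toy dataset are assumptions for illustration only.

```python
def shard_gradient(w, shard):
    """Gradient of mean squared error for y ~ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an MPI allreduce: every worker receives the average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def train(data, n_workers, steps=200, lr=0.01):
    # Equal-sized shards, one per simulated worker.
    shards = [data[i::n_workers] for i in range(n_workers)]
    weights = [0.0] * n_workers  # each worker keeps its own model replica
    for _ in range(steps):
        # Local gradients are computed independently on each shard.
        grads = [shard_gradient(w, s) for w, s in zip(weights, shards)]
        # Synchronization point: average gradients across workers.
        averaged = allreduce_mean(grads)
        # Every replica applies the same update, so replicas stay identical.
        weights = [w - lr * g for w, g in zip(weights, averaged)]
    return weights

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # toy data with true slope 3
replicas = train(data, n_workers=4)
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so adding workers increases the data processed per step. In a real system the allreduce and the I/O feeding each worker are where departures from ideal scaling tend to appear, consistent with the gap mentioned for the less compute-intensive model.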
OK, so another area that I just wanted to briefly mention at the end is that we have these fancy supercomputers, but as I mentioned, this takes a lot of engineering. And those were projects that we worked particularly closely with. Something that we really want to do is allow people to use the supercomputer scale of kind of deep learning via Jupyter notebooks, which is really how scientists prefer to interact with our machines these days. So we do provide JupyterHub at NERSC. But generally what people are running sits on a dedicated machine outside the main compute infrastructure. But what we did here was to enable them to run stuff actually on the supercomputing machine, either using ipyparallel or Dask, which are tools used for distributed computing, but interfacing with this via the notebook, and where a lot of the heavy computation actually occurs with the MPI backend in Horovod and stuff. So we were able to show that you can scale to large scales here without adding any extra overhead from being able to interact. And then, of course, you can add nice Jupyter things, like widgets and buttons, so that you can run different hyperparameter trials and sort of click on them and display them in the pane there. OK, so I just have my conclusion now. Basically deep learning, particularly in combination with high-performance computing and productive software like TensorFlow, can really accelerate science. And we've seen that in various examples. I only mentioned a few here. But it requires developments, not only in methods. So we have other projects where we're working with machine-learning researchers to develop new methods. Also, there's different ways of applying these, of course. And we're well-placed to do that. And also well-placed to do some of the computing work. But it can really benefit from collaboration, I think, between scientists and the industry, which is sort of better represented here. And we've had a good relationship with Google there.
But I think we can also do that with others. So basically my last slide is a call to help. If you have any questions, but also if you have ideas or collaborations or want to work with us on these problems, that would be good to hear about. Thanks. [APPLAUSE] AUDIENCE: [INAUDIBLE] WAHID BHIMJI: Yeah. Well, what was I going to say about supersymmetry? AUDIENCE: [INAUDIBLE] [LAUGHS] WAHID BHIMJI: Well, yeah, I don't think it's real. But no, I guess I was going to say whether or not-- so this example isn't supersymmetry. I can tell you that. But yeah, part of the point of this network, I guess, is it vastly improves our sensitivity to supersymmetry. So hopefully it could help answer the question if it's real or not. But yeah, certainly there's no current evidence at the Large Hadron Collider, so that's why we keep looking. And also I would say that some of these approaches also might help you look in different ways, for example, not being so sensitive to the model that theorists have come up with. So this approach really trains on simulated samples of a particular model that you're looking out for. But some of the ways that we're trying to extend this is to be more of a sort of anomaly detection, looking for things that you might not have expected. AUDIENCE: [INAUDIBLE] WAHID BHIMJI: Yeah, so the thing with these big collaborations that work at the LHC is they're very sensitive about you working on the real data. So, for example, someone who's coming from outside can't just apply this model on the real data. You have to work within the collaboration. So the reason why they don't use things like this in the collaboration at the moment is partly just because of the kind of rigor that goes into cross-validating and checking that all these models are correct. And that just takes a bit of time for it to percolate. There's a little bit of a skepticism amongst certain people about new ideas, I guess. 
And so I think the fact that this has now been demonstrated in several studies sort of adds a bit of mitigation to that. But then the other reason, I think, is that there's another practical reason, which is that this was exploiting kind of the full raw data. And apart from-- there are sort of practical reasons why you wouldn't want to do that, because these are large hundreds of petabyte datasets. And so filtering, in terms of these high-level physics variables, is not only done because they think they know the physics, but also because of practicality of storing the data. So those are a bunch of reasons. But then I gave another sort of technical reason, which is that you might be more sensitive to this, to mismodeling in the training data. And they really care about systematic uncertainties and so forth. And those are difficult to model when you don't know what it is you're trying to model for, I guess, because you don't know what the network necessarily picked up on. And so there's various ways to mitigate some of these technical challenges, including sort of mixing of data with simulated data, and also using different samples that don't necessarily have all the same modeling effects, and things like that. So there's a part that's cultural, but there's also a part that's technical. And so I think we can try and address the technical issues. AUDIENCE: [INAUDIBLE] WAHID BHIMJI: Not directly this model. But yeah, there is certainly a huge range of work. So we mostly work with the Department of Energy science. And there's projects across all of here. So, I mean, we can only work in depth, I guess, with a few projects. So we have a sort of few projects where we work in depth. But really part of what our group does at NERSC is also to make these tools like TensorFlow working at scale on the computer available to the whole science community. 
And so we're having more and more training events, and the community are picking up-- different communities are picking this up more by themselves. And that really allows-- so there certainly will be groups working with other parts of physics, I think. AUDIENCE: [INAUDIBLE] WAHID BHIMJI: Why deep learning, right? I mean, it's not really the only approach. So like this actual diagram doesn't just refer to deep learning. So some of these projects around the edge here are not using deep learning. My talk was just about a few deep learning examples. So it's not necessarily the be all and end all. I think one of the advantages is that there is a huge amount of development work in methods in things like convolutional neural networks and stuff that we can exploit. Relatively off the shelf, they're already available in tools like TensorFlow and stuff. So I mean, I think that's one advantage. The other advantage is there is quite a large amount of data. There are nonlinear features and stuff that can be captured by these more detailed models and so forth. So I don't know. It's not necessarily the only approach. AUDIENCE: [INAUDIBLE] WAHID BHIMJI: Yeah, so, I mean, there certainly are cases where the current models are being well tuned, and they are right. But there's many cases where you can get an advantage. I mean, you might think that the physics selections here, they're well tuned over certain years. They know what the variables are to look for. But there is a performance gain. And it can just be from the fact that you're exploiting more information in the event that you were otherwise throwing away, because you thought you knew everything that was important about those events, but actually there was stuff that leaked outside, if you like, the data that you were looking at. So there are gains to be had I think, yeah.
Deep learning for fundamental sciences using high-performance computing (O’Reilly AI Conference), posted on 2020/04/04