  • WAHID BHIMJI: OK, so I'm Wahid.

  • I'm not actually part of Google.

  • I'm at Lawrence Berkeley National Lab.

  • And I'm sort of going to tell you

  • how we're using TensorFlow to perform

  • deep learning for the fundamental sciences

  • and also using high-performance computing.

  • OK, so the fundamental sciences--

  • particle physics, cosmology, and lots

  • of other things I'll explain in a minute--

  • make heavy use of high-performance computing

  • at the center I work at, which is

  • part of the Department of Energy,

  • traditionally for simulation and data analysis.

  • But progress in deep learning and tools like TensorFlow

  • have really enabled the use of higher-dimensional data

  • and opened up the possibility

  • of new discoveries, faster computation, and actually

  • whole new approaches.

  • And I'm going to talk about that here,

  • illustrating with a few examples of stuff

  • we're running at NERSC.

  • So what is NERSC?

  • It's the high-performance computing center

  • for the Department of Energy Office of Science,

  • which means we support the whole breadth of science

  • that the DoE does, which actually includes things like

  • not just cosmology or what you might think of as energy

  • research, like batteries and so forth, but also

  • materials and climate and genomics and things like this.

  • So we have a huge range of users and a vast range

  • of projects across a whole variety of science.

  • And we have big machines.

  • Cori, our latest machine, was number five and, in fact,

  • the highest by peak flops in the US when it was installed

  • a couple of years ago.

  • But newer machines have since pushed it down the list,

  • and now it's number 10 in the Top500.

  • OK, so we see the use of AI now across the whole science

  • domain.

  • So you probably can't see this slide very well,

  • but this is a take on an industry view of machine

  • learning, sort of splitting it into supervised learning

  • and unsupervised learning, and classification and regression,

  • and so forth.

  • And there's kind of examples here

  • that we see across the sciences.

  • But I'll be mostly talking about particle physics and cosmology

  • because actually that's my background.

  • So I'm more comfortable with those examples.

  • So what we're trying to do in these fields

  • is really uncover the secrets of the universe.

  • So this is a sort of evolution from the Big

  • Bang to the present day.

  • And there's planets and galaxies and stuff.

  • And they've all been influenced by, for example,

  • dark matter and dark energy over this evolution.

  • So obviously our understanding of this

  • has come a long way in recent years.

  • But there's still plenty of mysteries and things

  • that we don't know about, like the very things I just

  • mentioned, like dark matter.

  • What is the nature of dark matter?

  • And what is the relationship between particle physics that

  • explains extremely well the very small, and even

  • everything around us, but yet, breaks down

  • at cosmological scales?

  • So in order to answer those kind of questions,

  • we have huge complex instruments,

  • such as the planned LSST telescope on the left,

  • which is going to look at the very big at resolutions

  • that are currently unprecedented.

  • And the ATLAS Detector at the LHC

  • on the right, which is at the Large Hadron Collider,

  • so on the Swiss-French border, it's

  • a detector the size of a building.

  • There's little people there.

  • You can sort of see them.

  • And it has hundreds of millions of channels of electronics

  • to record collisions that occur in the middle every 25

  • nanoseconds, so a huge stream of data.

  • So both of these experiments have vast streams of data.

  • Really the ATLAS experiment has processed exabytes

  • of data over its time.

  • And this has to be filtered through a process of data

  • analysis and so forth.

  • And so if you get like high-resolution images or very

  • detailed detector outputs, the first stage

  • is to kind of simplify these, maybe

  • build catalogs of objects in the sky,

  • such as stars and galaxies, or in the case of particle physics

  • to kind of combine these into what particles might have been

  • produced, and these lines or tracks and deposits that

  • have occurred in the detector.

  • So this obviously also involves a large amount of computing.

  • So computing fits in here.

  • But computing also fits in because the way

  • that these analyses are done is to compare with simulated data,

  • in this case cosmology simulations that

  • are big HPC simulations done for different types of universes

  • that might have existed depending

  • on different cosmology parameters.

  • And in the particle physics case,

  • you do extremely detailed simulations

  • because of the precision you require

  • in terms of how the detector would

  • have reacted to, for example, a particle coming in here.

  • So here you've got all this kind of showering

  • of what would have happened inside the detector.

  • And then from each of these, you might

  • produce summary statistics and compare one to the other

  • and learn the secrets of the universe,

  • I guess, such as the nature of dark matter,

  • or new particles at the LHC.

  • OK, so you might have guessed there's

  • many areas where deep learning can help with this.

  • So one is classification to, for example,

  • find those physics objects that I showed identified

  • in collisions at the LHC.

  • Or, indeed, just to directly find

  • from the raw data what was interesting and what

  • was not interesting.

  • Another way you might use it is regression to, for example,

  • find what kind of energies were deposited inside the detector

  • or what were the physics parameters that

  • were responsible for stuff that you saw in the images

  • at the telescopes.

  • Another is sort of clustering feature detection

  • in a more unsupervised way, where

  • you might want to look for anomalies in the data,

  • either because these are signs of new physics

  • or because they're actually problems with the instruments.

  • And then another last way, perhaps,

  • is to generate data to replace the full physics simulations

  • that I just described, which are, as I mentioned,

  • extremely computationally expensive.

  • So I'll give a few examples across these domains, OK?

  • So the first is classification.

  • So there you're trying to answer the question,

  • is this new physics?

  • For example, supersymmetry, which

  • is also a dark matter candidate.

  • So I could give you the answer later

  • whether that's actually new physics or not.

  • So the idea here-- and

  • there's several papers now that are exploiting

  • this kind of idea-- is to take the detector,

  • like the ATLAS detector, and sort of unroll it--

  • so it's a cylindrical detector with many cylindrical layers,

  • and to take that and unroll it into an image where

  • you have phi along here, which is

  • the angle around this direction, and eta this way, which

  • is a physicist's way of describing

  • the forwardness of the deposit.

  • And then just simply put it in an image,

  • and then you get the ability to use

  • all of the kind of developments in image recognition,

  • such as convolutional neural networks.
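
A minimal sketch of this unrolling step, assuming the deposits arrive as arrays of eta, phi, and energy values; the bin counts and coordinate ranges here are illustrative assumptions, not the experiments' actual configuration:

```python
import numpy as np

def deposits_to_image(eta, phi, energy, n_bins=64,
                      eta_range=(-2.5, 2.5), phi_range=(-np.pi, np.pi)):
    """Bin calorimeter deposits into a 2D eta-phi 'image'.

    eta, phi, energy are 1D arrays of equal length; each pixel holds
    the summed energy falling into that (eta, phi) cell.
    """
    image, _, _ = np.histogram2d(
        eta, phi,
        bins=n_bins,
        range=[eta_range, phi_range],
        weights=energy,  # sum energies rather than count hits
    )
    return image.astype(np.float32)

# Toy usage with random deposits standing in for one collision event.
rng = np.random.default_rng(0)
eta = rng.uniform(-2.5, 2.5, size=1000)
phi = rng.uniform(-np.pi, np.pi, size=1000)
energy = rng.exponential(1.0, size=1000)
image = deposits_to_image(eta, phi, energy)
print(image.shape)  # (64, 64)
```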

  • So here's what is now a relatively simple sort

  • of convolutional network, several convolution and pooling

  • layers, and then fully connected layers.

  • And we've exploited this at scales,

  • either taking 64 by 64 images to represent

  • this, or at 224 by 224, sort of closer

  • to the resolution of the detector.

  • And then you can also have multiple channels

  • which correspond to different layers

  • of this cylindrical detector.
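
A rough Keras sketch of a network along these lines (a few convolution and pooling layers followed by fully connected layers, on 64 by 64 images with three detector-layer channels); the filter counts and layer sizes are assumptions rather than the published architecture:

```python
import tensorflow as tf

def build_lhc_cnn(input_shape=(64, 64, 3)):
    """Small CNN for signal-vs-background classification of
    eta-phi calorimeter images, one channel per detector layer."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(signal)
    ])

model = build_lhc_cnn()
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```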

  • And we saw that it kind of works.

  • So here you have a ROC curve of the two-class probability,

  • so true positive rate against false positive rate.

  • So the first thing to notice is that you

  • need a very, very high rejection of the physics

  • that you already know about because that vastly dominates.

  • So actually, the data's even been prefiltered before this.

  • So you need a low false positive rate.

  • And generally, of course, in these ROC curves,

  • higher this way is better.

  • So this point represents the physics selections

  • that were traditionally used in this analysis.

  • These curves here come from incorporating those higher-

  • level physics variables,

  • but in shallow neural net-- in shallow machine

  • learning approaches.

  • And then the blue line shows that you

  • can gain quite a lot, and this is from using

  • convolutional neural networks, which give you access.

  • It's not just the technique of deep learning,

  • but also being able to use all the data that's available.

  • And then, similarly, there's another boost

  • from using three channels, which correspond

  • to the other detector layers.
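
For reference, a ROC curve like the ones described can be computed from a trained classifier's scores with scikit-learn; the labels and scores below are toy placeholders:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# y_true: 1 for signal, 0 for background; y_score: classifier output P(signal).
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))

# Working points at very low false positive rate matter most here,
# since already-known physics vastly outnumbers any potential signal.
for f, t in zip(fpr, tpr):
    if f <= 0.2:
        print(f"FPR={f:.2f} -> TPR={t:.2f}")
```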

  • OK, so I'd just like to caveat a few of these results.

  • So even though there's obvious gains to be had here,

  • these analyses are still not currently used

  • in the Large Hadron Collider analyses.

  • And part of the reason for that is

  • because this is trained on simulation, whereas--

  • and it might pick up on very small aspects of the simulation

  • that differ from the real data.

  • So there's several pieces of work carrying on

  • from this to look at how to incorporate

  • real data into what's done.

  • But also, I think methodological developments

  • in how to interrogate these models,

  • and really discover what they're learning

  • would be kind of useful for the field.

  • OK, so taking a regression problem now,

  • the kind of question you might ask

  • is, what possible universe would look like this?

  • So this is an image from the Sloan Digital Sky Survey.

  • So here the Earth's in the middle,

  • and increasing redshift takes you out this way.

  • So this is like the older universe over here.

  • And you can kind of see--

  • the histogram shows the structure of the galaxy density.

  • So there's like more-- you can see

  • that structure sort of appears as the universe evolves.

  • And that kind of evolution of structure

  • tells you something about the cosmology parameters

  • that were involved.

  • So can you actually regress those parameters

  • from looking at these kind of distributions?

  • So the idea here was to take a method that

  • was developed by these people at CMU, which is to use, again,

  • a convolutional neural network.

  • But here, running on a 3D--

  • this is actually simulated data, so 3D distribution

  • of dark matter in the universe.

  • And these can be large datasets.

  • So part of the work we did here was to kind of scale this up

  • and run on extremely large data across the kind of machines

  • we have-- on Cori, on 8,000 CPU nodes,

  • which is kind of some of the largest scale

  • that TensorFlow has been run on in a data parallel fashion,

  • and then being able to predict cosmology parameters that

  • went into this simulation in a matter of minutes because

  • of the scale of computation used.

  • So, again, it's a fairly standard network,

  • but this time in 3D.
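
A minimal sketch of a 3D convolutional regressor in this spirit, written with Keras; the cube size, filter counts, and number of predicted parameters are assumptions, not the actual CosmoFlow configuration:

```python
import tensorflow as tf

def build_cosmo_regressor(input_shape=(128, 128, 128, 1), n_params=3):
    """3D CNN mapping a dark-matter density cube to a few
    cosmological parameters (regression, so no output activation)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv3D(16, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling3D(),
        tf.keras.layers.Conv3D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling3D(),
        tf.keras.layers.Conv3D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling3D(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(n_params),
    ])

model = build_cosmo_regressor()
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```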

  • And actually, there was quite a bit

  • of work to get this to work well on CPU, for example.

  • Because that's-- sorry, I should have mentioned, actually,

  • that the machine that we have is primarily composed of Intel

  • Knights Landing CPUs.

  • So that's another area that we help develop.

  • OK, so again, it kind of works.

  • And so here they like to plot the true value

  • of the parameter that's being regressed

  • against the predicted value.

  • So you hope that it would match up along the line.

  • And, indeed, the points which come

  • from a run of this regression do actually

  • lie along this line in all the cases.

  • Now, the actual crosses come from the larger-scale run.

  • So the points come from a 2,000 node run of the network,

  • and the crosses come from an 8,000 node run.

  • So the sort of caveat here is that the 8,000 node run doesn't

  • actually do as well as the 2,000 node run, which kind of

  • points to another thing we're working on,

  • the convergence of these neural networks

  • when running at a large distributed scale.

  • So I'll come back to that a bit later as well.

  • OK, so then the last example I have here is around generation.

  • So, as I mentioned, simulations went

  • into that previous analysis, and they

  • go into all of the kind of analyses

  • that are done at these--

  • for these kind of purposes.

  • But in order to get some of these simulations,

  • it actually takes two weeks of computational time

  • on a large supercomputer of the scale of Cori,

  • which is kind of a huge amount of resource

  • to be giving over to these.

  • And for different versions of the cosmology

  • that you want to generate, you need

  • to do more of these simulations.

  • And part of the output that you might get from these

  • is this 2D mass map, which corresponds

  • to the actual galaxy distribution

  • that you might observe from data.

  • So this is what you would want to compare with data.

  • So the question is, is it possible to augment your data

  • set and generate different cosmologies

  • in a kind of fast simulation that doesn't require running

  • this full infrastructure?

  • And the way that we tried to do that

  • was to use generative adversarial networks that

  • were mentioned briefly in the last talk, where you have

  • two networks that work in tandem with each other

  • and that are optimized together, but against each other,

  • and one is trying to tell the difference between real and generated maps.

  • So the advantage here is that we have some real examples

  • from the full simulation to compare with.

  • And the discriminator can take those and say,

  • does the generator do a good job of producing fake maps?

  • And we used a pretty standard DCGAN architecture.

  • And there was a few modifications in this paper

  • to make that work.
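
A heavily simplified Keras sketch of the DCGAN-style setup described (a generator mapping a latent vector to a mass map, and a discriminator judging real versus generated maps), here for 64 by 64 single-channel maps; the actual work used larger maps and the modifications described in the paper:

```python
import tensorflow as tf

LATENT_DIM = 64

# Generator: latent vector -> 64x64x1 "mass map".
generator = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(LATENT_DIM,)),
    tf.keras.layers.Dense(8 * 8 * 128),
    tf.keras.layers.Reshape((8, 8, 128)),
    tf.keras.layers.Conv2DTranspose(64, 4, strides=2, padding="same",
                                    activation="relu"),
    tf.keras.layers.Conv2DTranspose(32, 4, strides=2, padding="same",
                                    activation="relu"),
    tf.keras.layers.Conv2DTranspose(1, 4, strides=2, padding="same",
                                    activation="tanh"),
])

# Discriminator: map -> probability it came from the full simulation.
discriminator = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 1)),
    tf.keras.layers.Conv2D(32, 4, strides=2, padding="same"),
    tf.keras.layers.LeakyReLU(0.2),
    tf.keras.layers.Conv2D(64, 4, strides=2, padding="same"),
    tf.keras.layers.LeakyReLU(0.2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# The two are trained adversarially: the discriminator on real vs. generated
# maps, the generator to make the discriminator label its output as real.
fake_maps = generator(tf.random.normal([8, LATENT_DIM]))
scores = discriminator(fake_maps)
```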

  • But at the time last year there weren't many people

  • trying to apply this.

  • One of the other applications being done at the time

  • was actually to the particle physics

  • problem from other people at Berkeley Lab.

  • And, again, it works.

  • And so the top plot is a validation set

  • of images that weren't used in the training.

  • And the bottom is the generated images.

  • I mean, the first time cosmologists

  • saw this, they were kind of pretty well surprised

  • that they couldn't tell the difference between them,

  • because they weren't really expecting

  • us to be able to do so well that they wouldn't

  • be able to tell by eye.

  • But you certainly can't tell by eye.

  • But one of the advantages of working with this in science,

  • as opposed to celebrity faces, is

  • that we do actually have good metrics for determining

  • whether we've done well enough.

  • So the top right is the power spectrum,

  • which is something often used by cosmologists,

  • but it's just the Fourier transform

  • of a two-point correlation.

  • So it sort of represents the Gaussian fluctuations

  • in the plot.

  • And here you can see the black is the validation,

  • and the pink is the GAN.

  • And so it not only agrees on the mean, this kind of middle line,

  • but it also captures the distribution of this variable.
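
A rough sketch of how such a power spectrum can be computed from a 2D map with NumPy, by azimuthally averaging the squared Fourier amplitudes; normalization conventions are glossed over:

```python
import numpy as np

def power_spectrum_2d(field, n_bins=32):
    """Azimuthally averaged power spectrum of a square 2D map."""
    n = field.shape[0]
    power = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2

    # Radial wavenumber |k| of each Fourier mode.
    freqs = np.fft.fftshift(np.fft.fftfreq(n)) * n
    kx, ky = np.meshgrid(freqs, freqs)
    k = np.sqrt(kx ** 2 + ky ** 2)

    # Average the power in radial bins of |k|.
    bins = np.linspace(0, k.max(), n_bins + 1)
    which = np.digitize(k.ravel(), bins)
    spectrum = []
    for i in range(1, n_bins + 1):
        vals = power.ravel()[which == i]
        spectrum.append(vals.mean() if vals.size else 0.0)
    k_centers = 0.5 * (bins[1:] + bins[:-1])
    return k_centers, np.array(spectrum)

# Toy usage on a random Gaussian field standing in for a mass map.
k, pk = power_spectrum_2d(np.random.randn(64, 64))
```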

  • And it's not just two-point correlations.

  • But the plot on the right shows something

  • called the Minkowski functional, which

  • is a form of three-point correlation.

  • So even non-Gaussian structures in these maps are reproduced.

  • And the important point, I guess,

  • is that you could just sample from these distributions

  • and reproduce those well.

  • But all this was trained to do was to reproduce the images.

  • And it got these kind of structures, the physics that's

  • important, right?

  • OK, so this is very promising.

  • But obviously, the holy grail of this

  • is really to be able to do this for different values

  • of the initial cosmology, sort of parameterized generation,

  • and that's what we're working towards now.

  • OK, so I mentioned throughout that we

  • have these big computers.

  • So another part of what we're trying to do

  • is use extreme computing scales in a data parallel way

  • to train these things faster-- so not just training

  • different hyperparameters on different nodes

  • in our computer, which, of course,

  • people also do,

  • but also to train one model quicker.

  • And we've done that for all of these examples

  • here, so the one on the left is the LHC

  • CNN, which I described first.

  • So I should say also that we have a large Cray

  • machine that has an optimized high-performance network.

  • But that network has been particularly optimized

  • for HPC simulations and stuff and MPI-based architectures.

  • So really it's been a huge gain for this kind of work

  • that these plug-ins now exist that allow

  • you to do MPI-distributed training, such as Horovod

  • from Uber.

  • And also Cray have developed a machine-learning plug-in here.

  • So we used that for this problem.
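
The data-parallel training follows the usual Horovod-with-Keras pattern; here is a minimal sketch in which the data and model are tiny stand-ins and the settings are illustrative rather than the exact configuration from these runs:

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Launch with e.g.: horovodrun -np 4 python train.py
hvd.init()

# Synthetic stand-in data; each rank reads its own shard of the dataset.
x = np.random.rand(1024, 64, 64, 3).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")
dataset = (tf.data.Dataset.from_tensor_slices((x, y))
           .shard(hvd.size(), hvd.rank())
           .batch(64))

# A small stand-in model (imagine the LHC CNN sketched earlier).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Scale the learning rate with the number of workers and wrap the optimizer
# so gradients are averaged across ranks (MPI allreduce) at each step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(optimizer=opt, loss="binary_crossentropy")

# Start all workers from identical weights; print progress on rank 0 only.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(dataset, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```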

  • And you can basically see that all the lines kind of

  • follow the ideal scaling up to thousands of nodes.

  • And so you can really process more data faster.

  • So there's quite a bit of engineering work

  • that goes into this, of course.

  • And this is shown in the papers here.

  • For the LHC CNN, there's still a gap from the ideal.

  • And that's partly because it's not

  • such a computationally intensive model.

  • So I/O and stuff becomes more important.

  • For the CosmoGAN on the right, that's the GAN example.

  • And there it is a bit more computationally intensive.

  • So it does follow the scaling curve here.

  • And CosmoFlow I just put in the middle here.

  • I mean, here we show that I/O was important.

  • But we were able to exploit something we have,

  • which is this burst buffer, which is a layer of SSDs that

  • sits on the high-speed network.

  • And we were much better able to scale up

  • to the scale of the full machine, 8,000 nodes, using

  • that than with the shared disk file system.

  • OK, so another area that I just wanted to briefly mention

  • at the end is that we have these fancy supercomputers,

  • but as I mentioned, this takes a lot of engineering.

  • And that was with projects that we work particularly closely with.

  • Something that we really want to do

  • is to allow people to use the supercomputer scale of kind

  • of deep learning via Jupyter notebooks,

  • which is really how scientists prefer to interact

  • with our machines these days.

  • So we do provide JupyterHub at NERSC.

  • But generally what people are running

  • sits on a dedicated machine outside the main compute

  • infrastructure.

  • But what we did here was to enable them to run stuff

  • actually on the supercomputing machine, either

  • using ipyparallel or Dask, which are tools used for distributed

  • computing, but interfacing with this via the notebook,

  • and where a lot of the heavy computation

  • actually occurs with the MPI backend in Horovod and stuff.

  • So we were able to show that you can scale to large scales

  • here without adding any extra overhead from being

  • able to interact.
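
A rough sketch of that notebook-driven pattern with ipyparallel, assuming a cluster of engines has already been launched under MPI on the compute nodes; the profile name and the function being run are placeholders:

```python
import ipyparallel as ipp

# Connect to engines started on the compute nodes, e.g. by a batch job
# that launches ipengine processes under srun/mpirun.
rc = ipp.Client(profile="mpi")  # placeholder profile name
view = rc[:]                    # view over all engines

def report_rank():
    # Runs on every engine; Horovod/MPI handle the collective work.
    import horovod.tensorflow.keras as hvd
    hvd.init()
    return hvd.rank(), hvd.size()

# Drive the distributed run interactively from the notebook.
print(view.apply_sync(report_rank))  # e.g. [(0, 64), (1, 64), ...]
```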

  • And then, of course, you can add nice Jupyter things,

  • like widgets and buttons, so that you

  • can run different hyperparameter trials and sort of click

  • on them and display them in the pane there.

  • OK, so I just have my conclusion now.

  • Basically deep learning, particularly

  • in combination with high-performance computing,

  • and productive software like TensorFlow,

  • can really accelerate science.

  • And we've seen that in various examples.

  • I only mentioned a few here.

  • But it requires developments, not only in methods.

  • So we have other projects where we're

  • working with machine-learning researchers

  • to develop new methods.

  • Also, there's different ways of applying these, of course.

  • And we're well-placed to do that.

  • And also well-placed to do some of the computing work.

  • But it can really benefit from collaboration, I think,

  • between scientists and industry, which is sort

  • of better represented here.

  • And we've had a good relationship with Google there.

  • But I think we can also do that with others.

  • So basically my last slide is a call for help.

  • If you have any questions, but also

  • if you have ideas or collaborations

  • or want to work with us on these problems,

  • that would be good to hear about.

  • Thanks.

  • [APPLAUSE]

  • AUDIENCE: [INAUDIBLE]

  • WAHID BHIMJI: Yeah.

  • Well, what was I going to say about supersymmetry?

  • AUDIENCE: [INAUDIBLE]

  • [LAUGHS]

  • WAHID BHIMJI: Well, yeah, I don't think it's real.

  • But no, I guess I was going to say whether or not--

  • so this example isn't supersymmetry.

  • I can tell you that.

  • But yeah, part of the point of this network, I guess,

  • is it vastly improves our sensitivity to supersymmetry.

  • So hopefully it could help answer the question

  • if it's real or not.

  • But yeah, certainly there's no current evidence

  • at the Large Hadron Collider, so that's why we keep looking.

  • And also I would say that some of these approaches

  • also might help you look in different ways, for example,

  • not being so sensitive to the model

  • that theorists have come up with.

  • So this approach really trains on simulated samples

  • of a particular model that you're looking out for.

  • But some of the ways that we're trying to extend this

  • is to be more of a sort of anomaly detection,

  • looking for things that you might not have expected.

  • AUDIENCE: [INAUDIBLE]

  • WAHID BHIMJI: Yeah, so the thing with these big collaborations

  • that work at the LHC is they're very sensitive about you

  • working on the real data.

  • So, for example, someone who's coming from outside

  • can't just apply this model on the real data.

  • You have to work within the collaboration.

  • So the reason why they don't use things

  • like this in the collaboration at the moment

  • is partly just because of the kind of rigor

  • that goes into cross-validating and checking

  • that all these models are correct.

  • And that just takes a bit of time for it to percolate.

  • There's a little bit of a skepticism

  • amongst certain people about new ideas, I guess.

  • And so I think the fact that this has now

  • been demonstrated in several studies sort of

  • adds a bit of mitigation to that.

  • But then the other reason, I think,

  • is that there's another practical reason, which

  • is that this was exploiting kind of the full raw data.

  • And apart from-- there are sort of practical reasons

  • why you wouldn't want to do that,

  • because these are large hundreds of petabyte datasets.

  • And so filtering, in terms of these high-level physics

  • variables, is not only done because they

  • think they know the physics, but also

  • because of practicality of storing the data.

  • So those are a bunch of reasons.

  • But then I gave another sort of technical reason, which

  • is that you might be more sensitive to this,

  • to mismodeling in the training data.

  • And they really care about systematic uncertainties

  • and so forth.

  • And those are difficult to model when

  • you don't know what it is you're trying to model for, I guess,

  • because you don't know what the network necessarily picked up

  • on.

  • And so there's various ways to mitigate

  • some of these technical challenges,

  • including sort of mixing of data with simulated data,

  • and also using different samples that don't necessarily

  • have all the same modeling effects, and things like that.

  • So there's a part that's cultural,

  • but there's also a part that's technical.

  • And so I think we can try and address the technical issues.

  • AUDIENCE: [INAUDIBLE]

  • WAHID BHIMJI: Not directly this model.

  • But yeah, there is certainly a huge range of work.

  • So we mostly work with the Department of Energy science.

  • And there's projects across all of here.

  • So, I mean, we can only work in depth,

  • I guess, with a few projects.

  • So we have a sort of few projects

  • where we work in depth.

  • But really part of what our group does at NERSC

  • is also to make these tools like TensorFlow, working

  • at scale on the computer, available to the whole science

  • community.

  • And so we're having more and more training events,

  • and the community are picking up--

  • different communities are picking this up more

  • by themselves.

  • And that really allows--

  • so there certainly will be groups

  • working with other parts of physics, I think.

  • AUDIENCE: [INAUDIBLE]

  • WAHID BHIMJI: Why deep learning, right?

  • I mean, it's not really the only approach.

  • So like this actual diagram doesn't just

  • refer to deep learning.

  • So some of these projects around the edge

  • here are not using deep learning.

  • My talk was just about a few deep learning examples.

  • So it's not necessarily the be all and end all.

  • I think one of the advantages is that there

  • is a huge amount of development work in methods in things

  • like convolutional neural networks and stuff that we can exploit.

  • Relatively off the shelf, they're

  • already available in tools like TensorFlow and stuff.

  • So I mean, I think that's one advantage.

  • The other advantage is there is quite a large amount of data.

  • There are nonlinear features and stuff

  • that can be captured by these more detailed models

  • and so forth.

  • So I don't know.

  • It's not necessarily the only approach.

  • AUDIENCE: [INAUDIBLE]

  • WAHID BHIMJI: Yeah, so, I mean, there certainly

  • are cases where the current models are being well tuned,

  • and they are right.

  • But there's many cases where you can get an advantage.

  • I mean, you might think that the physics selections here are

  • well tuned over many years.

  • They know what the variables are to look for.

  • But there is a performance gain.

  • And it can just be from the fact that you're

  • exploiting more information in the event

  • that you were otherwise throwing away, because you thought

  • you knew everything that was important about those events,

  • but actually there was stuff that

  • leaked outside, if you like, the data that you were looking at.

  • So there are gains to be had, I think, yeah.
