
JOSH DILLON: Hi, I'm Josh, and today I'm going to be talking about TensorFlow Probability, which is a project I've been working on for two years now. So what is TensorFlow Probability? We're part of the TensorFlow ecosystem. It's a library built on TensorFlow, and the idea is to make it easy to combine deep learning with probabilistic modeling. It's useful for statisticians and data scientists, to whom we can provide R-like capabilities that take advantage of GPUs and TPUs, and for ML researchers and practitioners, so you can build deep models that capture uncertainty.

So why should you care? A neural net that predicts binary outcomes is just a Bernoulli distribution that's parameterized by something fancy. So suppose you have that: you've got your V1 model. Looks great. Now what? That's where TensorFlow Probability can help you out. Using our software, you can encode additional information about your problem. You can control prediction variance. You can even ask tougher questions. You no longer have to assume that pixels are independent, because guess what? They're not. That's what we're going to be talking about.

So the main take-home message for this talk is that TensorFlow Probability is a collection of low-level tools aimed at making it easier for you to express what you know about your problem: not to shoehorn your problem into a neural net architecture, but rather to describe what you know and take advantage of it. These images over here, a few of which we'll talk about, each represent a part of the TensorFlow Probability package.

OK, so in its simplest form, how would you use TensorFlow Probability? As a get-our-feet-wet example, we offer generalized linear models. Think logistic regression and linear regression. Maybe very boring stuff, but it's a good starting point. You'll see this pattern throughout the TensorFlow Probability software stack: you specify a model, in this case a Bernoulli, corresponding to logistic regression, and then you just fit it. In this case we're using L1 and L2 regularization so you can get sparse weights. And the reason you should care is that it's using a second-order solver under the hood, which means that, up to floating point precision, you would never need more than about 30 iterations, and in practice maybe three or four is all it takes. And since it can take advantage of GPUs, it's like a drop-in replacement for, say, R in this case.
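For a rough sense of what that pattern looks like, here is a minimal sketch using the tfp.glm module; the synthetic data and settings below are made-up placeholders rather than the code on the slide, and details may differ across TFP versions.

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

# Synthetic data, purely for illustration.
x = np.random.randn(100, 5).astype(np.float32)                  # design matrix
w_true = np.array([1., 0., -2., 0., 0.5], dtype=np.float32)
p = 1. / (1. + np.exp(-x.dot(w_true)))
y = (np.random.rand(100) < p).astype(np.float32)                # binary outcomes

# Specify the model (Bernoulli == logistic regression), then fit it.
# tfp.glm.fit uses a second-order (Fisher scoring) solver under the hood;
# tfp.glm.fit_sparse adds L1/L2 regularization for sparse weights.
coeffs, linear_response, is_converged, num_iter = tfp.glm.fit(
    model_matrix=tf.constant(x),
    response=tf.constant(y),
    model=tfp.glm.Bernoulli())
```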

So that's just an example of some of the canned stuff we offer.

Where things really get exciting is this suite of tools. First we're going to talk about distributions, which are probably what you think they are. We'll also talk about bijectors in this talk. TensorFlow Probability provides probabilistic layers, things that wrap up variational inference with different distributional assumptions. We have a probabilistic programming language, which is the successor of Edward; that's also part of the TensorFlow Probability package. That's all on the model-building side. On the inference side, we've got a collection of Markov chain Monte Carlo transition kernels and tools to use them, diagnostic criteria, that sort of thing; tools for variational inference in a numerically stable way; and various optimizers, like stochastic gradient Langevin dynamics, BFGS, [INAUDIBLE], the stuff that isn't plain stochastic gradient descent, some of which are more useful for single-machine settings, others baking probability into optimization.

OK, so a distribution. I hope this is boring, because nothing here should be really fancy: the capability of drawing samples; you can compute probabilities, the CDF, one minus the CDF, mean, variance, all the usual stuff. It gets a little more interesting at the bottom: the event shape and, you can't see it, but it says batch shape. To take advantage of vectorized [INAUDIBLE] hardware, TensorFlow Probability distributions let you call the distribution once but specify multiple parameters. So here's an example. We're building a normal, but we're passing two location parameters. So when you call sample on this, it's going to return two samples each time you call sample, if that makes sense: one corresponding to the normal distribution parameterized with mean minus 1, the other with mean 1. It turns out this very simple idea is extremely powerful and lets you immediately take advantage of vectorized computation.
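As a concrete sketch of those batch semantics (the numbers here are arbitrary, not the slide's):

```python
import tensorflow_probability as tfp
tfd = tfp.distributions

# One call, two distributions: batch_shape=[2], scalar events.
dist = tfd.Normal(loc=[-1., 1.], scale=1.)

samples = dist.sample(3)        # shape [3, 2]: 3 draws from each of the two normals
log_probs = dist.log_prob(0.)   # shape [2]: log-density of 0. under each normal
print(dist.batch_shape, dist.event_shape)   # (2,) and ()
```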

So not only do distributions have this small tweak relative to other libraries or packages, we've also got a bunch of them, and you can combine them in interesting ways. It's not super important what distribution this is; the point is that we're making a mixture, combining a categorical distribution with multivariate normals with a diagonal parameterization, and it all just fits together, so you can do cool things using simple building blocks. That's a theme that's pervasive in TensorFlow Probability: simple ideas scaled up into a powerful framework and formalism.
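A sketch of that kind of composition (the weights, means, and scales below are invented for illustration, not taken from the slide):

```python
import tensorflow_probability as tfp
tfd = tfp.distributions

# A two-component Gaussian mixture built from simple pieces.
gm = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(probs=[0.3, 0.7]),
    components_distribution=tfd.MultivariateNormalDiag(
        loc=[[-1., -1.], [1., 1.]],           # one 2-D mean per component
        scale_diag=[[0.5, 0.5], [1., 1.]]))   # one diagonal scale per component

x = gm.sample(5)       # shape [5, 2]
lp = gm.log_prob(x)    # shape [5]
```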

So here's another example of a distribution we have: Gaussian processes. I think this is cool because in a few lines you can learn uncertainty. Notice that the model has different beliefs in areas where there's no data and is tight where there is. You could easily turn this into a layer in your neural net if you wanted to.
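For flavor, here is a minimal Gaussian process regression sketch; the synthetic data, kernel choice, and noise level are assumptions, and the kernels module has moved between TFP versions.

```python
import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions
psd_kernels = tfp.math.psd_kernels   # older versions: tfp.positive_semidefinite_kernels

# Synthetic 1-D data, purely for illustration.
obs_x = np.linspace(-1., 1., 20)[..., np.newaxis]
obs_y = np.sin(3. * obs_x[..., 0]) + 0.1 * np.random.randn(20)
pred_x = np.linspace(-2., 2., 100)[..., np.newaxis]   # where we want predictions

gprm = tfd.GaussianProcessRegressionModel(
    kernel=psd_kernels.ExponentiatedQuadratic(),
    index_points=pred_x,
    observation_index_points=obs_x,
    observations=obs_y,
    observation_noise_variance=0.01)

mean = gprm.mean()       # posterior mean at pred_x
stddev = gprm.stddev()   # wide where there's no data, tight where there is
```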

OK, so distributions: there are a bunch of them, they have these batch semantics, and they're cool.

On to our second building block: bijectors. A bijector is useful for transforming a random variable. Think log and exp: on the forward transformation you take the exponential of some random variable, and to reverse it you take the logarithm. The forward is useful for computing samples, and the inverse is useful for computing probabilities. So a bijector is a diffeomorphism, a differentiable bijection between two spaces, and those spaces represent an input random variable and an output random variable. Because we're interested in computing probabilities, we have to keep track of the Jacobian; it's just the change of variables in an integral, and that's what this implements. We also have the notion of shape, because here, again, everything supports these batch shape semantics.
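A minimal sketch of the log/exp example, using TransformedDistribution to handle the Jacobian bookkeeping:

```python
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

exp = tfb.Exp()
y = exp.forward(0.)    # e^0 = 1: the forward direction is used for sampling
x = exp.inverse(1.)    # log(1) = 0: the inverse direction is used for log_prob

# Pushing a standard normal through Exp gives a log-normal; the Jacobian
# correction is tracked automatically.
log_normal = tfd.TransformedDistribution(
    distribution=tfd.Normal(loc=0., scale=1.),
    bijector=tfb.Exp())
samples = log_normal.sample(4)        # always positive
lp = log_normal.log_prob(samples)     # includes the log|det Jacobian| term
```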

So what would you use a bijector for? Behind this slide is an amazing idea: you can take a neural net and use it to transform any distribution you want, and get an arbitrarily rich distribution. This little piece of code really is just a neural net with two dense hidden layers, wrapped up inside this autoregressive flow bijector, which transforms a normal. Now, here's why this is amazing. You could plug this in as your loss: the final line, just distribution.log_prob, could basically be your loss. That's an arbitrarily rich distribution capable of learning variance, not variance prescribed the way, say, a Bernoulli prescribes it, where the variance is p times 1 minus p. Unless your data actually is generated by a Bernoulli distribution, that's a fairly restrictive assumption, and any time that's not the case it's very sensitive to mis-specification. So this is a much richer family, and it immediately combines neural nets and distributions.

Another cool thing: you can reverse bijectors, and that little one-line change was a whole other paper. We see this phenomenon in TensorFlow Probability a lot: because everything is low level and modular, one little change can be a brand new idea.
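Here is a sketch of both ideas, the masked autoregressive flow and its one-line reversal; the latent dimensionality and hidden-unit counts are arbitrary, and the exact class names (for example, AutoregressiveNetwork versus the older masked_autoregressive_default_template) vary across TFP versions.

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

dims = 2
# A small autoregressive network plays the role of the "two dense hidden layers".
made = tfb.AutoregressiveNetwork(params=2, hidden_units=[32, 32])

# Masked autoregressive flow: a normal base distribution transformed by the net.
maf = tfd.TransformedDistribution(
    distribution=tfd.Sample(tfd.Normal(loc=0., scale=1.), sample_shape=[dims]),
    bijector=tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=made))

x = maf.sample(100)
loss = -tf.reduce_mean(maf.log_prob(x))   # the log_prob on the "final line" is the loss

# Reversing the bijector, a one-line change, gives an inverse autoregressive flow.
iaf = tfd.TransformedDistribution(
    distribution=tfd.Sample(tfd.Normal(loc=0., scale=1.), sample_shape=[dims]),
    bijector=tfb.Invert(tfb.MaskedAutoregressiveFlow(
        shift_and_log_scale_fn=tfb.AutoregressiveNetwork(
            params=2, hidden_units=[32, 32]))))
```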

OK, so that's some background. Let's go through an example of how you might use this. This is from a book, "Bayesian Methods for Hackers," which we'll talk about at the end. The question is this: I guess the guy who wrote this book got a girlfriend, and at some point his text messaging frequency changed. Can we find that point in the data? Maybe you'd guess day 22, or maybe somewhere in the 40s. I don't know. Let's see.

So here's a simple model. We'll posit that there was one rate of text messages in some pre period and another rate in some post period, and the question is whether there was a changeover. That's the math, or the statistical program, as I like to call it. That statistical program translates into TensorFlow Probability in an almost one-to-one way: exponential, uniform, flip it over, final Poisson. And to compute the joint log prob, we just add everything up in log space. Using that, we can sample from the posterior.
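A sketch of that change-point model as a joint log prob function; count_data stands in for the per-day message counts, and the variable names here are illustrative rather than the book's exact notebook code.

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def joint_log_prob(count_data, lambda_1, lambda_2, tau):
    alpha = 1. / tf.reduce_mean(count_data)
    rv_lambda_1 = tfd.Exponential(rate=alpha)   # prior on the pre-period rate
    rv_lambda_2 = tfd.Exponential(rate=alpha)   # prior on the post-period rate
    rv_tau = tfd.Uniform()                      # changeover point, scaled to [0, 1]

    # "Flip it over": lambda_1 before the changeover day, lambda_2 after.
    n_days = tf.cast(tf.size(count_data), tf.float32)
    day = tf.range(n_days)
    before = tf.cast(day < tau * n_days, tf.float32)
    rate = before * lambda_1 + (1. - before) * lambda_2
    rv_counts = tfd.Poisson(rate=rate)

    # Add everything up in log space.
    return (rv_lambda_1.log_prob(lambda_1)
            + rv_lambda_2.log_prob(lambda_2)
            + rv_tau.log_prob(tau)
            + tf.reduce_sum(rv_counts.log_prob(count_data)))
```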

And what we find is: yes, there was one rate of around 18 text messages a day, I guess, and another around 23, and it turns out the highest posterior probability was on day 44.

So how did we get these posterior samples from the joint log probability? We used MCMC. Our MCMC library has several transition kernels, and I think one of the more powerful ones, because it takes advantage of automatic differentiation, is Hamiltonian Monte Carlo. All we do to use it is take our joint log prob, which you saw on the previous slide, and pin whatever we want to condition on. In this case, we condition on the count data, and we want to sample tau and the two lambdas, the rates and the changeover point. So we set this up and ask for some number of results, burn-in steps, the usual MCMC business. Something a little different here is this transformer. The transformer takes a constrained random variable and unconstrains it, because HMC is taking gradient steps and may step out of bounds. Since the lambda terms are rates of a Poisson, they need to be positive, so the Exp bijector maps back and forth between the positive reals and the unconstrained reals. So too with tau: that lives on the [0, 1] interval, and using Sigmoid, which you can't see here, we transform to and from.
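A sketch of that setup, reusing the joint_log_prob and count_data from the model sketch above; the step size, leapfrog steps, chain length, and initial state are arbitrary assumptions.

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfb = tfp.bijectors

# Pin the count data; the remaining arguments are what we sample.
def unnormalized_log_posterior(lambda_1, lambda_2, tau):
    return joint_log_prob(count_data, lambda_1, lambda_2, tau)

kernel = tfp.mcmc.TransformedTransitionKernel(
    inner_kernel=tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=unnormalized_log_posterior,
        step_size=0.05,
        num_leapfrog_steps=3),
    bijector=[tfb.Exp(),       # lambda_1 > 0
              tfb.Exp(),       # lambda_2 > 0
              tfb.Sigmoid()])  # tau in (0, 1)

[lambda_1_samples, lambda_2_samples, tau_samples], _ = tfp.mcmc.sample_chain(
    num_results=1000,
    num_burnin_steps=500,
    current_state=[tf.constant(10.), tf.constant(10.), tf.constant(0.5)],
    kernel=kernel)
```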

And day 44? It turns out that really was when he started dating, so it seems like Bayesian inference was right.

OK. So here's a super hard graphical model, which we won't talk through. The point is, there's a whole lot of math here and it looks really scary. Not really: each line basically translates one to one. You can pull a graphical model out of the literature from before neural nets got really popular again and code it up in TensorFlow Probability. And where things get amazing is that you can actually parameterize these distributions with a neural net, thus getting the benefit of both, and you can differentiate through the whole thing. So it's really a case of what's old is new again, yet in a way that takes advantage of modern hardware. So it's just one to one between the math and TFP.

OK. So we did see a little bit of the deep learning side, the masked autoregressive flow, and I mentioned you can re-parameterize stuff. So here's the idea of re-parameterization. As we know, probabilistic graphical models tend to be computationally very intensive, and neural nets are really good at embedding data into a lower dimensional space. Why not take your complex, computationally intensive probabilistic graphical model and parameterize it with a neural net? That's what this slide is saying we should think about doing.

So you've heard of GANs. Variational autoencoders are kind of the probabilistic analog of the GAN: the idea of adversarial networks fighting each other to reach a good balance actually has a probabilistic analog, and this is it. In this case, the posterior distribution takes, say, an image and is a distribution over a low-dimensional space Z, and the likelihood is a distribution that takes a low-dimensional representation and outputs back the image. And using variational inference, which really just consists of about ten lines of code, you can take these different distributions, which are themselves parameterized by neural nets, and fit them with Monte Carlo variational inference, taking advantage of TensorFlow's automatic differentiation. So it all fits together nicely.
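As a toy illustration of that Monte Carlo variational inference loop, here is a minimal ELBO sketch; the encoder and decoder architectures, sizes, and the single-sample estimator are all assumptions, not the slide's code.

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

latent_dim, data_dim = 8, 784
prior = tfd.MultivariateNormalDiag(loc=tf.zeros(latent_dim))

encoder = tf.keras.Sequential([          # image -> parameters of q(z|x)
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(2 * latent_dim)])
decoder = tf.keras.Sequential([          # z -> logits of p(x|z)
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(data_dim)])

def negative_elbo(x):
    params = encoder(x)
    q_z = tfd.MultivariateNormalDiag(
        loc=params[..., :latent_dim],
        scale_diag=tf.nn.softplus(params[..., latent_dim:]))
    z = q_z.sample()                     # reparameterized sample, so gradients flow
    likelihood = tfd.Independent(tfd.Bernoulli(logits=decoder(z)),
                                 reinterpreted_batch_ndims=1)
    # Single-sample Monte Carlo estimate of the evidence lower bound.
    elbo = likelihood.log_prob(x) + prior.log_prob(z) - q_z.log_prob(z)
    return -tf.reduce_mean(elbo)
```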

OK. So that was a lot of information that we breezed through quickly. We are in the process of rewriting the "Bayesian Methods for Hackers" book using TensorFlow Probability. The book already exists; I think there's a PyMC version of it. We've started on all the chapters, and one and two are in the best shape, so definitely start with those. In chapter one you'll find the text message example. But that's basically it.

So in conclusion, TensorFlow Probability helps you combine deep learning with probabilistic modeling so you can encode additional domain knowledge about your problem. It's a pip install away and easy to use, and you can check it out as part of the TensorFlow ecosystem to learn more. Thanks. And I've got a few minutes here for questions, if anyone has any. Yeah.

AUDIENCE: [INAUDIBLE]

JOSHUA DILLON: Yeah. So the question is, can I quantify uncertainty in a neural net using this stuff? And the answer is absolutely yes; that's why you would use this stuff. In fact, the larger question of why you would even use probabilistic modeling at all probably comes down to wanting to quantify uncertainty. I pulled back to this variational autoencoder slide because, and it's a little hard to see here since it's just code, this low-dimensional space is basically inducing uncertainty as a bottleneck. And all of your neural nets do this: often you'll go from a larger hidden layer to a smaller one and back to a larger one. The point here is to do that in a principled way. Keep track of what you lose by compressing it down, and in so doing you actually get a measure of how much you lost. And while this is a variational autoencoder, the supervised learning alternative would be the variational information bottleneck, and the code for that is almost exactly the same. The only difference is that you're reconstructing a label from some input x. So you go from x, to z, to y: image, to low-dimensional representation, back to the thing you're trying to predict.
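A sketch of that variational information bottleneck variant, mirroring the VAE sketch above; the class count, latent size, beta weight, and layer sizes are placeholders.

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

num_classes, latent_dim = 10, 8
prior = tfd.MultivariateNormalDiag(loc=tf.zeros(latent_dim))
label_decoder = tf.keras.layers.Dense(num_classes)   # z -> logits over labels

def vib_loss(x, y, encoder, beta=1e-3):
    params = encoder(x)                               # same encoder shape as the VAE sketch
    q_z = tfd.MultivariateNormalDiag(
        loc=params[..., :latent_dim],
        scale_diag=tf.nn.softplus(params[..., latent_dim:]))
    z = q_z.sample()
    likelihood = tfd.Categorical(logits=label_decoder(z))   # x -> z -> y
    # Reconstruct the label; penalize information the bottleneck keeps about x.
    return -tf.reduce_mean(likelihood.log_prob(y)
                           - beta * (q_z.log_prob(z) - prior.log_prob(z)))
```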

OK. So I'm out of time, and with that, I will hand it over to you.
