
  • (upbeat ambient music)

  • - I'm hoping that I'm gonna tell you something

  • that's interesting and, of course,

  • I have this very biased view,

  • which is I look at things from my computational lens

  • and are there any computer scientists in the room?

  • I was anticipating not, but okay, there are,

  • so there's one, maybe every now

  • and then I'll ask you a question,

  • no, no, no, I'm just kidding, but, so,

  • and then so my goal here is gonna be to basically,

  • actually just give you a flavor of what is machine learning,

  • this is my expertise, and so just, actually,

  • again, to get a sense of who's in the room,

  • like, if I picked on someone here,

  • like raise your hand if you would be able to answer

  • that question, like, what is machine learning?

  • Okay, a handful, no, actually one, or two.

  • Great, okay, so I just want to give you a sense

  • of that, and I'm gonna, you know,

  • most of this is gonna be pretty intuitive,

  • I'll try to make little bits of it concrete

  • that I think will be helpful,

  • and then I'll tell you how we use machine learning

  • to improve guide designs, specifically

  • for knockdown experiments, but I think a lot

  • of it is probably useful for more than that,

  • but we haven't sort of gone down that route,

  • and so I can't say very much about that.

  • And please interrupt me if something doesn't make sense

  • or you have a question, I'd rather do

  • that so everybody can kind of stay on board rather

  • than some, you know, it makes less

  • and less sense the longer I go.

  • Alright, so machine learning, actually, during my PhD,

  • the big, one of the big flagship conferences was peaking

  • at around 700 attendees, and when I go now,

  • it actually is capped, like, it's sold out at 8,000 like,

  • months in advance, 'cause this field is just like,

  • taken off, basically it's now lucrative for companies,

  • and it's become a really central part of Google,

  • Microsoft, Facebook, and all the big tech companies,

  • so this field has changed a lot,

  • and kind of similar to CRISPR,

  • there's an incredible amount of hype and buzz

  • and ridiculous media coverage and

  • so it's a little bit funny, in fact,

  • that I'm now working in these two kind of,

  • very hyped up areas.

  • But anyway, so, you know,

  • people in just the mainstream press now,

  • you're always hearing about artificial intelligence

  • and deep neural networks, and so these are like,

  • so I would say machine learning is a sub-branch

  • of artificial intelligence,

  • and a deep neural network is sort

  • of an instance of machine learning, and so like,

  • what really is this, this thing?

  • So it kind of overlaps sometimes

  • with traditional statistics, but the,

  • like, in terms of the machinery,

  • but the goals are very different and,

  • but, really like the core, fundamental concept here is

  • that we're gonna sort of posit some model, so maybe like,

  • think of linear regression as a super simple model,

  • and you can like, expose it to data, it has some parameters,

  • right, the weights, and then we essentially want

  • to fit those weights, and that's the training,

  • that's literally the machine learning.

  • So I'm sorry if that sounds super simple

  • and not like, God-like, like machine learning

  • and everything working magically,

  • but that really is what it is,

  • and, right, and so let me just also give you like,

  • sort of drive home that point.

  • So we're gonna posit some sort of model,

  • and so here I'm giving you the simplest example

  • because I think most people here have worked

  • with linear regression at some point in their life,

  • and so you can think of this as a predictive model

  • in the sense that if I give it a bunch

  • of examples of Y and X, and I learn the parameter of beta,

  • then for future examples where I don't have Y

  • but I only have X, I can just compute,

  • X times beta, and I get a prediction of Y.

  • So that's the sense in which I call this a predictive model,

  • and that's very much how machine learning people tend

  • to think of it, where statisticians are often very focused

  • on what is beta, what are the confidence intervals

  • around beta and things like this.

  • So like, there's, that's the sense

  • in which there's a lot of overlap,

  • but the goals are kind of quite different.

  • We want to like, use real data

  • and make predictions, so here it's gonna be predictions

  • about guides, and which guides are effective

  • at cutting and at knockout.

  • Right, and so it has these free parameters,

  • and we call these things that we put in here features,

  • and so in the case of guide design,

  • the question is gonna be, what features are we gonna put

  • in there that allow us to make these kinds of predictions,

  • and, so I'm gonna get into that in a little bit,

  • but just as an example to make this concrete,

  • it might be how many GCs are in this 30mer guide,

  • or guide plus context.

  • Right, and like I said, we're gonna call,

  • we're gonna give it some data,

  • and so in this case, the data for guide design is gonna be

  • data from (mumbles), data from the community

  • that's now publicly available, where there are examples,

  • for example, what the guide was

  • and how effective the knockout was,

  • or what the cutting frequency was.

  • So, say, I get a bunch of these examples,

  • and then that's gonna enable me

  • to somehow find a good beta, and of course we're not,

  • actually, we do sometimes use linear regression,

  • but I'll tell you a little bit more about,

  • more sort of complex and richer models

  • that let us do a lot more, and then the goal is going

  • to be to fit this beta in a good way,

  • and like, I'm not gonna do some deep dive on that here,

  • but the one way that you are probably familiar

  • with is just mean squared error,

  • and when you find the beta that minimizes this

  • for your example training data,

  • then you get some estimate of beta

  • and you hope that on unseen examples

  • when you do X times beta, it gives you a good prediction.
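
As a minimal sketch of what that looks like in code: linear regression fit by minimizing mean squared error, then used to predict on unseen examples. The data here are made-up placeholders, not the actual guide measurements, and scikit-learn is just one convenient way to do the fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in data: rows are examples (e.g. guides), columns are features X,
# and y is the measured outcome (e.g. knockout efficiency).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_train = X_train @ beta_true + rng.normal(scale=0.1, size=100)

# "Training" = finding the beta that minimizes mean squared error on (X, y).
model = LinearRegression()
model.fit(X_train, y_train)

# For a new example where we only have X, the prediction is just X times beta.
X_new = rng.normal(size=(1, 5))
print("learned beta:", model.coef_)
print("prediction for unseen example:", model.predict(X_new))
```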

  • So does that sort of make it somewhat concrete,

  • what I mean by a predictive model

  • and how you could view linear regression

  • as a predictive model, and how you might use this

  • for guide design?

  • Okay, so obviously I'll tell you a lot more.

  • So, right, but linear regression is just sort

  • of the simplest possible example,

  • and so in our work we actually use,

  • some of the time, what are called classification

  • or regression trees, and so in contrast

  • to here where you might have, say,

  • this, you might have a bunch of these features,

  • right, like how many GCs were in my guide,

  • and then another feature might be,

  • was there an A in position three,

  • and you can put in as many as you want,

  • and then you get all these betas estimated.

  • So it's very simple, because in that case,

  • none of these features can interact with each other,

  • right, you just, you know, you just add X one times beta one

  • plus X two times beta two, so we call this like,

  • a linear additive model.

  • In contrast, these trees allow very sort

  • of deep interactions among the features,

  • so this might be how many GCs,

  • so, of course, this is just, I didn't,

  • this is not suited to the features I just described,

  • but this might be some feature like,

  • I don't know, proportion of GCs,

  • 'cause now it's fractional, and then it,

  • this algorithm, which is gonna train the betas,

  • so find a good value beta, well, sort of

  • through a procedure that I'm not gonna go into detail

  • for all these models, how it works,

  • but it's going to somehow look at the data

  • and determine that it should first split

  • on the second feature at this value,

  • and then it will sort of keep going down that.

  • It says, "Now partition the examples

  • "in my training data like this."

  • And then on the second feature in this way,

  • until you end up at the sort of leaves of this tree,

  • and these leaves are the predictions.

  • And so when you do it for the training data,

  • whichever training examples here end up at this leaf,

  • you basically take their mean,

  • and that's now the prediction for that leaf,

  • and if you take a new example,

  • you basically just pipe it through this,

  • these sort of rules, and you end up

  • with that kind of prediction.
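
Here is a small, hedged sketch of that regression-tree idea using scikit-learn; the two features (a GC proportion and an "A at position 3" indicator) and the data are purely illustrative, not the real guide features.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
# Toy features: column 0 could be something like GC proportion,
# column 1 something like "A at position 3" (0/1); purely illustrative.
X = np.column_stack([rng.uniform(0, 1, 200), rng.integers(0, 2, 200)])
y = 2.0 * X[:, 0] + 1.5 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# The printed rules show the splits; each leaf's "value" is the mean of the
# training examples that ended up there, which becomes the prediction.
print(export_text(tree, feature_names=["gc_proportion", "A_at_pos3"]))

# A new example just gets piped through the rules down to a leaf.
print(tree.predict([[0.6, 1]]))
```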

  • This is simplified, but I think it's a good conceptual view,

  • and this is just another way of thinking

  • about it is if you only had two features

  • and you drew them, like, one against the other,

  • then effectively, every time you make a branch here,

  • you're kind of cutting up this space.

  • So that's also just another way to think about it.

  • And so, also, so this is, now all over the press nowadays,

  • and whenever I give these talks,

  • there's a bunch of young, hungry grad students who say,

  • "Did you do deep neural networks?"

  • 'Cause that's what everybody wants to do now,

  • and so deep neural networks, they're kind

  • of like a really fancy linear regression.

  • So you could think of these as the Xs in linear regression,

  • you can think of this as,

  • imagine there's only one thing out here,

  • I should have done a different picture, but that's just Y.

  • And so this again is a mapping where you give it the Xs,

  • you do a bunch of stuff, and out you get a Y here,

  • except linear regression, you know,

  • is this very simple thing,

  • and now you can see, there's all these other kinds

  • of, we call these like, hidden nodes,

  • and so there's this complicated mess now of parameters,

  • beta, and again, I'm not gonna go into it,

  • I just want to give you a sense that like,

  • linear regression is this very simple thing,

  • and there's a lot of other models that let you do much,

  • much better prediction, and these are typically the kinds

  • of models that we use, because they're more powerful

  • if we care about prediction.
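
For a sense of what "really fancy linear regression" means in code, here is a tiny sketch using scikit-learn's MLPRegressor on toy data; the layer sizes and data are arbitrary illustrations, not anything from the actual guide work.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 8))              # the "Xs", as in linear regression
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2]    # some nonlinear relationship to learn

# Two layers of "hidden nodes"; all the weights between layers play the role
# that beta plays in linear regression, there are just many more of them.
net = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
net.fit(X, y)
print(net.predict(X[:3]))
```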

  • But the flip side is they're actually very hard

  • to interpret, and so if you want to ask a question,

  • like, was it the GC feature that is most important

  • in your guide design, which is what I always,

  • you know, get a question like this,

  • and we can do our best to kind

  • of poke and prod at this machine,

  • but it's always a little bit ad hoc,

  • it's hard, the more complicated the model,

  • then the, you know, the better we might predict

  • and the less interpretable it is,

  • and so there's always this kind of tension.

  • So right, and so what are some of the challenges?

  • So I've sort of shown you sort of some,

  • like, increasing amount of complexity in some models,

  • and so one of the big difficulties is,

  • if I posit a very complex model

  • with a lot of parameters, then I need a lot

  • of data in order to actually fit those parameters,

  • and if I don't have enough data,

  • what's gonna happen, is the parameters,

  • you're gonna find this very specific setting

  • of the parameters that effectively memorize

  • the training data, and the problem then is you

  • give it a new example that you really care about,

  • you say, I want to knock out that gene,

  • and it's never seen that gene,

  • 'cause it's sort of memorized, but we say it's like,

  • overfit to the data, it doesn't actually generalize well

  • to these unseen examples.

  • So there's another tension here,

  • which is, you want kind of complex, rich models

  • that can capture very complicated prediction spaces,

  • but if you don't have enough data to warrant them,

  • then you're gonna shoot yourself in the foot.

  • And so, like, so you know, when you learn in this area

  • as like, an undergrad or a grad student how

  • to do rigorous machine learning,

  • this is one of the things you need

  • to understand, how to control that knob,

  • how to know if you have enough data

  • for the model you're looking at,

  • and things like this,

  • and so, in this, this was kind of fun for us to do,

  • the guide design, because a lot

  • of people are using this now,

  • and we know we didn't super overfit it

  • because people are telling us that it's useful,

  • but, you know, we did a lot of due diligence on our end

  • to convince ourselves that that was correct, as well.

  • Right, and even just asking,

  • how can I evaluate a dataset?

  • You know, there's a lot of papers out there

  • where people sort of dabble in machine learning,

  • they go use something and they report

  • that it has this amazing predictive capacity,

  • and in many cases, because it's such a popular area now,

  • people just kind of jump in and they do something,

  • and these evaluations are actually not rigorous,

  • and so you need to read things very carefully,

  • you can't just look at a result the way you can like,

  • look at a gel, or maybe I'm wrong here,

  • correct me if I'm wrong, where if it's not fabricated,

  • then, like, you know, what you see is

  • what you get, whereas here you need to think very deeply

  • about what they've done and if that makes sense.

  • Right, and then of course, you know,

  • like, in linear regression, for example,

  • we assume that the error term is,

  • for example, Gaussian distributed,

  • and so this is like, a very sort

  • of technical, precise thing.

  • Of course, almost, in most cases,

  • in every model, every model has some sort

  • of assumption like this, and basically

  • when you're working with real data,

  • every model is wrong, like,

  • there's no way, I mean, the real model is physics,

  • right, and we're not using physics-based models,

  • they're hard, and they're not scalable,

  • and basically it's almost unheard

  • of except in a few, select fields of like,

  • molecular dynamics, and so again, this is one

  • of sort of the tricks of the trade is how do you,

  • and to do it rigorously, is how do you know that like,

  • that this assumption, even though it may be violated,

  • that your model is still okay,

  • and so that's another consideration.

  • Sometimes we think it's okay, but maybe we can,

  • if we can violate it less by, you know,

  • doing something different with the data before we give it

  • to the model, maybe we can do better.

  • And then in a lot of work I do,

  • sometimes you know exactly a very good model,

  • but it's just really, really slow,

  • and so you need to be clever about actually how

  • to speed things up, and it might be slow when you're trying

  • to infer those parameters during the training phase, I mean,

  • they'll often refer to a training phase

  • where you have this data, and then there's this sort

  • of deployment or test phase

  • where you're just using the model and making predictions,

  • but you don't know the right answer,

  • unless you're evaluating.

  • So that's sort of, I guess, just a sort of overall,

  • little bit of a view of a slice of machine learning,

  • does that, are there any questions so far?

  • Okay.

  • - [Student] So one of the challenges,

  • of course, is you have to

  • have a handle, you have to have a handle

  • on all the features that are important.

  • - Yeah, absolutely, and so in a way,

  • one of the sort of miracle breakthroughs

  • of deep neural networks, and that,

  • like, and that's what's touted, is

  • that you don't need to know them,

  • because it sort of is such a rich machinery,

  • it can create features on the spot.

  • But the problem with them is it needs a vast amount

  • of data to do that, and the other problem is

  • they've really only been shown to be state of the art,

  • I mean, they've broken like, 30 year benchmarks in like,

  • on things like speech processing and vision tasks,

  • which are images and speech

  • which have particular properties,

  • like continuity in time for speech,

  • continuity in space for vision, and so,

  • and people are applying them everywhere now,

  • but in many cases people say,

  • "I used a deep neural network."

  • But you know what, they just used some simpler model,

  • they would have done just as well,

  • but that is sort of, to what you said is exactly right,

  • and when you don't use deep neural networks,

  • and even, then, sometimes if you don't have a lot

  • of data, then yeah, this sort of feature engineering is

  • what they call it, is what do we put in,

  • do we give it just the GC content,

  • do we give it this, but kind of one

  • of the beauties of machine learning is

  • that you can posit a whole bunch of things,

  • throw them all in, and if you have enough data,

  • then it can tease apart what is relevant

  • and what's not, but again,

  • there's always this tension of like,

  • if you give it too many things

  • and you don't have enough data to warrant it,

  • then you're gonna get a bit of nonsense out.

  • - [Student] Is it true that you still can't look

  • under the hood and see which features--

  • - Yeah, so this is, I mean, this is a very active area

  • of research, although I would say it was a bit obscure

  • up until very recently, and the reason it's

  • become much more important is

  • because deep neural networks are so important,

  • and people are using them for problems that like,

  • you know, physicians are like,

  • looking at a CT scan and stuff, and medical areas,

  • especially, people want to know why did it make

  • that decision, like, I'm not gonna give my patient this

  • or that or not do that because your weird,

  • crazy deep neural network just popped out .8, you know?

  • And so, but this is a very active area of research,

  • and there is, there are things you can do,

  • but it's, I would say it's always,

  • in some sense, an approximation,

  • because at the end of the day, it's something operating,

  • it's like, you know, a human works as a whole, right,

  • and then you ask some very specific questions

  • about some individual thing, well, you know,

  • like in biology, we ignore, like,

  • the whole rest of the system just

  • to try to understand, 'cause we don't have a choice, right?

  • And it's very much like that in machine learning, as well.

  • So you do ask, and you get something

  • that's useful out, but you know that

  • by ignoring every other system

  • and every other thing that you can only go so far.

  • Alright, and so I just want to give you,

  • also, a little bit of a flavor

  • of what I could call are the two main branches

  • of machine learning, and everything I've just told you

  • about is what we call supervised learning,

  • and so in supervised learning,

  • why is it supervised?

  • It's because in the case of like,

  • guide design that I'm gonna tell you about,

  • the way we do it, the way we build this predictive model is

  • I give it a supervisory signal in the sense

  • that I give it a bunch of guides, and for each guide,

  • I know how effective the knockout was,

  • and that's the supervisory signal.

  • And so without that, I mean, in a sense,

  • how could you make predictions, right?

  • You can't build a predictive model if you have no examples

  • of how well it's worked in the world,

  • and the trick is to take some limited number of examples

  • and do something that generalizes

  • to a much larger set of data.

  • And so that's, and this is, you know,

  • obviously used for any number of problems

  • in biology, outside of biology, in finance,

  • like anywhere, in chemistry, you name it, this exists.

  • And so I, my own background, I should say,

  • has been a little circuitous, like,

  • my undergrad was actually in physics,

  • and then I actually did computer vision,

  • and then I got into computational biology during my PhD,

  • and I've meandered around like,

  • through sort of the (mumbles) informatics,

  • proteomics, and so in fact, these are kind

  • of problems that I have looked at and now more recently,

  • doing some of this CRISPR guide design.

  • And so, right, and one distinction also is,

  • am I predicting a yes, no, like knockdown,

  • or knockout versus not, that's classification,

  • 'cause you have a class of yes, it was,

  • versus no, it wasn't, and regression is sort

  • of like, how much was it knocked out,

  • or like, you know, what fraction of cells

  • or what was the cutting frequency,

  • something that's not discrete,

  • and we don't need to pay a lot of attention to that,

  • but in our papers on this topic,

  • we have shown that people like to view this problem

  • as zero one, like, you know, cut or it didn't cut,

  • but if you, you know, the assays that you guys produce like,

  • give much more information than that,

  • and it's actually, it's kind of crazy

  • to just ignore that, and therefore we want

  • to be in a regression setting

  • where we're actually predicting real values,

  • and, you know, mostly we care about the rank,

  • 'cause the kind of scenarios that we've been trying

  • to tackle are, someone says,

  • "I want to knock out this gene,"

  • so we want to go through and, I should say,

  • everything we've done here is also for Cas9,

  • just like, the wild type Cas9.

  • And so we want to, you know, we say,

  • our goal is to essentially say,

  • rank all the possible guides for that gene for Cas9,

  • and then compute the on target,

  • according to our models, compute the off target,

  • and depending on your use case, you know,

  • you balance those in some way

  • that you think is appropriate.

  • And so we want rankings here, like,

  • we don't care if we know the exact number,

  • although, I guess, some of our users say,

  • like, "What does that number mean?"

  • And that's very hard to do because

  • of the nature of the data that we take in,

  • and this sort of different assays

  • that get incorporated, and how we have

  • to kind of massage the data to throw them all together,

  • which I'm not gonna talk about here,

  • but it is a very difficult problem, actually.

  • Right, and so this is, in this setting it's pretty easy

  • to evaluate the model, it's not trivial,

  • but it's much easier than the second setting,

  • which is unsupervised, which I'll tell you

  • about in a second, and the reason it's easy is

  • because I have some set of examples

  • for which I know the answer, right,

  • and so the kinds of things

  • that we do is we basically fraction off the data,

  • we keep some amount of our data away from us

  • and we just never look at it until the very end,

  • and then we do a bunch of our own sort

  • of methods on this other part,

  • and we're allowed to look at it as many times as we want,

  • and then at the end of the day,

  • then we go kind of go once to that final validation set

  • and say, "How well did we do?"

  • And we can do that for different methods,

  • right, so we can look at competing methods

  • from the literature, or competing methods we make up,

  • and say like, "Okay, is this, you know, how do they fare?"

  • And so in that sense it's like,

  • there can be problems with that,

  • but it's much easier than in this case,

  • and so this is unsupervised learning

  • and I'm not gonna say anything about

  • that other than this one slide here today,

  • but I just wanted to, if you've heard this,

  • I just thought it might be useful,

  • so probably everyone here at some point

  • in their life was doing something with gene expression

  • and you would draw these little plots from,

  • I think it was from Berkeley, right, wasn't it, Eisen,

  • Michael Eisen, okay, maybe this is a different,

  • I don't know, who here's drawn these plots

  • at some point in their life?

  • Okay, at least, more than people

  • who could define machine learning.

  • So right, so one axis here,

  • or I can't actually see, what is it, yeah,

  • so each of these would be a microarray experiment,

  • every column here, and then people would say,

  • "Cluster these," for, you know,

  • people with colon cancer or something, and then they'd say,

  • "Oh, are there subtypes that we can see

  • "by clustering them according to this tree?"

  • And you'd sort of cut off the tree

  • and give you some clusters.
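
A small sketch of that kind of hierarchical clustering with SciPy, on a made-up expression matrix rather than any real dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
# Toy "expression matrix": rows are genes, columns are samples/experiments.
expression = rng.normal(size=(50, 12))
expression[:, 6:] += 2.0   # pretend half the samples form a distinct subtype

# Cluster the columns (samples), as in the classic expression dendrogram.
tree = linkage(expression.T, method="average", metric="correlation")

# "Cutting off the tree" at a chosen level yields the clusters (e.g. subtypes).
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```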

  • So in this case, the sort of analog

  • in guide design would be, I give you just the guides,

  • but I don't tell you anything else, like,

  • I don't tell you how efficient the knockout was,

  • at all, and then so you're just trying

  • to understand some properties of this set of guides,

  • and for guide design this, as far as I know,

  • this is not interesting or useful,

  • I don't know what you would do with it,

  • but in other cases it is, and I've spent a lot

  • of my life actually in statistical genetics,

  • where you can do some really cool stuff.

  • So when you're doing GWAS, like, a genome-wide association study,

  • and you're scanning through looking

  • for each marker, is it sort of related

  • to this phenotype or this trait,

  • one of the things that really messes you up there is

  • when you have heterogeneity of say,

  • populations from the world,

  • and so one thing people do is they kind

  • of do this unsupervised learning,

  • in this case it's called principal components analysis,

  • you can take all of the genetics and you do this process,

  • and then say, each person in this room, like,

  • say we've measured you with 23andMe, whatever,

  • you have all your genetics, and then you can do,

  • sort of do this unsupervised PCA,

  • and now each person will get two numbers out

  • of this analysis, this unsupervised analysis,

  • and you can plot them here, and when you do that,

  • it actually kind of recapitulates the map

  • of Europe if you happen to know where they're from.

  • So that's not going into the algorithm,

  • and so that's kind of cool, I mean,

  • I haven't told you exactly why we want to do this,

  • but it's just kind of a super neat result,

  • and totally irrelevant to this talk.
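
For the curious, a minimal sketch of that PCA step with scikit-learn, on a random stand-in genotype matrix; no labels go in, and each person comes out with two coordinates you could scatter-plot.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Toy stand-in for a genotype matrix: rows are people, columns are markers.
genotypes = rng.integers(0, 3, size=(200, 1000)).astype(float)

# Unsupervised: we just ask for two numbers per person, with no phenotype given.
pcs = PCA(n_components=2).fit_transform(genotypes)
print(pcs.shape)   # (200, 2) -- the two coordinates per person
```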

  • Right, so that's my sort of very quick

  • and rough intuitive intro to machine learning,

  • and I'm gonna jump into how we use machine learning

  • for the on and off target problem

  • in guide design for knockout, but again,

  • are there any questions at this point?

  • Okay.

  • Alright, so we, this one, the first one I'm gonna talk

  • about is on target, and that one's already published,

  • this is with John and other folks at the Broad,

  • and then this one is in revision now,

  • although it's on bioRxiv, not the most recent version,

  • and of course, our overall goal was to have some sort

  • of end-to-end guide design service,

  • but we've got to do this in pieces,

  • there's almost no end to this, right,

  • even when we have our current one,

  • there's gonna be a million things we can do.

  • And I think, usually, so I actually often talk more

  • to computer scientists, rarely to biologists,

  • and I have to explain in great detail

  • what it is we're actually doing,

  • and I suspect everybody in this room knows,

  • but just to be clear, again,

  • I'm gonna focus on the knockout, although often the assays

  • and the modeling we do are actually cutting frequency,

  • and so in this case I'm considering knocking out this gene,

  • I'm gonna use Cas9 and maybe there's like,

  • NGGs at these four positions, of course,

  • it's more than four, and the idea is I want the one

  • that's gonna most effectively knock it out,

  • or I want to rank order them in some way,

  • right, and let the user decide what to do.

  • And so I want the best, best

  • guide to guide me to the good position,

  • and then, of course, the flip side is

  • if I've chosen that particular guide,

  • then I wanna make sure I'm not disrupting the rest

  • of the genome, and so we model these separately,

  • and now I'll just go through one at a time.

  • Right, so this is a little schematic of what I mean,

  • which I've been alluding to, I think,

  • in the introduction, about how we're gonna do guide design,

  • so again, this is gonna be supervised in the sense

  • that John's gonna give us a bunch of examples

  • in his wet lab where he's measured the knockout efficiency,

  • and what I'm gonna do is I'm gonna take the guide,

  • and it's actually, we have a 30mer here,

  • so it's 23 plus I think it's four extra

  • on one side and three on the other,

  • and to be honest, like, that's not something we ever played

  • with much, somehow I think we just started using 30,

  • maybe we tried just the guide and this helped,

  • I don't actually remember right now,

  • but it's a 30mer sequence,

  • and then we're gonna learn some function,

  • and so again, in the simplest cases,

  • function might be linear regression,

  • where you're effectively just learning some betas,

  • and then it's gonna allow us to make predictions,

  • and the game is gonna be to collect as much data as we can,

  • and to fit these in a suitable way, to pick a good,

  • a good model, and then to fit it in an appropriate manner.

  • So that's the setup, and right, and so these,

  • I've just drawn here kind of three of the key components,

  • the decisions you have to make when you're doing this kind

  • of thing, so the first one in

  • which Dana just was mentioning is like,

  • what features do we give it here?

  • Somehow I go from this, you know,

  • ACTG guide to some numbers,

  • and I'll tell you a little bit about that,

  • and then the other thing is we need some measure

  • of the supervisory signal, is how effective it was,

  • and so that's not, as a computer scientist,

  • that's not in my control, but John

  • and you guys are super clever

  • and managed to figure out how to measure these things,

  • and then our job is to think about,

  • you know, when you measured it for that gene

  • and that gene, are those comparable,

  • and can we throw them together or not,

  • and things like this, or should we,

  • you know, transform it in some way

  • so it better adheres to the model

  • that we're trying to use, maybe Gaussianize the noise

  • and things like this, and then finally,

  • what model are we using, or are we gonna try a bunch

  • of models, and how are we going to decide which one is best?

  • So, right, and so if John, in this first paper,

  • he had some data already there,

  • and then he gave us some more,

  • and overall we had, I think I have a slide on this,

  • it was for just 15 genes, he did systematically

  • every possible place you could deploy Cas9.

  • Okay, great, so right, he had this one he'd

  • already published, which was,

  • they're all cell-surface proteins,

  • and so you could use flow cytometry and fluorescence

  • to separate them and get some measure

  • of knockout efficiency, and then the second class

  • of genes was using a drug resistance assay

  • where I guess you, it's known,

  • and I guess they show in the paper, as well,

  • that they re-established that when you apply this drug

  • and you successfully knock out the gene,

  • then the cell survives, and otherwise not.

  • So using these kinds of tricks,

  • and probably many more that you guys develop,

  • we can get some sense of if the protein was not,

  • you know, not there or not functioning.

  • And you can see here, I guess, you know,

  • large data is in the eye of the beholder,

  • so for modern day machine learning,

  • this is like, minuscule, like,

  • it's a little sneeze, basically.

  • And that makes machine learning really hard,

  • actually, it makes it easy because things are fast

  • and they're not unwieldy and I don't need

  • to worry about memory and compute time,

  • but it's really bad for getting a good model,

  • and that's actually one of the big challenges

  • in this area, is there's very limited data

  • at the moment, and I'll never forget,

  • one talk I saw years ago by Google,

  • like, before machine learning was super famous,

  • and they actually at the,

  • one of the big machine learning conferences,

  • they did this plot,

  • and they basically showed the performance

  • as you change through these different models,

  • and you'd see this model's better, or this model's better,

  • this and this, and then they,

  • I went to the next slide, and they said,

  • "Okay, now we just made the dataset 100 times bigger,"

  • and basically that so dominated the difference

  • between the models that you're like, just, okay,

  • just get more data and don't worry

  • about what you're, you know, what model.

  • But when you're in this regime,

  • you should care about the model,

  • and to be fair, even in that regime,

  • that was just one specific problem,

  • but I think it kind of drives home a very nice point.

  • And so, right, I'd mentioned that different genes

  • and different assays like, kind of yield measurement

  • in a different space, and I'm not gonna,

  • we're actually, we haven't, we have ongoing work,

  • it is a sort of machine learning project

  • on how to handle that in a rigorous, nice way,

  • and in the meantime, we're doing some stuff like changing it

  • to ranks and normalizing within each gene and stuff,

  • I'm not, I mean, if you want to know,

  • we can go into this offline,

  • but I'm not gonna talk about that here.

  • Right, and so now back to this sort

  • of featurization, what we call featurization of a guide,

  • so right, we have the 20mer guide,

  • and then a 3mer after the PAM,

  • and four before, and we need to convert

  • that 30mer nucleotides into something that's numeric,

  • because any model I want to use in machine learning,

  • fundamentally it assumes there are numbers there,

  • they can be real value, they can be negative,

  • they can be discrete, it doesn't matter,

  • but they've got to be numbers.

  • And so, you know, we're not the first people

  • to have this problem, there is a rich history

  • of computational biology and ways

  • to do this, and so this, you know, like we did,

  • this is a pretty standard thing,

  • but just to help you wrap your head around some

  • of the ways in which we might want to do it,

  • imagine that, so we're, a lot of the features we're going

  • to use are based on the nucleotide content,

  • and so what we're gonna do is we're gonna kind

  • of make a dictionary, where if there are four possibilities

  • for a letter, then each of those basically gets,

  • we call it one hot, because one of these bits is on,

  • and the rest are off, right,

  • so this is the one that's hot,

  • that's the one that's hot there,

  • it's called a one-hot encoding, but I think you can see,

  • you just enumerate however many things you have,

  • and then the code for it becomes however many there were,

  • so here there's four, which means there's four digits,

  • and then you just kind of arbitrarily give this one to A,

  • and that one to T, and so on, and now what you do is

  • when you want to convert this to a long, numeric thing,

  • first you look up the T, so you start with a T,

  • so then you get 0010, those are the first four numbers

  • in the long numeric string,

  • and then you look at the second letter and it's a G,

  • so then that's gonna be followed by 1000,

  • and you keep going like this.

  • So that's gonna give you some very long string,

  • but we're not gonna stop there,

  • we're gonna actually do this for

  • what we call order two nucleotides,

  • and we've got this sort of other kind of dictionary.

  • You can do this up to order three,

  • you can do all kinds of other stuff

  • that I'm not even gonna go into,

  • I just want you to like, sort of wrap around,

  • like, in your head, how can you actually do this,

  • and also, the way I've just described this to you,

  • like, in this example, I, it's position specific

  • in the sense that I start here on the left

  • and I ask, was there a T?

  • So this refers to the first position,

  • and when I do this for every guide,

  • it always refers to the first position, right,

  • and so when this is important in the model,

  • it's because it's important in the first position,

  • but I can also do things that are sort

  • of position independent, so,

  • and I'm gonna actually do that,

  • not just for GC, I'm gonna do it

  • for every two pairs of letters, and I'm just gonna say,

  • "How many GCs were there, how many TGs were there?"

  • And things like this, and I'm just gonna tack those on

  • to this thing, this thing's just gonna get longer

  • and longer and longer, right?
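
As a rough sketch of what that featurization might look like in code, assuming a plain 30mer string; this is illustrative, not the exact published feature set.

```python
from itertools import product

NUCS = "ACGT"
DINUCS = ["".join(p) for p in product(NUCS, repeat=2)]

def featurize(seq30):
    """Turn a 30mer into a long numeric feature vector (illustrative sketch)."""
    feats = []
    # Position-specific one-hot encoding of single nucleotides (order 1).
    for base in seq30:
        feats.extend([1.0 if base == n else 0.0 for n in NUCS])
    # Position-specific one-hot encoding of adjacent dinucleotides (order 2).
    for i in range(len(seq30) - 1):
        pair = seq30[i:i + 2]
        feats.extend([1.0 if pair == d else 0.0 for d in DINUCS])
    # Position-independent counts: how many GCs, TGs, ... anywhere in the 30mer.
    for d in DINUCS:
        feats.append(float(sum(seq30[i:i + 2] == d for i in range(len(seq30) - 1))))
    return feats

vec = featurize("ACGTACGTACGTACGTACGTACGTACGTAC")
print(len(vec))   # 30*4 + 29*16 + 16 = 600 numbers from a single 30mer
```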

  • And this is where we need to be careful,

  • like, 'cause it's so long that I don't have enough data

  • to actually deal with it properly.

  • But in this case, like, even though they're really long,

  • because so many things are zeros,

  • it's actually not quite as long as we think it is

  • in terms of the amount of data needed.

  • And then something that comes out as important,

  • and so this is, we sometimes call John our like,

  • oracle featurizer, 'cause John is one, he's smart,

  • two, he has very good intuition,

  • and three, he's up on a lot of literature,

  • and so he said, you know,

  • there's crystallography experiments that show that,

  • you know, something happens at the 5mer proximal

  • to the PAM in sort of a cohesive way,

  • followed by something over here and over here,

  • so maybe when you're doing the thermodynamics,

  • in addition to looking at the whole 30mer,

  • maybe also look at it just in these sub-portions

  • and things like this.

  • And so we interact with John and we get features

  • in this way, as well, so we kind of use our rich history

  • of just computational biology,

  • and then we use, you know, domain specific expertise,

  • and if we had, you know, sort of an infinite amount

  • of data, we could probably skip thinking

  • through these things at all.

  • Alright, and so the last part,

  • like, I've now told you a little bit

  • about how we do this, and you guys know more

  • about this than I do, and so now this sort

  • of last puzzle piece is what kind of model are we going

  • to use, and so I've already told you a little bit

  • about these regression trees, and so actually a lot

  • of our CRISPR work uses these, and they're very rich models

  • because you can get these interactions, right,

  • if I split on this feature, let's say this is GC content,

  • and then I go down to here, and this is,

  • let's say A in position one,

  • then this is a very strong interaction

  • between these two features,

  • because I have to go down this path before I get it.

  • And so you can get these very complicated interactions,

  • which is also why it's hard to interpret,

  • because you kind of need to like,

  • look at every interaction all at the same time

  • to kind of understand it.

  • So what we actually do is we actually use a bunch

  • of those models together, there's a way

  • to combine these models to do even better, still,

  • and this is called boosting, does it have the name

  • on here, boosting, somewhere?

  • Boosting, so boosting is a way where you take some sort

  • of somewhat simple model, these aren't simple models,

  • but we actually cripple them a bit,

  • we keep them very, very simple,

  • and then we apply this boosting technique.

  • And so let me just explain to you intuitively

  • what boosting is like, and this is, I think,

  • pretty easy to understand on an intuitive level.

  • You take your data and you fit the parameters

  • on your training data, and then you go to your training data

  • and you ask how well did my physicist fitted model do

  • on every one of my training examples,

  • and then you weight each of your data points inversely

  • to how well it did.

  • In other words, if on that training point,

  • after I fit the model, it did badly,

  • then it's gonna get up-weighted,

  • and the stuff that it did, you know,

  • really well on is gonna kind of fall out,

  • it's gonna get a very low weight.

  • And then you reapply this learning algorithm,

  • you re-fit another tree, and you add the results,

  • then you start averaging the results

  • from a sequence of these trees.

  • And so there's a whole theory underpinning

  • why you can do this, what that's actually doing,

  • but on an intuitive level, that's all it's doing,

  • and so it's, in some sense it's refocusing its energy

  • on getting those things that it didn't get right

  • by subsequently adding in another tree, and these are,

  • it turns out to be very powerful models

  • in a large number of domains.

  • And so, in fact, that is what we use here.
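
A hedged sketch of boosted regression trees using scikit-learn's GradientBoostingRegressor on placeholder data; gradient boosting fits each new shallow tree to what the current ensemble still gets wrong, which is the same spirit as the reweighting picture described above, though the exact model and settings in the published work may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 20))          # stand-in for featurized guides
y = np.tanh(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=400)

# Each tree is kept deliberately shallow ("crippled"), and boosting adds trees
# sequentially, each one focusing on what the ensemble so far still gets wrong.
model = GradientBoostingRegressor(
    n_estimators=200, max_depth=3, learning_rate=0.1, random_state=0
)
model.fit(X, y)
print(model.predict(X[:3]))
```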

  • - [Student] You also do an (mumbles)

  • where you dropped out a feature,

  • and see whether it made a difference,

  • (mumbles) had a couple of features,

  • 'cause you can see the (mumbles).

  • - Yes, so I guess there's two things we do

  • when we're doing things, is one is,

  • we try to get the best model possible

  • that we think will be the best, you know,

  • like, for whoever wants to use it,

  • we really think this is the best thing,

  • and then another thing we do is we want

  • to interpret it as best as we can,

  • and so sometimes we drop out a feature

  • because we think maybe it'll be better,

  • but usually dropping one feature's not going

  • to make a difference,

  • because if it's just one feature and we drop it out,

  • it would have known it wasn't useful anyway, usually.

  • Like, it could be that if you, you know,

  • posited that there was millions of features

  • and then you got rid of half a million,

  • that might help you if they were irrelevant,

  • but if it's just one irrelevant one, it can figure it out,

  • or else it's just doing a bad job, anyway, basically.

  • But then you might do it, like, you know,

  • we have some figures in our paper about here's the accuracy

  • with all the features we finally used,

  • and here's the accuracy if we pull this one out

  • to give people a sense of like, well, how important was it?

  • But that doesn't necessarily tell you,

  • actually, so it's actually more nuanced than that,

  • because it could be that if I use this feature by itself,

  • and we see this all the time, I think this happened

  • with the microhomology feature, which we don't put in our model.

  • If you put that in just by itself,

  • it turns out to be quite, or somewhat, predictive,

  • I don't remember how predictive this was.

  • But if you take the model we've already settled on

  • and you add it in there, it makes no difference

  • because things are, you know, like,

  • for example, nucleotide features cover GC content,

  • and GC content is a proxy to thermodynamics,

  • and you don't know what's a proxy

  • for what on things like this,

  • and so because everything's interacting, like,

  • you can't, all you know is in the context of this model,

  • if I pull it out, it doesn't change the performance,

  • but that doesn't mean that it's actually not predictive,

  • right, and then the flip side is,

  • if I put it in by itself and it's not predictive,

  • it doesn't mean it's not predictive,

  • because I could have put it in with a bunch of other stuff,

  • and together it might have been predictive.

  • So this is why it's actually hard to do that,

  • and it's also why it's hard to interpret,

  • yeah, but you can, you know, we do what we can

  • in that vein to sort of poke and prod as best we can.

  • - [Student] You said something, to the previous slide,

  • and I wanted to make sure we got it right,

  • which is that, sort of, you know what,

  • we're having to go in because the dataset is small and say,

  • "These are the features we hypothesize are important,"

  • that's where, give it that--

  • - Yeah.

  • - [Student] (mumbles) and we might be leaving things out--

  • - Absolutely, yeah, completely.

  • - [Student] That's really what you were saying was

  • that if you had a larger dataset,

  • we could, we wouldn't have to be as correct

  • about those features going in for the model

  • if we start applying some of those more complicated things

  • about (mumbles)?

  • - So yes and no, yes, the larger the data,

  • the more complicated a model you can give it

  • with more parameters,

  • and the more complicated a model it is,

  • then with infinite data, it can make up,

  • figure out these features, but so you need

  • to both increase the data

  • and increase the complexity of the model.

  • So if you did just one or the other, then that's not enough,

  • unless you're in a setting where you're using too complex

  • of a model than you should be

  • and then you increase the amount of data,

  • but yeah, but I mean, I think your intuition's correct,

  • I just wanted to make sure that you know that you needed,

  • these things need to go kind of in lockstep, in a sense.

  • - [Student] So getting more data is not a magical fix

  • for us not understanding which features are (mumbles)?

  • - No, and in a sense, the more data you have,

  • you can afford to do more complex modeling,

  • and the more complex modeling, the harder it is

  • to tease apart what's going on,

  • unless you went to some physics based thing,

  • but like, nobody really does that.

  • Yeah.

  • Right, and so at the end of the day,

  • we're using these boosted regression trees,

  • and so when we've done, this is for the

  • on target problem using those 15 genes,

  • and the colors look slightly different.

  • Okay, so anyway, this is, and John and colleagues,

  • actually, this is the paper they had in revision

  • when I first went up to John and started talking to him,

  • and that is, was sort of the state of the art

  • for on-target, predictive accuracy of knockout,

  • and that's performances, in blue,

  • so this is flow cytometry data,

  • this is drug resistance, and this is combined,

  • and this is when you train on one test,

  • on the other, of course, we're careful,

  • when we do that, we don't literally train

  • on that whole dataset and test on it again,

  • we partition it in this way that I described,

  • this is just a shorthand notation to say,

  • we're only considering this data when we do that.

  • And similarly here, and then you can see,

  • the boosted regression tree is doing better.

  • And actually, also, everybody had

  • actually been doing classification,

  • and we can show through a series

  • of what we call experiments, like,

  • in silico experiments that, really,

  • you don't want to do classification here,

  • you want to do regression,

  • because the model can use that fine grain assay information.

  • Right, and so some of the features that come out here,

  • and I put them into groups, so, for example,

  • the topmost, and again, take this with a grain of salt,

  • because the other thing is, with these models,

  • sometimes you can fit them this way,

  • you can just tweak the data,

  • you could imagine taking out 5% of the data,

  • you can refit it,

  • and you might get something very different here,

  • because there's many ways in which you can get

  • an equally good predictive model.

  • And so as much as people really want

  • to latch onto these lists, and we make them,

  • like, I really, you know, and whenever we sent something

  • in for review, they're always like,

  • "Wow, we need to understand this better."

  • But there's really only so much you can do

  • with this kind of an analysis,

  • but I think, you know, it gives you a flavor of like,

  • you know, stuff that's not important will not be

  • in this list, that much you know, and I think, you know,

  • so there is something to be had from this,

  • you just shouldn't read it too, too literally, but.

  • So the position dependent nucleotide features

  • of order two, so one example of that would be there's a TA

  • in position three, and this is the importance

  • of all of those features together, and it turns out,

  • in this case, to be the most important thing,

  • and then position dependent order one, so that's like,

  • is there a T in position 21, or 20, or something like this,

  • and all of those together turn out to be important.

  • And then position independent order two,

  • this is something that includes GC, right,

  • 'cause one example of a position independent order

  • two feature is how many GCs were in that 30mer.

  • But you can see, even with the GC content in there,

  • the thermodynamic model that we use comes out

  • to be important as well, and moreover,

  • those little bits that John inferred were important based

  • on crystallography, like, actually taking subsets

  • of the 30mer turned out also to be important,

  • so you can see, that's the 16 to 20,

  • eight to 15, and then one thing that,

  • for knockout is important, is actually,

  • where is the guide actually cutting,

  • right, so if it cuts at the very end

  • of the gene, it might still form a functional protein

  • that passes this assay kind of thing,

  • whereas if it's at the very beginning,

  • it's sort of less likely.

  • That's not, you know, if you look at the data,

  • that's not crystal clear, but it's there enough

  • that it comes out to be important.

  • And then it turns out the interaction,

  • again, this is what we call another John oracle feature,

  • which is between the nucleotides on either side

  • of the GG in the PAM, if you actually say,

  • "I want to know what both of those are at the same time,"

  • that that turns out to be important,

  • and actually, and GC count still comes out, as well.

  • And then you can drill down, right,

  • these were like, groups of features,

  • these were all the position dependent order two features,

  • and then this is just the top of the list in rank order

  • of the actual individual features,

  • you know, if you go to the SI you can find all of them,

  • percent peptide is again, just a measure of like,

  • were you 50% through the peptide

  • where you targeted it, or were you 100% through,

  • so you can quantify that as a percentage,

  • or as an absolute number, like I was 30 nucleotides in,

  • and we put both of those into the model.

  • Right, the other one's amino acid cut position,

  • so those actually are very important,

  • and then the second most important one

  • is actually the 5mer end, the thermodynamics.

  • Alright, so, and this is actually,

  • it seems, I don't know if anyone here's using it,

  • but a lot of people tell me they're using it,

  • and there're actually two startups

  • that are using this now, as well,

  • and everything I've just told you is actually

  • from this paper, but we're right now in the process

  • of bringing in more data to try and improve it,

  • and trying to see how chromatin accessibility

  • and other things like this might improve the model, or not.

  • But that's in the works, that's not out yet.

  • And then one thing, I actually don't usually put this

  • in my talks, but I thought I'd get a question

  • about is just like, you know,

  • what if you had more data, how much better would you do,

  • and this is like, one of the first things John asked us,

  • and he said, "Maybe you can draw a plot,"

  • something like this, so, you know,

  • we actually had, I guess we had 17 genes,

  • and so we always want to partition the data

  • when we're testing, right, so if I have 17 genes

  • and I want to use as much of it as possible,

  • I could use 16 to train and test on one

  • and then I could keep doing that,

  • I could shuffle and then do it with a different one,

  • a different one, and so that's this end here,

  • and that's the performance there,

  • and you can sort of see, as I let it use fewer

  • and fewer number of genes to train,

  • the performance kind of goes down,

  • and John was like, you know, is it gonna plateau,

  • does that mean we have enough data,

  • and getting more data won't help,

  • which, you know, of course he was hoping we would say yes

  • to that, but, and you could kind

  • of make up that story in your head based

  • on this plot, but there's a whole bunch

  • of caveats with that, and it kind of looks like that here,

  • but I don't think it's the case,

  • and if you want to know more, maybe ask me after, but.
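
A minimal sketch of that gene-wise evaluation, assuming toy features, labels, and gene assignments: train on all genes but one, test on the held-out gene, rotate through every gene, and score with Spearman correlation (explained just below).

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(7)
n_genes, guides_per_gene = 17, 20
X = rng.normal(size=(n_genes * guides_per_gene, 15))     # stand-in guide features
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=len(X))
genes = np.repeat(np.arange(n_genes), guides_per_gene)   # which gene each guide targets

# Train on 16 genes, test on the one held-out gene, then rotate through all 17.
scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=genes):
    model = GradientBoostingRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    rho, _ = spearmanr(y[test_idx], model.predict(X[test_idx]))
    scores.append(rho)
print(np.mean(scores))
```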

  • Yeah.

  • - [Student] What's the Spearman-r (mumbles)?

  • - Oh, I'm sorry, yeah, Spearman-r I'm gonna talk a lot

  • about, so when we want to evaluate how well it does,

  • let's imagine that I have, you know,

  • 10 guides that I've held out, and I've trained my model,

  • and I would say, "How well does the model work?"

  • So I know the knockout efficiency for those 10 guides,

  • and I had my predictive model,

  • and what I want to do is basically take the correlation

  • between those.

  • If, you know, if one, I don't know if we care

  • about the exact values, but we care about the rank order,

  • and so we do it actually with the rank.

  • So we just convert the predictions into ranks,

  • and then we convert the real assay values into ranks,

  • and then we do like, a Pearson correlation.

  • So it's really just getting at,

  • are my predictions in the right order,

  • according to the ground truth, and Spearman is doing it

  • on the ranks because there's a lot of messiness

  • with the raw values that we get out here,

  • and this, I think, is enough that it is still very useful.
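
A small sketch of that Spearman evaluation on made-up numbers, showing that it is just Pearson correlation computed on the ranks.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Measured knockout efficiencies for 10 held-out guides, and model predictions
# (made-up numbers, purely for illustration).
measured = np.array([0.91, 0.12, 0.55, 0.40, 0.78, 0.05, 0.63, 0.33, 0.70, 0.20])
predicted = np.array([0.80, 0.20, 0.60, 0.35, 0.90, 0.10, 0.55, 0.30, 0.75, 0.25])

# Spearman = Pearson correlation on the ranks, i.e. it only asks whether
# the predictions put the guides in the right order.
rank_m = measured.argsort().argsort()
rank_p = predicted.argsort().argsort()
print(pearsonr(rank_m, rank_p)[0])
print(spearmanr(measured, predicted)[0])   # same value, via scipy directly
```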

  • - [Student] That is one (mumbles)?

  • - Oh, yes.

  • (student mumbles)

  • Yes, that is correct, yeah.

  • Exactly.

  • So that's the on-target, and now I'm gonna,

  • what time, I'm all confused, I thought we were starting

  • at eight, but it started at 8:30,

  • so okay, I think this is good,

  • yeah, lots of time, yeah.

  • So now I'm gonna talk about the off-target,

  • but unless there are questions about the on-target first,

  • that's sort of the much simpler one

  • from a machine learning perspective,

  • it was much, much easier to do,

  • the next one is a bit of a nightmare. (laughs)

  • - [Student] Can we come back to it later--

  • - Yeah, we can, absolutely.

  • - [Student] Look at the parameters that we--

  • - Yeah, yeah, yeah, yeah, definitely, yep, okay.

  • Right, so yeah, so this is a much more challenging problem

  • for us for a number of reasons,

  • and so one thing is that we're now kind

  • of looking for accidents, right,

  • we're not, we don't have this perfect complementarity

  • between the guide and the target,

  • which you have for on-target, we're saying,

  • well, did it accidentally do something there?

  • So now we need to consider mismatches

  • between the guide and the target,

  • and moreover, we need to do that across the whole genome.

  • So now, suddenly, instead of just saying,

  • does this guide and this target,

  • is it gonna cut and knock out the gene?

  • I need to scan the entire genome

  • and then I need to somehow predict

  • what is, how active is it gonna be at each of those places

  • in the presence of some number of mismatches.

  • So this is just like, actually, wildly more difficult

  • and computationally expensive, and so,

  • I mean, one is, I have to scan the whole genome,

  • so that by itself is really annoying,

  • and then from a modeling perspective,

  • the fact that I have to tolerate mismatches means,

  • I guess in computer science lingo,

  • we'd say this is like a combinatorial explosion

  • of possibilities, like, and I'll show you how it explodes,

  • is that the next slide?

  • No, but like, so the more mismatches I tolerate,

  • kind of, the more examples I need for the model,

  • because there's just so many possibilities,

  • and there's actually, I think there was even less data here

  • than for the on-target, depending how you count data,

  • but it's really, really limited.

  • Right, and so these very simple three

  • steps here took us a while just to get to that,

  • and so ultimately, though, this is how we think

  • about the problem, is given a particular guide,

  • our goal is gonna be to find out where

  • in the genome there's likely to be off-target activity,

  • using our model, and then we also want

  • to summarize it into a single score for that guide,

  • because people don't want to like, for a single guide,

  • they don't want to like, ponder, I mean, they might,

  • for some use cases they might want to do that.

  • But I think a lot of people,

  • depending again on the use case,

  • they want just a single number

  • with the option to drill down

  • about where the off-target activity is.

  • So what we're gonna do is we're gonna start

  • with a single guide, and then we're gonna scan

  • along the genome and filter, essentially,

  • like, to a shortlist of potential active off-target sites,

  • and then once we have that shortlist,

  • then we're gonna use machine learning

  • to score each of them with one of these predictive models,

  • and then we're going to aggregate them all

  • into a single score.

  • And so most of the work that we,

  • well, all the machine learning is down here, and one could,

  • in principle, actually use machine learning up here,

  • as well, but there are just already too many moving parts,

  • and so we actually just stuck to doing a heuristic,

  • which is to basically cut it off roughly,

  • actually, it's not at three mismatches, huh.

  • I just realized, this slide,

  • it's something a little bit different,

  • I think it actually goes up to six,

  • but it's allowed to ignore the three (mumbles),

  • or something like this, so I apologize,

  • I don't have exactly the right details here.

  • But this shows you this, what I was telling you

  • about a combinatorial explosion,

  • which is if I tolerate only one mismatch,

  • so this is a back of the envelope calculation,

  • right, like, let's say there's 100 sites genome-wide,

  • roughly, and the more I go, like,

  • the more sites I have to consider,

  • and so when I do this filtering down to a shortlist,

  • like, I can only go so far and that's also why,

  • actually, it stands to reason

  • that machine learning could maybe do better here,

  • because this is a very rough criterion of just like,

  • number of mismatches, and we know that that's not

  • what's driving the model, because we've done the modeling,

  • but it definitely has a big impact,

  • and so it's not a bad thing to do.

  • So right, and there's lots of,

  • there's lots of software out there

  • that actually does this, like tons

  • of things that will go and find all of these things for you,

  • and I guess our add-on contribution is really

  • to do the machine learning on top of that in a way

  • that is really better than what's out there.

  • Right, so that's, all I'm gonna tell you

  • about this first step is it's basically just a heuristic

  • where we say we allow up to some number

  • of mismatches, and that's our shortlist for this guide.
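
A toy sketch of that first, heuristic step, assuming "filter" just means counting mismatches within a fixed-length window; real pipelines use indexed search tools rather than a naive linear scan, and the sequences and cutoff here are made up.

```python
from typing import List, Tuple

def mismatch_count(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def shortlist_sites(guide: str, genome: str, max_mismatches: int = 3) -> List[Tuple[int, str, int]]:
    """Toy scan: return (position, site, mismatch count) for every window of
    the genome within max_mismatches of the guide."""
    k = len(guide)
    hits = []
    for i in range(len(genome) - k + 1):
        site = genome[i:i + k]
        mm = mismatch_count(guide, site)
        if mm <= max_mismatches:
            hits.append((i, site, mm))
    return hits

# Made-up 23-nt guide and a short toy "genome" just to exercise the function.
guide = "ACGTACGTACGTACGTACGTACG"
genome = "TTTACGTACGAACGTACGTACGTACGTTT"
print(shortlist_sites(guide, genome, max_mismatches=2))
```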

  • So right now, the problem is,

  • how do we score the off-target activity

  • for each member of that shortlist?

  • And so there have been some previous approaches on this

  • from your community that's putting out these great papers

  • and they're making very clever models

  • that actually do surprisingly well,

  • in a lot of cases, given that they're like,

  • just sort of like, declared by fiat to be the model,

  • it's really amazing, and so we thought,

  • let's bring in machine learning

  • and see if we can do any better here,

  • and at the time we started to do this,

  • CFD, which actually was John's model,

  • in that paper we wrote together,

  • but we ourselves didn't get to the off-target problem,

  • 'cause we were, just didn't have time for that paper,

  • is basically counting frequency of knock-down,

  • according to some very simple features,

  • and I'll tell you a little bit,

  • I think I have a little bit more about that.

  • And so we're gonna, again, do this machine learning thing.

  • And so the last picture I showed you like this was

  • for on-target, and in that picture,

  • it did not show you the target,

  • if you might recall, it had only this one column,

  • which was the guide, and then it mapped directly

  • to the activity using just the guide.

  • And so, but now, because of the mismatch,

  • I have to include the target as well, right,

  • and on-target, by design, these were complementary,

  • and so this was redundant information,

  • but now there might be mismatches between these,

  • so if I know this, I don't necessarily know that,

  • so I have to throw that into the mix, and so now,

  • it's this function I want to learn is a function

  • of the guide and the target,

  • and then I'm gonna play the same game,

  • I'm gonna get as much data as I can,

  • and I'm gonna try and learn this function.

  • And actually, in this case we're using only 23mers,

  • not 30, and the reason for that is

  • I think the datasets we had only had 23,

  • and it was a real pain to make it 30, so we just left it

  • at 23, so how you store your data

  • in your Excel sheets may get baked in,

  • even in machine learning we get a bit lazy.

  • Alright, and so now I need to also play this game again

  • where I go from these now,

  • a guide and a target, and go to some vector,

  • and mostly I'm gonna do the same thing I did for on-target,

  • so everything I already told you,

  • I'm gonna do that for the guide again,

  • and that's gonna go in here, but then I need

  • to somehow also encode the mismatch,

  • and I can do that in different ways.

  • I could, for example, I could just say,

  • "There is a mismatch at position eight,"

  • but not tell it what it is, and then separately say,

  • "One of the mismatches was a T to an A,"

  • or I could say, "There was a mismatch at position eight,

  • "and it was a T to A."

  • And so there's some flexibility there,

  • and we actually do several of those things,

  • and again, if we had enough data, we wouldn't actually have

  • to enumerate these different things,

  • we would just kind of give it the components

  • and the model could figure out what the features should be,

  • but we have very limited data.
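
One illustrative way to encode a guide-target pair along the lines just described: one-hot features for the guide itself, plus separate "mismatch position", "mismatch identity", and combined "identity at position" features. The feature names and exact encoding here are mine, not necessarily the paper's.

```python
from typing import Dict

def featurize_pair(guide: str, target: str) -> Dict[str, int]:
    """Hypothetical sparse feature dictionary for one guide-target pair."""
    assert len(guide) == len(target)
    feats: Dict[str, int] = {}
    # One-hot encode the guide sequence itself (as in the on-target featurization).
    for pos, nuc in enumerate(guide):
        feats[f"guide_{nuc}_at_{pos}"] = 1
    # Mismatch features: position only, identity only, and identity plus position.
    for pos, (g, t) in enumerate(zip(guide, target)):
        if g != t:
            feats[f"mismatch_at_{pos}"] = 1
            feats[f"mismatch_{g}_to_{t}"] = 1
            feats[f"mismatch_{g}_to_{t}_at_{pos}"] = 1
    return feats

print(featurize_pair("ACGTA", "ACCTA"))  # toy 5mers; the real inputs are 23mers
```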

  • And then it gets, this is

  • where it also gets super complicated.

  • So, again, I mentioned this sort of combinatorial explosion

  • of possible number of mismatches,

  • so like, I don't know how much data there is

  • that has multiple mismatches, but it's like,

  • you know, maybe 1,000 examples,

  • and you saw how many mismatch examples there were,

  • right, like, I mean, thousands and thousands.

  • So we have almost no data, I would say,

  • it's almost like no data.

  • However, what John did is he built,

  • and I think this is kinda gonna sound insane,

  • I think it is insane, but it actually works,

  • is he took just CD33,

  • and he enumerated every single mismatch guide target pair,

  • and he measured the knockout efficiency for that,

  • and from that, we can, like, if it's just one mismatch,

  • we can build just a one mismatch model

  • and not consider the combinatorial explosion.

  • And that gets us a one mismatch model,

  • and that's actually what we do is,

  • we build this one mismatch model just from CD33 data,

  • and I'm super curious

  • to know if we actually had some more genes in there,

  • how the performance would change since, in general,

  • we're not at all testing on CD33 here, and then,

  • the thing is that we want to be able

  • to take any arbitrary example and kind

  • of bootstrap ourselves using that single mismatch model

  • to get a prediction for it.

  • So how am I gonna do that?

  • What I'm gonna do in the first, on this slide so far is,

  • I'm gonna pretend that this two mismatch,

  • or it could be three, if there were three mismatches,

  • there'd be another figure here,

  • I'm gonna break it down into examples,

  • like, pseudo-examples of single mismatch guide target pairs,

  • even though that's not what it was,

  • and I'm just gonna do this,

  • 'cause I don't really have a choice,

  • 'cause there's just not enough data,

  • and then I'm gonna assume that they are,

  • for example, independent, and if they're independent events,

  • like, I can multiply them together and I can get a score.

  • But they're not really independent,

  • and it's not really the right model,

  • so we're actually going to use the very limited amount

  • of multi mismatch data with a really simple model

  • that has very few parameters to actually learn

  • to do something better than multiplying them together,

  • and that buys us a little bit extra,

  • doesn't buy us a lot, because there's just very little data,

  • so there's not a lot we can learn,

  • and if we had more data, we could probably do much,

  • much better here.
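
A rough sketch of the decomposition just described: split a multi-mismatch pair into single-mismatch pseudo-pairs, score each with the single-mismatch model, and combine. The independence-assumption baseline is a plain product; the actual second layer is a small learned model instead. The scorer below is a stand-in, not the trained CD33 model.

```python
from typing import Callable, List
import numpy as np

def split_into_single_mismatches(guide: str, target: str) -> List[str]:
    """Turn one multi-mismatch target into several pseudo-targets,
    each carrying exactly one of the original mismatches."""
    pseudo_targets = []
    for pos, (g, t) in enumerate(zip(guide, target)):
        if g != t:
            pseudo_targets.append(guide[:pos] + t + guide[pos + 1:])
    return pseudo_targets

def product_score(guide: str, target: str,
                  single_mm_model: Callable[[str, str], float]) -> float:
    """Independence-assumption baseline: multiply the single-mismatch scores."""
    scores = [single_mm_model(guide, p) for p in split_into_single_mismatches(guide, target)]
    return float(np.prod(scores)) if scores else 1.0

# Stand-in scorer: penalize mismatches more the further along the guide they sit.
def toy_single_mm_model(guide: str, pseudo_target: str) -> float:
    pos = next(i for i, (g, t) in enumerate(zip(guide, pseudo_target)) if g != t)
    return 1.0 - 0.8 * (pos + 1) / len(guide)

print(product_score("ACGTACGT", "ACCTACTT", toy_single_mm_model))
```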

  • Does that make sense how I like,

  • I take CD33 single mismatch data,

  • I build just a single mismatch model,

  • which we hope is not specific to CD33, and it's clearly not,

  • 'cause we do get some generalization performance,

  • but to be clear, off-target prediction's actually not great,

  • it could, like, it's pretty low,

  • and then we break down examples into the single mismatch,

  • and we could, in principle, just multiply them together,

  • and that would actually not do that badly,

  • but we add on a layer of machine learning

  • with the very limited amount of multi mismatch data,

  • which we get mostly from GUIDE-seq data.

  • And we're doing all unbiased assays here for off-target,

  • 'cause we don't want to focus on kind of weird things,

  • so we're doing unbiased assays, like GUIDE-seq,

  • that like, search over the whole genome,

  • even if they're not as sensitive.

  • And again, right, so we actually use

  • these boosted regression trees

  • for the single mismatch model, and then this is,

  • I said we have to use a very, very simple model here,

  • because we have so few examples,

  • and here we just use linear regression and actually,

  • you may not know what L1-penalized is,

  • but it's basically a super-constrained version

  • of linear regression, where you kind of force

  • or coerce the parameters to be closer

  • to zero than they would otherwise, yeah?
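
For readers who want to see what those two model classes look like in code, here is a minimal sklearn sketch: gradient-boosted regression trees for the data-rich single-mismatch layer, and an L1-penalized (Lasso) linear regression for the tiny second layer, whose penalty pushes coefficients toward or exactly to zero. The data below is random placeholder data, not the actual CD33 or GUIDE-seq features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Placeholder matrices; in practice these come from the featurization step.
X_single = rng.random((500, 40))   # many single-mismatch examples
y_single = rng.random(500)         # measured activity
X_multi = rng.random((60, 5))      # very few multi-mismatch examples
y_multi = rng.random(60)

# Layer 1: boosted regression trees on the single-mismatch data.
trees = GradientBoostingRegressor(n_estimators=100, max_depth=3)
trees.fit(X_single, y_single)

# Layer 2: L1-penalized linear regression; the penalty shrinks many
# coefficients to exactly zero, which keeps this tiny model from overfitting.
combiner = Lasso(alpha=0.05)
combiner.fit(X_multi, y_multi)
print(combiner.coef_)  # sparse: some or all entries end up exactly zero
```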

  • - [Student] For the previous slide, does it take

  • into consideration the applying (mumbles),

  • for example like here, you have it--

  • - Yeah.

  • - [Student] Whereas--

  • - Right.

  • So it doesn't do it in the sense

  • that we don't compute binding affinities of those things,

  • but we give the model that information

  • that it was a T to a G, and so if it has enough examples

  • of that to tease apart that that was useful,

  • then it could, in principle,

  • learn that kind of a thing, yeah.

  • - [Student] It just doesn't take into account the (mumbles).

  • - So, actually, so yeah, in this layer here,

  • where I said we'd do something that,

  • so one could multiply these together,

  • but instead we learn a model here, and that model, so,

  • and right, some of the other models out there

  • that were not machine learning take

  • into account how close together the mismatches were

  • and things like this, I think that's

  • what you're maybe asking, or--

  • - [Student] Actually,

  • so if a TC mismatch is in a TTT context--

  • - Yeah.

  • - [Student] On each side, it'll have a different--

  • - Yes, yeah.

  • - [Student] Stability than--

  • - Right, right, so we implicitly have the ability

  • to capture that, and the way we have the ability is

  • that when we encode this mismatch, one of the ways

  • in which we encode it is we say,

  • "There's a T to C mismatch in position,"

  • I don't know what that is, seven?

  • And then separately, we have features that just say

  • there's a T to C, and that there's a mismatch in position seven,

  • and so it can combine these things and learn,

  • it can put the pieces, like the Lego pieces together

  • to infer that potentially, if there's enough data,

  • but we don't do it more explicitly

  • than what I just described.

  • - [Student] 'Cause that was one of your,

  • the nearest neighbor identity was one

  • of your features in the on-target (mumbles), right?

  • - In one sense, like, so what did the features say?

  • - [Student] Well, just that, you go down the sequence

  • and you say, not just the G--

  • - Yeah, yeah.

  • - [Student] But GG in positions one and two,

  • and that was one of the features.

  • - No, you're right, actually,

  • we could augment the way we have featurized

  • the mismatch specifically to more directly

  • to encode that, and it might help.

  • In principle, if we had enough data,

  • it wouldn't help, but we don't have a lot of data,

  • so it might help, and so that is something we could try,

  • actually, that's a good idea, yeah.

  • Right, and so now we actually have more,

  • so this is in revision, we're hoping to submit

  • in the next few weeks, and we actually,

  • maybe someone in this room said,

  • "Please, like, get some more data

  • "and, you know, really validate this."

  • And so we started working with,

  • I'll mention at the end, 'cause I don't know how to,

  • Ben's, I'm trying to say Ben's last name,

  • (mumbles), and so we have some more data

  • and show that this generalizes beyond

  • what I'm gonna show you here, but I don't have

  • that in the talk yet, and so right,

  • there's this CD33 single mismatch data,

  • you can see there's about 1,000 guides,

  • and then this is sort of,

  • I think from the original GUIDE-seq paper,

  • and there's nine guides, up to six mismatches,

  • and so it says there's only 354 which are active in any way,

  • that is, that the GUIDE-seq assay, like, finds them at all,

  • and then so, if you just look at N total,

  • you think you have this huge sample size,

  • but those are just basically zeros, like,

  • non-active targets you get by searching the genome,

  • to end up with this shortlist, like say, up to,

  • whatever, three or six mismatches, and so that's your kind

  • of candidate list, and then GUIDE-seq says,

  • "Well, I found activity at these very few."

  • So even though this N is huge,

  • this is tiny, and that is really important.

  • So we don't have, in a sense we don't really have

  • anywhere near that number, and then,

  • so there's a nice review paper from Max Haeussler

  • and colleagues where they were comparing

  • on-target and off-target algorithms,

  • and they did a nice compilation

  • of some other unbiased assays that included GUIDE-seq,

  • and so we use as a separate dataset,

  • basically all of Haeussler minus this GUIDE-seq,

  • and so those are kind of two different assays,

  • and those were all cutting frequency assays,

  • I don't remember the details.

  • But again, you can see here,

  • that once we subtract out the GUIDE-seq,

  • like, this is tiny, I mean, this is just really,

  • really tiny, and it's actually,

  • CD33 is the only really big dataset we have,

  • and it's kind of remarkable we can do,

  • get anything at all out of this.

  • Right, and so one question that, you know,

  • with the on-target, there's a whole bunch

  • of standard measures we evaluate things,

  • like the Spearman, you said, "What is the Spearman?"

  • And, you know, that's just sort of something like,

  • you just think immediately you do,

  • and you'll see this in all kinds of papers

  • in many domains, and if you're doing classification,

  • there's something called the ROC curve,

  • which we're not doing classifications, essentially,

  • but all that is to say, there's like, standard things

  • that people just do like this,

  • and Nicolo and I spent like, literally months trying

  • to figure out how to properly evaluate this,

  • because one of the problems is

  • there's a really inherent asymmetry here, right, like,

  • say, especially if you think of therapeutics,

  • but in any case, if I have a predictive model

  • and I have a guide target region,

  • and it's truly, it's active there, like,

  • it's going to be active and it's going to disrupt things,

  • but my model says that it's not active,

  • if that's the kind of mistake it makes,

  • then that's a really bad mistake, right,

  • I'm gonna use that guide and something bad's gonna happen,

  • like, someone's wet lab experiment's kind of messed up,

  • or I'm gonna kill someone.

  • And on the flip side, if my model,

  • if there's some region, (mumbles) some target region

  • that it's not active, but my model makes a mistake

  • and says it is active, like, well, that kind of sucks,

  • 'cause I'm not gonna use a guide

  • that might have been a good guide,

  • but there's not some sort of devastating consequence,

  • right, and so there's this inherent asymmetry,

  • and we wanted to take that into account

  • when we're evaluating things,

  • because we want to evaluate in a way

  • that's most meaningful for what people care about,

  • and so we ended up, you know, we're aware of this asymmetry,

  • but we also didn't know, you know,

  • where in this sort of asymmetry space do we lie,

  • how do we kind of weight those two extremes?

  • And so, in the end, this is what we decided to do.

  • We used a regular Spearman correlation here,

  • and this just turns out to be easier to visualize this way,

  • we basically, this is a relative plot

  • where this is an improvement over CCTop,

  • which is a, sort of an older model out there,

  • so it's the baseline, and then you can see,

  • basically, from the Broad, how well they do,

  • and let me explain what this knob is in a second.

  • But, so this is just Spearman R, oh,

  • this got moved around that block, well,

  • whatever, it's fine.

  • And now what happens is on the other extreme end,

  • over here, what is this?

  • It's also a Spearman, but it's a weighted Spearman,

  • so now each guide target pair is weighted

  • before I compute the Spearman,

  • and in particular, it's weighted by the activity.

  • So if a guide target pair is not active,

  • it falls out altogether, and that's

  • because of this asymmetry.

  • So that's the most extreme case

  • of the asymmetry I just described,

  • this is, if you're not active,

  • I don't even care if I get you wrong,

  • and what you can do is you can turn a knob

  • that smoothly varies from one end to the other,

  • and that's what this is.

  • It's a little complicated, the details I won't go into,

  • but you can see, it varies smoothly,

  • it's not doing something crazy, and so the,

  • and what's really nice is there's no theoretical reason

  • for this, but almost invariably, when we draw these plots,

  • you find that one model systematically lies

  • above another model, and that's nice,

  • because it means that no matter where you are

  • in this regime, you should use like,

  • the one that falls on top of the other one,

  • that dominates it, and because it's actually a bit hard

  • to conceptually understand where you are in this space.

  • It's like I, we designed this, and I find it hard,

  • a bit hard to understand, as well,

  • but at the same time, I think it's a very useful metric.
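
My reconstruction of the knob being described, as a weighted Spearman: rank both truth and predictions, weight each pair by a power of its true activity, and compute a weighted Pearson on the ranks. Exponent 0 gives the ordinary Spearman; larger exponents progressively drop inactive pairs. This is illustrative, not the exact published metric, and the numbers are made up.

```python
import numpy as np
from scipy import stats

def weighted_pearson(x, y, w):
    """Pearson correlation with per-observation weights."""
    w = np.asarray(w, dtype=float)
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov = np.average((x - mx) * (y - my), weights=w)
    var_x = np.average((x - mx) ** 2, weights=w)
    var_y = np.average((y - my) ** 2, weights=w)
    return cov / np.sqrt(var_x * var_y)

def knobbed_spearman(truth, pred, knob):
    """knob = 0 recovers the ordinary Spearman; larger values up-weight active
    pairs until, at the extreme, inactive pairs fall out of the computation."""
    truth = np.asarray(truth, dtype=float)
    ranks_t = stats.rankdata(truth)
    ranks_p = stats.rankdata(pred)
    weights = truth ** knob  # 0**0 == 1 in numpy, so knob=0 weights everything equally
    return weighted_pearson(ranks_t, ranks_p, weights)

truth = np.array([0.0, 0.0, 0.1, 0.4, 0.9, 0.7])
pred = np.array([0.2, 0.1, 0.3, 0.5, 0.8, 0.6])
for knob in (0.0, 1.0, 3.0):
    print(knob, knobbed_spearman(truth, pred, knob))
```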

  • And so you can see in this case

  • when we train, we always train on CD33,

  • and we never test on it, it's always the basis

  • of the single mismatch model,

  • and CD33 doesn't appear anywhere

  • in any of the other data, as far as I know.

  • And then here, we also add in the GUIDE-seq,

  • which has the multiple mismatches to do the second level,

  • and then we test on the Haeussler et al

  • that does not include GUIDE-seq,

  • and that's the relative performance of the different models.

  • Does the figure make sense in the motivation,

  • even if it's maybe not the kind

  • of thing you're used to seeing?

  • And we're not used to seeing it either.

  • And then you can flip it around,

  • we're now, again, we always use the CD33,

  • but now we use the Haeussler combined data set,

  • and we test on the GUIDE-seq,

  • and you see a similar kind of pattern.

  • And again, we can drill down a little bit,

  • and this is actually just

  • from the CD33 single mismatch model, oh, yeah?

  • Yeah, yeah, yeah.

  • - [Student] Just thinking about this,

  • in the previous slide.

  • I'm trying to think, does that,

  • on a 50/50 chance of getting this right--

  • - Yeah.

  • - [Student] Either

  • it looks like

  • GUIDE-seq helps Haeussler better

  • than Haeussler helps GUIDE-seq, is?

  • - Yeah, it's a little bit hard to, so let's see.

  • That looks like that.

  • Remember, these are improvements over CCTop,

  • and actually in the pre-print and in the paper,

  • it tells you the actual correlations,

  • which are pretty low, I forget now,

  • but I think they're maybe around 10 or 20%,

  • I don't, don't hold me to that.

  • The difficulty is that the number of active guides

  • in each of these, too, is very different,

  • the GUIDE-seq has about three times as many active guides,

  • so in some sense, it has a better chance

  • of building a good model that,

  • and so you have to be careful about things like that.

  • - [Student] That's an issue for you all the way along,

  • is what's the quality of the data?

  • - Absolutely, right, definitely.

  • And so one of the decisions we have to make is like,

  • when we're trying to evaluate how well things do,

  • we want to flip things around as many ways as possible

  • to show the underlying sort

  • of modeling approaches are solid, right,

  • that if you just tweak this or that, you still get this,

  • which is why we do both of these, but at the end of the day,

  • we need to say we're going to deploy a model,

  • do we train it on just the GUIDE-seq,

  • do we train it on the GUIDE-seq and the Haeussler,

  • and, you know, so this is partly like,

  • just an intuition we have by looking at things,

  • and we talked to John, and in the end, like,

  • the final deployed model is just trained on GUIDE-seq,

  • actually, plus the CD33 right now, but again, we have a,

  • we're trying to get a better one going, as well.

  • But yeah, absolutely, and I think in the on-target paper,

  • we actually have like, a matrix

  • where it says if you train on FC,

  • the flow cytometry, and test on the drug resistance,

  • how well does that generalize,

  • and like, we try to always do these kinds of things,

  • and Haeussler in his review paper has a super nice,

  • he did a very expansive set of comparisons

  • where he, across many datasets,

  • where he does this kind of thing,

  • if you take their model but you test on their data,

  • and all this kind of thing,

  • so there's kind of no end to this,

  • so you do as much as you can to think

  • that, you know, what you've done is solid.

  • - [Student] I guess on the previous slide

  • it was just a little, in the previous slide

  • I was just a little confused why you lose the difference

  • as you move towards the weight end--

  • - Yeah.

  • - [Student] Your activity in the 10 negative--

  • - So actually, we've, since, again, these should be updated,

  • I just, I like, literally flew here last night

  • after giving a talk at UCLA yesterday

  • and I haven't had a chance to update it,

  • but in fact, one thing I'm glossing over here is

  • what we realized, actually,

  • just during the current revision is that,

  • as you turn this knob over this way,

  • the effective sample size goes down, so for example,

  • over here, I've basically turned off anything

  • that's not active, and so those are falling out

  • of the computation, and a consequence of

  • that is that this is a very high variance,

  • like that estimate is not super stable,

  • so we now actually compute effective sample sizes

  • and truncate, and I'm pretty sure on our new plots

  • that this, it actually gets truncated something like here.

  • I have to go look, I apologize that I don't have it,

  • but, and so all I can say is,

  • I don't know specifically why it takes these shapes,

  • what I do know is that as you go more and more to the left,

  • the effective sample size goes down,

  • and we should consider that, which we hadn't, up until now.
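
The effective-sample-size point can be made concrete with the standard Kish formula, n_eff = (Σw)² / Σw², assuming the weights are the same activity-based weights from the knob; the exact truncation rule they use isn't stated here.

```python
import numpy as np

def effective_sample_size(weights) -> float:
    """Kish effective sample size: (sum of weights)^2 / sum of squared weights."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (w ** 2).sum())

# As the knob turns toward the activity-weighted end, inactive pairs get ~zero
# weight, n_eff drops, and the weighted Spearman estimate gets noisier.
activity = np.array([0.0, 0.0, 0.0, 0.1, 0.4, 0.9, 0.7, 0.2])
for knob in (0.0, 1.0, 3.0):
    print(knob, effective_sample_size(activity ** knob))
```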

  • Right, and so this, again, this is just sort

  • of groups of features and how important they were.

  • And so this here, unsurprisingly, is there's a feature

  • that says, you had a T to A in position three,

  • that's what this is, mutation identity plus position.

  • So that, everything together, you tell it the position

  • and the identity, and what it went from,

  • and what it went to, so that's actually super,

  • super important in the model,

  • and then if you tell it just the mutation identity,

  • but you don't tell it the position,

  • so you just say it was a T to an A, but I don't know where,

  • then that also turns out to be helpful,

  • and if you tell it just the mutation position,

  • you say, I don't know what the mutation was,

  • but it was in position three, then that's also helpful,

  • and then this is translation versus transversion,

  • is that what it's called?

  • Transversion versus transition. (laughs)

  • Yeah, you can tell me what that is,

  • what we put it in, I guess that's,

  • what is it, like a C to, A is a what?

  • - [Student] C to T is transition,

  • T to C is a transition, everything else is a transversion.

  • - Alright, so I guess we read somewhere

  • that this might help.
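
For the record, transitions are the purine-to-purine (A↔G) and pyrimidine-to-pyrimidine (C↔T) changes, and everything else is a transversion; a tiny helper for that feature might look like this (illustrative only).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def mutation_class(ref: str, alt: str) -> str:
    """Classify a single-nucleotide change: transitions are A<->G and C<->T,
    everything else is a transversion."""
    if ref == alt:
        raise ValueError("not a mutation")
    same_family = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "transition" if same_family else "transversion"

print(mutation_class("C", "T"))  # transition
print(mutation_class("T", "G"))  # transversion
```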

  • But what's interesting is this sort

  • of drives home a point I was mentioning earlier is,

  • for every example, if I tell it the mutation identity

  • and the mutation position, but I tell it sort

  • of those two separate features,

  • then these regression trees, in principle,

  • they can combine those two things

  • and make up this feature.

  • So if I had a lot, a lot of data,

  • I wouldn't have to give it this feature,

  • but through experimentation,

  • we can see we don't have enough data

  • for it to actually do that,

  • and so that's why we give it all three of these,

  • and they all turn out to be important.

  • And so this, this is, I don't know,

  • for me, as a computational person,

  • this is interesting, I don't know if it will be for you,

  • but this is an example of like, different,

  • we wanted to show that kind of different components

  • of what we've thrown in the mix here help one

  • on top of the other and that we're not doing things

  • for no reason, and again, just for,

  • for visualization purposes, relative to CCTop.

  • So I've mentioned that,

  • so red is this final model we deployed,

  • which is the one I described to you, and let me,

  • what happens though if instead of,

  • remember I said there's a second layer model,

  • where instead of just multiplying together

  • the single mismatch things, I like,

  • I add in this extra machine learning,

  • if I don't add in that extra machine learning,

  • then I end up on this blue curve,

  • and so you can see that extra level is helping,

  • and probably if we had a lot more data,

  • it would help a lot more.

  • And now what the one below it is,

  • it says, what if we do the same thing as

  • what I just described, the blue curve,

  • but instead we actually use just the implicit features

  • from CFD, so CFD was this thing that John and the guys did,

  • and basically, they don't pitch their model

  • as a machine learning model,

  • but as a machine learning person,

  • you can look at it and you can say, "aha."

  • The feature that they used, the features they use,

  • it turns out they use one feature,

  • and the feature they used is just,

  • there was a T to A in position eight,

  • that's the feature that implicitly they're using.

  • If you use just that feature, you actually do pretty well,

  • right, like, that's actually driving a lot of the signal,

  • but again, you buy more by adding in these other features.

  • And now finally, if instead of doing regression,

  • where I'm predicting like, ranks or like,

  • real values, I actually just say,

  • "Was it active versus not?"

  • So I pick some threshold, and this is actually

  • what most other people do,

  • then you get actually terrible performance,

  • and this shows you that you're throwing away a lot

  • of information.

  • Right, so that's, so now, so far what I've showed you is,

  • I'm gonna pick a guide, I'm gonna filter using up

  • to some number of mismatches to get a shortlist,

  • and then for every potential off-target in the shortlist,

  • I've now shown you how we've trained the model

  • and how it fares compared to other models

  • that do similar things, and now I want to tell you,

  • how can we aggregate this into a single number?

  • And, you know, right, this is kind

  • of complicated in the sense that it depends

  • on your use case and all kinds of things,

  • and, but, you know,

  • I think having something is better than nothing,

  • and people can ignore it if they want,

  • but people seem to want it.

  • And so how do we take all the scores

  • that result, for this one guide from step two

  • and aggregate them into something meaningful?

  • And other people have done this, also,

  • but not with machine learning.

  • And so what we're gonna do is the following,

  • we're going to take this one guide,

  • we're gonna scan the genome for the potential off-targets,

  • and, you know, I can't remember,

  • maybe this is like, three, four, 5,000,

  • I'm not sure anymore, for each of those,

  • we're gonna apply the model I just told you about,

  • which takes the single mismatches

  • and combines them with this M function

  • into one number for that guide target,

  • as sort of something like that probability of activity,

  • let's say, and I do this for a whole list

  • of guides, sorry, of targets.

  • Of course, if I did a different guide,

  • I would have a different list and a different length

  • of list probably, as well, and now I can look,

  • I can think of this as having some distribution

  • of off-target activity, as well as how many there are,

  • right, and so now I can say, well,

  • can I do machine learning on that?

  • But now the question is,

  • what is the supervisory signal here, right?

  • Like, so before it was guide seek and things like this,

  • so what's my aggregate measure,

  • and so, again, this is not, so, you know,

  • John said, "You should try this viability data

  • "and the non-essential genes."

  • And so the idea is that if I target a non-essential gene,

  • so my desired on-target is in the non-essential gene,

  • then if I have no off-targets, then the cell will survive,

  • and it will, you know, it'll be viable,

  • and to the extent that there are off-targets,

  • it may not be viable, and so,

  • and so the question is, why is it not viable?

  • And so the reviewers have complained a little bit

  • and said, "Well, really what that assay is capturing,

  • "is it's not capturing some sort

  • "of sum total effect of off-target activity,

  • "it's basically capturing,

  • "was one of your off-targets essential or not?"

  • And then the reviewer said,

  • "There's actually so few essential genes

  • "that basically like, this is meaningless,

  • "because how can you get anything?"

  • And so it's true that if you hit an essential gene

  • as an off-target, it will die,

  • but it's also true that there are so few of those

  • that this is not what's driving it,

  • and there's now actually three different papers out,

  • I don't, did I add them, the citations,

  • there's three papers out that show,

  • probably by people in this room, I don't know,

  • that sort of the more cuts you get,

  • the more likely the cell is to die,

  • and we now have experiments that show

  • that that is, well, rather, we have experiments

  • that show it's not knocking out essential genes

  • that's driving this experiment.

  • Like, we can show that because we can actually insert

  • that information in and show that it doesn't help any,

  • and that's not surprising, because there are actually

  • so few, I think it covers like,

  • 2% or less of the genome, sort of, by sequence.

  • Right, and so we use that as our supervisory signal,

  • and then we again, play this game

  • where we partition the data, we train,

  • and then I say, "How well does it do?"

  • But again, there's very, very little data

  • to do that, and so we use, again,

  • a very simple model, which is called,

  • it's based on linear regression.

  • And when you do that, oh yeah, right, so.

  • Right, so we do this with two data sets,

  • Avana and GeCKO, and again, just on the non-essential genes,

  • and in the, I forgot to say, so we have this distribution,

  • so what kind of features do we use from this distribution?

  • So I think we have sort of the number of things in it,

  • and then we somehow will compute things,

  • like, what's the average, what are the quartiles,

  • and then we might say,

  • "What are the average in just the genic off-targets,

  • "the average score from our model?"

  • And things like this, so it's incorporating

  • what's genic versus not,

  • and we can put in the essentiality there,

  • which is not helping, and then something

  • also the reviewers have asked for is like,

  • what if you, in this whole, you know, off-target thing,

  • you incorporate the chromatin accessibility?

  • And so that turns out to be very difficult,

  • it actually does help, but it's really hard

  • to get the right cell type, and so even though it helps,

  • we're not right now putting it in our model,

  • just because for almost all the data we have,

  • we can't even get chromatin accessibility data.

  • So in principle, that would be nice,

  • but it's a bit too early to do that, I would say.
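
A sketch of the aggregation step as described: summarize one guide's shortlist of per-site off-target scores into a few fixed-length features (count, mean, quartiles, genic-only mean) and regress those against the viability signal with a simple linear model. The feature names and toy data below are placeholders, not the deployed pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def aggregate_features(site_scores, is_genic):
    """Summarize one guide's per-site off-target scores into a fixed-length
    vector: count, mean, quartiles, and the mean over genic sites only."""
    s = np.asarray(site_scores, dtype=float)
    g = np.asarray(is_genic, dtype=bool)
    q25, q50, q75 = np.percentile(s, [25, 50, 75])
    genic_mean = s[g].mean() if g.any() else 0.0
    return np.array([len(s), s.mean(), q25, q50, q75, genic_mean])

# Toy training set: each guide gets a variable-length shortlist of site scores,
# a genic/non-genic flag per site, and a placeholder viability readout.
rng = np.random.default_rng(1)
rows = []
for _ in range(30):
    n_sites = int(rng.integers(5, 50))
    scores = rng.random(n_sites)
    genic = rng.random(n_sites) > 0.5
    rows.append(aggregate_features(scores, genic))
X = np.stack(rows)
y = rng.random(30)  # placeholder viability per guide (non-essential gene targets)

model = LinearRegression().fit(X, y)
print(model.coef_)
```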

  • Right, and so when you do this,

  • and now, so on this dataset, in particular,

  • it's pretty low, this final aggregation score,

  • but it's certainly, it's actually highly significant,

  • and you can say that this is our aggregation model,

  • and I didn't say this explicitly,

  • but to the extent we do well on aggregating means

  • that we also did well on the individual score,

  • guide, target scoring, because if we did wrong on that,

  • then you would have no hope of doing the aggregations.

  • So aggregation is some sort of summary measure,

  • in a sense, of how well we did on those,

  • plus how well we combine them together,

  • and if you did either of those things wrong,

  • so if the previous model I told you

  • about was just like, bogus,

  • you would get nothing out of this.

  • And then CFD is actually John's way of combining stuff,

  • it shouldn't say website, it's not even available

  • on the website, it's just in the paper,

  • and then this is from the MIT website.

  • And then you can flip it around again,

  • so this was, right, train on Avana, test on GeCKO,

  • and then you can flip it around,

  • and you see, actually, right now,

  • the correlations go up a lot more,

  • so you get stuff more around .15,

  • then you see a similar ordering of things.

  • - [Student] How long until MIT takes the website down?

  • - I don't know, is anyone in the room from that group?

  • No, okay, I'm curious to hear if they,

  • I mean, yeah, we should talk to them.

  • Right, and so we want to put this all together now,

  • and so this, this is very cumbersome,

  • so if you look at the available online tools

  • for off-target activity,

  • the way they work now is you request some information,

  • say, about some gene, and then you wait like,

  • five, 10 minutes, and it like, emails you something

  • and it's sort of, it's a bit, it was annoying

  • for us to do evaluations, like, we had to get,

  • you know, someone to program something that would ping it,

  • wouldn't ping it too much and would wait,

  • and then stuff would time out,

  • and then these kinds of things.

  • And we wanted something where people would just go,

  • and it was actually pre-computed,

  • and because I work at Microsoft,

  • we have a whole lot of computational resources,

  • and so it's kind of fun, sometimes,

  • to make use of them, so we decided we wanted to,

  • and actually, when we first told John we were gonna do this,

  • he was like, "That's impossible, you can't do that."

  • And so we said, "We want to pre-populate,"

  • for starters, now, it's just the human genome,

  • for every coding region, for Cas9,

  • every possible on-target and off-target score,

  • as well as be able to drill down

  • into all those off-target scores,

  • and so we actually, for the first revision,

  • we populated that, although it wasn't,

  • for the first submission, and we're now tweaking it,

  • but that takes roughly, if you have,

  • we ran for about three weeks straight on 17,000 cores,

  • so which, probably no one here could do

  • that with wherever you are, so it's kind of,

  • sometimes it's fun to be at Microsoft

  • to do that kind of stuff.

  • And so we're building this site,

  • and we've tweaked it now a bit based

  • on the revision, but basically,

  • there's gonna be different ways you can search,

  • but based on gene, based on transcript,

  • and then it'll search through anything that satisfies Cas9,

  • and then you can like, click on the guide

  • and drill down into the off-target activity,

  • and we'll give you some sort of basic information,

  • like, was this off-target in a gene or not?

  • We'll also give you the position in chromosome

  • in case you want to like, look up other things.

  • And, yeah, so that's unfortunately not live right now

  • because one, we've been modifying things

  • and we're now repopulating it,

  • and then once the revisions end,

  • like, hopefully, and this is full,

  • then hopefully, very soon, maybe in the next few months,

  • I hope this is publicly available,

  • and if in the mean time you want to play with it,

  • we might, I have to talk to my collaborators,

  • we might be able to give some people private access.

  • Yeah, sorry.

  • (background noise drowns out student speaker)

  • Oh, you know what's really funny, is.

  • Yeah, the funny thing is, I think that,

  • I have this, after we submitted it and it came back,

  • we realized we had something flipped around,

  • and now I can't remember if like, on this example,

  • what is what anymore, and this will be made clear

  • when it's released, and I don't--

  • - [Student] It never is, on a lot of sites, (mumbles).

  • - I see.

  • (student mumbles)

  • Oh.

  • Yeah, so this has little tool tips,

  • like if you hover above the question marks,

  • you get information, and one of them will tell you that,

  • and I just don't remember, and I know we,

  • we didn't, we had it wrong, not in the sense

  • that anything in the paper was wrong,

  • just that like, something on the website was like,

  • the wrong direction, yeah.

  • And so, right, and none of this would be possible

  • without a wonderful set of collaborators.

  • And very much Nicolo, who's my computational collaborator,

  • who has an office next door to me,

  • and John at the Broad, who's just been wonderful,

  • and the three of us have a bunch

  • of projects on the go right now,

  • and very recently, we hooked up with Ben and Keith

  • to generate some more GUIDE-seq data

  • to validate these off-target models, and Michael Weinstein,

  • who, I think he actually just left UCLA,

  • he actually did all of the coding infrastructure

  • for basically that whole search like,

  • if you give me a gene or a transcript,

  • like, find all the possible regions

  • on the genome up to this number of mismatches,

  • and then he pipes that into our machine learning,

  • and so those two things work together,

  • and then some folks at Microsoft, too,

  • help with the website and the population

  • of the database and all these things.

  • So I think, yeah, that's it, thank you.

  • (audience applauds)

  • - [Student] How well does your model work

  • for other (mumbles) systems, (mumbles).

  • - We haven't yet looked at, I mean,

  • other CRISPR systems, I mean the question, like,

  • how well does it work for other CRISPR systems,

  • how well does it work for other organisms,

  • I mean, there's, and just,

  • we don't know yet, other than in Haeussler's review,

  • he did something interesting where he basically said,

  • "Our model was state of the art,

  • "as long as it was," I mean, he compared like,

  • a whole bunch of organisms, like mouse,

  • zebrafish and all of these things,

  • and the zebrafish, we did really badly on,

  • but he could show that it's not necessarily,

  • it's unlikely to be because of the organism,

  • and it's more likely the difference between in vitro

  • and in vivo, and so actually,

  • we now have privately retrained in vitro models,

  • and he's taking a look at those, so,

  • and I'm not, I almost don't even know what that means,

  • someone told me the biochemists

  • in some other community don't even use these words

  • in the same way, and so that, like,

  • someone will yell at you depending how you say them,

  • so all I know is that, like,

  • that's a dominating factor over organism,

  • and that there seems to be, like, this is just me,

  • like, you know, I can't rigorously point

  • to a picture, but sort of my memory

  • from all the things I've seen is

  • there's pretty decent generalization between organisms,

  • but I'm sure that if you have data

  • for a particular organism, that,

  • you know, probably that will help.

  • - I think one of your limitations is the

  • only information you have

  • to develop your features is the sequence

  • information, and you mentioned chromatin.

  • - Yeah.

  • - To the extent that nucleus cell location,

  • compaction, has an influence on any particular target,

  • that's completely outside--

  • - Yeah, absolutely, yeah, yeah.

  • - There are data out there,

  • but there may be more cell-specific--

  • - I see.

  • - That you'd like to have--

  • - In fact, so when Ben and Keith were about

  • to generate this new data, like,

  • we decided together, like, how to do this

  • and as a, and I said, "Can you do any of these cells

  • "that have chromatin?"

  • 'Cause none of our guide target data has chromatin data.

  • We ended up having to do it with the viability data

  • at the aggregation step, and he said like, basically,

  • basically, no, like, not because he was unfriendly

  • and uncooperative, just for whatever restriction they had,

  • like, he, even then, as we were generating new data

  • and wanted to, we couldn't there,

  • but if, I mean, if you guys know

  • where there's good data for us to use,

  • or you're generating some, and it's out at some point,

  • feel free to let us know, but yeah,

  • this is a big, I mean, a mountain of data,

  • and then these kinds of auxiliary datasets,

  • both stand to really improve these models a lot.

  • - That's quite an issue of, you may have some features

  • that you could discard that, you know,

  • the machine could discard them,

  • but you're also emphasizing features that don't--

  • - Absolutely, absolutely, undoubtedly, yeah, there is.

  • - Would you mind going back to some of the on-target,

  • just one thing that struck me that,

  • the importance of the melting temperature

  • in nucleotides 16 to 20 is more important

  • in these histograms than the melting temperature

  • at eight to 15, three to seven,

  • and if I remember correctly,

  • so is 16 to 20 next to the PAM, or is it--

  • - [Jennifer] Yes, it is.

  • - [Man With Gray Hair] Those are the ones next to the PAM,

  • so that makes sense, okay.

  • - Okay, good, phew. (laughs)

  • - So, I mean, one of the things,

  • I grew up as a nucleic acid biochemist,

  • and one of the things I would like to know is,

  • what physics features are being reflected here,

  • even to the extent the nucleotide sequence is predictive,

  • how can we understand this in terms

  • of thermodynamics, interaction with,

  • between the DNA/RNA hybrid and Cas9 protein,

  • what atomic level features are reflected?

  • - Right, and I think it is possible to go in that direction,

  • it's just that you don't get it for free

  • by doing what we did, it could also be

  • that by doing that, in fact, we somehow improve what we did,

  • because we are, have limited data

  • and maybe, like, you know, the thermodynamics,

  • in principle, if we use a rich enough model,

  • it should just compute the thermodynamics if it's necessary,

  • right, but it's obviously helpful for it,

  • that we've told it the thermodynamics.

  • So if there are other physics based things,

  • we could one, go out of our way to test

  • to see how important they are,

  • with all the caveats I've mentioned,

  • or we can add it in and say,

  • "Does it also help the model?"

  • So if you, I mean, if there are things

  • that are particularly interesting

  • to the community that we are gonna,

  • readily computable, we are, you know,

  • we can throw them in.

  • - [Student] Seems like an obvious when we look at,

  • we were looking at the crystal structure,

  • you know, two days ago, but it looks like it goes

  • to this intermediate conformation state

  • where it's testing whether it's got,

  • already matches the DNA, and it's kind of doing,

  • and so once we understand a little bit more

  • about what's happening--

  • - I feel like that's, I might be wrong,

  • but I think that's why it got broken up this way,

  • is that there was this sort of one phase

  • where it was like, feeling things out,

  • and then if it was okay, it would sort of keep going

  • to the more and more, but that is not my area,

  • but I have a big memory of that. (chuckles)

  • - [Student] When you consider the sequence,

  • I think that, because for the,

  • single guide (mumbles) scaffold,

  • if you have different sequence

  • of this scaffold, it seems different

  • that this (mumbles) can perform a single structure based

  • on space around, spacing sequence,

  • so why not make it smaller, like,

  • are you relying on single like, scaffold sequence, or--

  • - What, can you tell me what the scaffold is?

  • (student mumbles)

  • I see, and is that, I like, this is now getting,

  • like, at the limit of knowledge of the (mumbles),

  • and that's not constant across all these experiments.

  • - [Student] Oh, I think people now, like,

  • most (mumbles), the student using singulars--

  • - I see, I see, I see, okay, so probably it's the case

  • that all the scaffolding was constant here,

  • and therefore, we didn't model it,

  • but if people were changing the scaffolding,

  • then this could, and presumably it is important,

  • which is why people are changing it,

  • then absolutely, we could put that into the model, yeah.

  • - [Student] I think one thing she's getting

  • at is that the target, the spacer sequence

  • that you guys are changing can mess up folding

  • of the rest of the (mumbles),

  • so it complements with different parts of the RNA,

  • so that might be--

  • - But it's still, if you're not, even if that's the case,

  • if in the data I have, I'm not changing that,

  • then I still don't need to put it in there,

  • it should be able to just figure that out,

  • because if something's not changing,

  • it's not informative, like, by definition, yeah.

  • (slow ambient music)

(upbeat ambient music)
