Chris Cotsapas: I’d like to thank the organizers for the opportunity to come talk to you guys about what we’ve been thinking about in my lab. So, what I’m going to talk about primarily is stuff that’s going on. So, all of this is unpublished. Feel free to think about it, share it, whatever. But it’s very much work in progress. Some of it is hot off the press. So, do take it with a pinch of salt. So, what we think about a lot is autoimmune diseases in my lab. And we kind of want to think about which genes go wrong in disease, and we think about these regulatory genes. But actually what we’re interested in are the causal genes. And my pointer doesn’t work. I can use this pointer. It’s all coming up Chris today. So, we’re thinking more about causality than anything else. So, when we say dysregulation, we’re interested in pathogenesis, right? That’s ultimately what we’re after. And so, just a 30,000 foot view of the immune system. If you remember, you start with a stem cell. You have two major lineages in the immune system, the lymphoid and the myeloid lineages. So, things like macrophages are all the way down here. And your T cells and B cells are all the way down here. You can think of them as adaptive versus innate. And what happens is every now and then, this goes wrong. So, the immune system’s primary function is to protect the body from things that are foreign. And so it’s got this amazing capacity to tell the difference between your cells and the rest of the world. And it’s really good at this, but occasionally it screws up. And it kind of -- what happens is that it starts attacking certain tissues. So, if it doesn’t like myelin, you get multiple sclerosis. The immune system manages to go into the brain and attack the myelin sheath [spelled phonetically] very specifically around neurons, chew it up, and you get lesions in your brain.
You can get things like skin attacks, which give you Sjogren’s syndrome, scleroderma; you can get type 1 diabetes, which we now know is an immune disease. If it doesn’t like aspects of the GI tract, you wind up with Crohn’s disease, ulcerative colitis, or celiac disease if it doesn’t like the epithelia. Joint-specific dislikes, should we say, give you rheumatoid arthritis or ankylosing spondylitis. And if it just doesn’t like DNA, if it doesn’t like nucleic acid, it attacks everything, then you wind up with something called lupus, right? What’s really interesting is that these are very, very specific dislikes. So, MS is not rheumatoid arthritis. It’s a very specific attack against myelin. It’s not a specific attack against anything else. And what we really want to understand is what these diseases are. So, something’s going wrong with the immune system. We don’t really understand what it is. What we do know is that all of these diseases are common. They’re complex genetic diseases. There’s a large portion of heritability. They track in families. But they’re not Mendelian. It’s not one catastrophic mutation, right? And, of course, GWAS came along. I’m going to talk about multiple sclerosis, which is something that I work on. But you can take this as read for any immune disease. As GWAS came along, we hadn’t really gotten a lot of traction on the genetics of these diseases. And then we sort of barely managed to identify two loci in the genome in one of the first GWAS studies. Then a little while later, we managed to get another one. A meta-analysis of these two sets of studies from international consortia kind of gave six new hits, and we’re starting to climb this power curve of discovery. Then a further meta-analysis with more markers and a few more samples gave us an additional three new hits. Even more samples gave us another 25 new hits. The immunochip gave us 47. That took us up to 100.
And our current studies, which are about 16,000 cases, 26,000 controls and replication in another 36,000 samples, we’ve got another 100-odd new hits. So, we’re standing at around 200 loci right now in GWAS, right? That explains -- including the HLA -- it explains about 55 percent of the heritability. We estimate that in the common space there’s probably another 600 to 800 loci that we don’t know about yet. We kind of do know about them. They’re not genome-wide significant yet. But we know they’re there. And we know the approximate complexity of the disease is about 1,000 independent variants. And so, when ENCODE came along and we did -- we were a very small part of this paper from John Stam sort of showing that in Crohn’s disease and in multiple sclerosis, there is strong enrichment of the risk SNPs on regulatory regions active in very specific subsets of the immune cells. And in multiple sclerosis in particular, you can see CD3 cells, CD19s, B lymphocytes, and CD14s, which is interesting. There’s a lot of pathogenesis coming out of T cells as well. But these are more B cell like. And so, dysregulation in multiple sets of immune cells seems to be an issue here. But this kind of sends us chasing down this idea that is now extremely common. And this is one of the great right, right? So, 10 years ago GWAS wasn’t going to work. And five years ago, everyone was asking why we haven’t solved disease yet. Five years ago, everything was coding. And now, everything is regulatory. And it seems really obvious. But even two, three years ago, this was not that obvious. And so, this chases us down -- starts us chasing down this rabbit hole of which genes are getting dysregulated and how does that cause disease.
And so, that’s what we are going to talk about today -- further evidence that in specific immune cells, you get dysregulation that maps into specific transcription factor binding sites, as in this work from Kyle Farh and Brad Bernstein showing that the MS SNPs are particularly enriched for NF-kappa B transcription factor ChIP-seq peaks, for instance. And so, there’s fairly specific dysregulation in immune cells, which is great in bulk, hard when you actually want to identify specific effects on specific genes in specific cells. And so, that’s the task at hand. And so, when you look at some of the loci, you know, you put up a GWAS locus. Here’s a classic locus in MS. Well, there’s NF-kappa B one and mannose-binding protein A. And you could sort of make a case for mannose-binding protein A, but really everyone’s going to assume that NF-kappa B one is the appropriate gene. And it turns out that that’s right for various reasons. And so you can start working on that because you kind of are reasonably sure that’s the gene. When you look at another locus of course, that gets a lot more difficult. You’ve got this big association peak. There’s a bunch of genes in here, and the problem isn’t that they’re not good candidates. There’s a bunch of good candidates in here. ORMDL3 is here. IKZF3, which is Helios, which is a transcription factor that controls T regulatory cell differentiation is there. A bunch of other immune cells. And so, you’re kind of going, “What’s going on here?” So, we kind of thought, “Okay. If there is regulation, and we have SNPs, how do we unite the genetics with the epigenomics?” And a lot of people are thinking about this. You’re going to hear a lot more stories about this. You’ve already heard some. Here’s how we’ve been thinking about it. So, we’re kind of amateur math geeks, and so we’re thinking about how we can transfer some of this probability and do some functional fine mapping. So, you have a set of SNPs in the genome.
We’re going to talk about hypersensitive sites now. But instead of DHS, you can think of any regulatory mark. We’ve been working a lot with hypersensitive sites because we like them. They’re stable. They’re nice. They tell you a lot. We’re going to expand this to the other sets. But think about DHS for now. And you’ve a gene in the locus. So, this is my, like, tiered view of a locus. So, each of these guys is associated to disease. And -- oh, this is going to chop off my -- thanks. Oh well. So, what that says is posterior probability of association or PPA, okay? So, when you do a GWAS for each of these SNPs, you get a P value of whether it’s associated to disease or not. You can convert that simple P value into basically a posterior probability which tells you, what is the likelihood that this SNP is the one driving the signal, okay? We’re not going to talk about the math magic that underlies that. I’ll bore you with it in person over a coffee if you like. But basically, for each of these SNPs, you can do a magical transformation and get the probability that that’s the SNP that’s driving the signal. If it’s very associated, and nothing else is associated, it’s going to be really probable to drive the signal. If there’s a whole bunch of SNPs that are equally associated, you’re going to have to spread the probability that it’s causal over all of those guys, right? That’s the intuition here. So, of course, some of these SNPs are actually on DHSs. And so, you can transfer that probability. I can’t even talk anymore, sorry. That probability to the DHS. You could also do something fancy like say this guy is about this far away from this DHS, so I’m going to give it some proportion here. That’s -- we’re not doing that right now. But basically, what I can do is come up with a way to score every regulatory region for its probability of explaining the association in that region, right?
And if I sum every one of those -- of course not every SNP is on those -- but if I sum all of these posteriors, that gives me the global probability that, in this locus, association is mediated by these regulatory regions. Doesn’t have to be all of it. But if most of the signal is on DHSs, then you’re going to get a high percentage, right? It’s going to be close to one. If it doesn’t look like it’s being mediated by regulatory regions, you’re going to get a low proportion. That much is easy. What’s cool is you can think about how you correlate these guys to the genes they control. So, if I had a magic way of saying, “Well, this DHS is correlated to this gene this much,” then I can weight how much of the posterior of association gets transferred into this gene, right? So, if this guy’s perfectly correlated -- if this is what determines whether this gene is expressed -- then if this explains all of the association to a trait, then presumably, it’s active on this gene. Because the DHS isn’t just a DHS. It’s regulating something, right? So, that’s the intuition. And you partition it all this way. And what it says here is CP times PPA, okay? So, that’s just the correlation posterior between this DHS and this gene times how much weight you’ve given it from the association data. And that way, you wind up building this model of this gene posterior. So, if I sum all of these, all of the contribution of each DHS from the SNPs going into this gene, I can get a sense of what the probability that this gene is driving association in this region is. And I can do that for any gene. So, I now derive a score basically for how likely this gene is to be pathogenic, if that pathogenesis is mediated by DHS regions. And we know they’re enriched, so that’s a reasonable hypothesis, okay? It’s not the only way to do it, but it’s one way to think about this. And so, you have to solve a couple of technical problems to do this. One is, you’ve got to correlate your DHSs to your genes.
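[Editor's illustration] The scoring just described (sum the SNP posteriors that land on DHSs, then weight each DHS by CP times PPA and sum per gene) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the lab's pipeline: the SNP, DHS, and gene names and all numbers are hypothetical, each SNP is assumed to sit on at most one DHS, and the PPAs and correlation posteriors (CP) are taken as already computed.

```python
from collections import defaultdict


def regulatory_fraction(snp_ppa, snp_to_dhs):
    """Share of the locus signal on DHSs: sum of PPA over SNPs that sit on a DHS."""
    return sum(p for snp, p in snp_ppa.items() if snp in snp_to_dhs)


def dhs_scores(snp_ppa, snp_to_dhs):
    """Transfer each SNP's PPA to the DHS it overlaps (assumed: one DHS per SNP)."""
    scores = defaultdict(float)
    for snp, dhs in snp_to_dhs.items():
        scores[dhs] += snp_ppa.get(snp, 0.0)
    return dict(scores)


def gene_scores(dhs_ppa, dhs_gene_cp):
    """Gene score = sum over DHSs of CP(DHS, gene) * PPA(DHS)."""
    scores = defaultdict(float)
    for (dhs, gene), cp in dhs_gene_cp.items():
        scores[gene] += cp * dhs_ppa.get(dhs, 0.0)
    return dict(scores)


# Hypothetical toy locus: four SNPs whose PPAs sum to one.
snp_ppa = {"snp1": 0.5, "snp2": 0.25, "snp3": 0.125, "snp4": 0.125}
# snp4 is not on any DHS, so its probability is never transferred.
snp_to_dhs = {"snp1": "dhsA", "snp2": "dhsA", "snp3": "dhsB"}
# Hypothetical correlation posteriors between DHSs and genes.
dhs_gene_cp = {
    ("dhsA", "gene1"): 1.0,
    ("dhsA", "gene2"): 0.5,
    ("dhsB", "gene2"): 1.0,
}

locus_fraction = regulatory_fraction(snp_ppa, snp_to_dhs)  # 0.875
per_dhs = dhs_scores(snp_ppa, snp_to_dhs)                  # dhsA 0.75, dhsB 0.125
per_gene = gene_scores(per_dhs, dhs_gene_cp)               # gene1 0.75, gene2 0.5
```

In this toy example 87.5 percent of the signal sits on DHSs and gene1 ranks above gene2. Because one DHS can feed several genes, the gene scores need not sum to the regulatory fraction.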
And so, that’s really simple. You just observe if there’s a peak, and what the level of expression of a gene is, and then you do a correlation, on-off versus level of expression of a gene. And you do that for each DHS you find. Two issues. First of all, you’ve got to decide what the same DHS is. And secondly, you need measurements where you’ve measured both DHS and gene expression, okay? So, to do this thing, we use an alignment approach. This is what real DHS data looks like out of Hotspot. These are peaks. This is an arbitrary part of the genome and your job is to figure out which ones of these represent the same element across samples. We’re not terribly good at that as human beings. Fortunately, computers are a lot better at this than we are. So, you can put it in a clustering approach and kind of decide that these look the same; they’re a little jittered, but they kind of look similar. And then these guys are kind of the same, but you’re maybe a little less confident because there’s more spread. And these guys are kind of the same as well, but there’s even more spread, okay? And the way we do this is with Markov clustering. It’s a way to cluster stuff. There are other ways to do it. It works reasonably well. And the way you think about this -- oh, and that’s gotten chopped up as well. That’s brilliant. Okay. So, one way you might want to do this is to say, is this detectable? And so, you go into the Roadmap data, and fortunately there are replicates. And here’s my assertion. If I see a peak here in replicate one of a tissue, then I should expect to see that peak in replicate two of a tissue as well, right? Biological replication just as we do in any other experiment. Really simple. And so once I decide this is my cluster, that’s what comes out of the algorithm, you don’t just go and apply that mindlessly to data. That’s not how you do analysis, right? You check and you see what you can detect.
And of course, the wider and the sloppier this peak is, the less likely it is to be true. And so you can do a statistical test. And so, once you’ve decided what the cluster is, if there’s a peak anywhere in that cluster, you mark that sample as a one. And if there’s no peak, you mark it as a zero. If you have replicates -- the labels are somewhere over here on that wall -- you can then say, “Okay, if I see ones in both replicates, I’m going to score that tissue as a two. I’m going to score it as a one if there’s only one replicate, so if it’s discordant. And I’m going to score it as a zero if there’s none there.” And then you can do a test. So, I’ve done this without knowing about replicates. And then I add the information about what goes with what and I ask, “Are they consistent?” So, if I get things like “Look, in cell type one, I get a one. And in two, I get a one. I get all ones.” That suggests this isn’t consistent. It’s not replicating. And if I get a lot of twos, a lot of zeros, and very few ones, that looks consistent. So, it’s replicating. It’s either not there or it’s there. And so I can do a statistical test. It’s not terribly important what the test is. It’s a simple chi-square approximation. We do this over 57 tissue replicates. So, from Roadmap. And we find that just feeding this in when we cluster, we can get about a million out of 1.99 million. So, about 54 percent of our clusters pass our fairly stringent threshold -- a fairly lenient threshold. And that’s because very often these things are kind of diffuse. The clusters don’t really look good. And so, we’re probably not doing great at the clustering, and it’s unreliable, right? There’s also a bunch of singletons in these data that get thrown out because they don’t replicate. But most of this is actually the clustering. So, we can get about a million features about the genome. And we don’t worry about recovering more stuff and improving the clustering.
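[Editor's illustration] The 0/1/2 replicate scoring can be sketched as below. The talk only says the test is "a simple chi-square approximation", so the specific test here is an assumption: observed counts of zeros, ones, and twos are compared with what two independent replicates would give at the same overall peak frequency (one degree of freedom). The function names are hypothetical.

```python
import math


def tissue_scores(rep1, rep2):
    """Per-tissue score: 2 if both replicates have a peak, 1 if discordant, 0 if neither."""
    return [a + b for a, b in zip(rep1, rep2)]


def replicate_consistency(scores):
    """Chi-square test of the 0/1/2 counts against independent replicates.

    Assumed null: each replicate carries a peak independently with frequency p,
    so the expected shares are (1-p)^2, 2p(1-p), p^2. Returns the statistic,
    a 1-df p-value, and whether discordant tissues are rarer than expected
    (the direction that indicates a replicating cluster).
    """
    n = len(scores)
    observed = [scores.count(k) for k in (0, 1, 2)]
    p = sum(scores) / (2 * n)
    if p == 0.0 or p == 1.0:
        # Peak never or always present: no discordance is possible to test.
        return 0.0, 1.0, False
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 degree of freedom.
    pval = math.erfc(math.sqrt(chi2 / 2))
    fewer_discordant = observed[1] < expected[1]
    return chi2, pval, fewer_discordant


# Replicating cluster: tissues are either clearly on (2) or clearly off (0).
chi2, pval, fewer_discordant = replicate_consistency([2] * 10 + [0] * 10)

# Non-replicating cluster: every tissue is discordant between its replicates.
_, _, fewer_all_ones = replicate_consistency([1] * 20)
```

Both toy cases deviate strongly from independence, so the direction flag matters: only the first has fewer discordant tissues than expected, which is the "lots of twos and zeros, very few ones" pattern described in the talk.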
Right now, we’re just working with these million. So, the other thing is, you’ve got relatively low power. And so, what’s nice about this is this -- what you can clearly read here -- what you can do is estimate how much of the heritability you’re still explaining. So, this is just a sanity check. If you use all of these clusters, it’s about 14 percent of the genome, and it explains a proportion of heritability. And what I want to know is if I reduce this to the half of the clusters that I’m using now, what proportion of heritability am I still explaining? And to a first approximation, what you can see here is in red is all the peaks and in blue is just the clusters that we define. Pretty much we’re capturing all of the signal. There’s wiggle room. There’s a little bit of error on these things, but we’re capturing just about all of the heritability. But we’ve gone from 14 percent of the genome to 8 percent of the genome. So, rather than do the 500 base pairs either side, which is what most of the previous heritability estimates have done, which a lot of the summary papers have kind of shown, “Oh, there’s enrichment in DHS or in regulatory regions or whatever.” But they actually bracket each feature by 500 bases. And so, they cover 50 percent of the genome. So yes, all of the heritability is explained by 50 percent of the genome. I’m telling you that a lot of the heritability’s explained by eight percent of the genome. So, it’s a lot more specific. And so, the second challenge is to now correlate these guys, now that we’ve decided what clusters are, to correlate them to gene expression. So, you need matched data. We use 22 sets of matched DHS and exon array data from Roadmap again. And the problem is, there’s massive inflation because gene expression data of course is highly correlated. And so you just get this massive inflation in the expected distribution of these tests. And we can correct this.
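[Editor's illustration] Inflation of this kind is commonly summarized by an inflation factor, lambda: the median of the observed chi-square statistics divided by the median expected under the null. The talk does not say how the normalization was done, so the genomic-control style rescaling below is an assumption, shown only to make the idea concrete. It assumes 1-degree-of-freedom chi-square statistics, and the function names are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Lambda: observed median statistic over the null median (about 1 if uninflated)."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Divide every statistic by lambda; leave the data alone if lambda is below 1."""
    lam = max(inflation_factor(chi2_stats), 1.0)
    return [s / lam for s in chi2_stats]


# Toy statistics whose median is twice the null median, so lambda is about 2.
stats = [0.1, 0.9098728, 5.0]
lam_before = inflation_factor(stats)
lam_after = inflation_factor(genomic_control(stats))
```

After rescaling, the median statistic matches the null median, which corresponds to the "nice straight line" on a quantile-quantile plot.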
We just go through and normalize it and basically, you kind of start off with this massive inflation. I’m showing you lambda here. It’s supposed to be a nice straight line here. And we can correct all of that out. So, now that we have all of these statistics, we can go back and do our little approach. So, now we have this part. We already have this part from credible interval set mapping and posterior estimation. And we can now estimate gene-wide scores. And so, big red exclamation point here you can see means this is really fresh, as in last Friday’s results. Hot off the presses. Here is a region. It -- we’re talking about MS GWAS. This is actually the immunochip data from 2013. Chromosome six. One megabase region. And I’m doing this for all of the genes in the region. DHSs explain 94.5 percent of the signal in the region. So, whatever it is, it is really, really likely to be acting through a DHS, right? MDN1, which is one of the genes in the region, explained 55.5 percent, not of this 94 percent, but of the total signal in the region. That’s how it feeds through, right? So, that’s what I’m doing. So, BACH2 is 16 percent. Between these two genes, you’d be hard-pressed to say that any of these other genes are really sort of pushing the signal, but it’s probably this one. So, this is a way to prioritize genes based on regulatory potential. Now, it’s really important to look at this number as well. If this number is low, you kind of think, “Well, it’s not really likely to be regulatory in the first place.” If it is, it’s going to be one of these guys, but it’s not likely to be. In this case, it’s really likely to be, right? So, if you look at another region, this region that I showed you before, the IKZF3/ORMDL3 region. Ah, brilliant. They’re chopped off. Okay, that read 0.029, 0.022, and they’re ranked and it goes down from there, okay?
So, you’ll see that in this region, about 30 percent of the association signal is explained by DHS clusters as we’ve defined them, okay? So, it’s not a lot of it. That 30 percent is now basically smeared over a whole bunch of genes. There’s no one gene that explains that signal. So, even if you accept that I’m willing to take this 30 percent as a gamble, there’s no one obvious gene when you look at it. And the reason for that is actually that we suspect that what’s going on in this region is there’s an entire element -- sort of something like an accessibility element. Some people call them super enhancers. They can mean different things to different folks. What we suspect is going on is there’s an element that sets whether the locus is accessible or not accessible, and that affects the transcription of multiple genes. And so, what the effect may be is actually that you’re changing whether this entire locus is available or not available. And there’s a whole bunch of genes in there that then do different things and set a risk state or multiple risk states. And so, sadly it’s not always one locus one gene, but these are probably going to be really interesting. It’s unclear whether we can solve such loci. But they’re going to be really interesting. So, you’re going to get examples like this. And it doesn’t work all the time. This approach won’t work all the time because not all loci are simple one-gene things. So we’re going to have to think harder about these. So, I’m going to switch gears in the dying moments and just give you another flavor of how we’re thinking about the other way around of epigenomics. So, so far we’ve talked about how to analyze these data and make inferences so we can then go and work on certain genes. But what we really want to know at some stage, if changes to gene regulation are what is creating disease states in the immune system -- well, you’re not born with an immune disease.
Most of these diseases occur in the third, even fourth decade of life. So, what’s the risk state in the immune system that predisposes you to disease? That’s a hard question. That’s a really hard question. So, I told you before that you can see in multiple sclerosis a fair enrichment in NF-kappa B binding sites that are near associated SNPs, okay? So, there seems to be something about NF-kappa B. And I also told you there’s an NF-kappa B one locus that harbors a lot of -- a very strong association. So, when you look at MS patients versus controls, if you look at CD4 cells, you find that in response to stimulus, in response to TNF-alpha, ex vivo CD4 cells actually signal much more strongly through NF-kappa B. And this is a measure of phosphorylation of P65, which is one of the NF-kappa B subunits, okay? If you look at how inducible CD4 cells are, how easy it is to activate CD4 cells from MS patients versus controls, you find that -- these are controls; the black circles, the filled ones, are MS patients. You find that in general the CD4 cells are easier to activate through NF-kappa B. You can just hit them, and they’ll go. Correlation’s not causation. This could be an epiphenomenon of disease state. And so, what we did is we took this NF-kappa B one locus and we stratified people by genotype there. There’s no implication of causality for the SNP we used. It’s actually one of these really haplotypes that identical. We just used it to stratify risk, non-risk. And we’re looking at opposite homozygotes. And so, when you look at the three genotype classes without stimulation, this is your baseline -- I’m sorry it’s chopped off again -- but this is your baseline I-kappa B degradation. So, I-kappa B gets degraded when NF-kappa B signaling starts. And what you see is a baseline that’s 100 percent. And by genotype, you find that there’s a difference in how strong NF-kappa B -- I-kappa B degradation is, suggesting that there’s different amounts of signaling going on in these cells.
If you do the obverse and look at the phosphorylation of the P65 subunit, again you see the same sort of thing, that this GG which is the risk state over-phosphorylates compared to the other genotypes, suggesting that there’s more signaling through NF-kappa B happening per unit activation. That’s kind of interesting, but actually if you look at the expression by western -- so, this is protein expression -- if you try and quantitate how much P50, which is an NF-kappa B subunit, you’re seeing, you see like a 20 fold increase in how much P50 exists. Just at baseline in these cells. What’s really interesting is that after activation, if you measure nuclear localization of phosphorylated NF-kappa B, you see that there’s about a threefold change, with the GG risk homozygote putting a lot of phosphorylated NF-kappa B into the nucleus following stimulus in CD4 cells, compared to the A. And so, what it looks like, for a given dose of stimulus, if you have the risk genotype, you signal a lot more, which probably does two things. It probably decreases the activation threshold to kick these cells over into an activated state. And it may also smear the phenotype that you see, because there’s so much transcription factor going into the nucleus that it’s activating everything, right? And we’ll talk about that in a second. This is not quite as simple as just a single effect at the NF-kappa B one locus. If you look at the TNF receptors, there are two subunits. There’s a variant in the first subunit, TNFR 1, now called TNFRSF1A. There’s a coding variant where, if you hit cells with TNF-alpha, you get a different amount of signaling through the TNF receptor, which leads to a different amount of phosphorylation of NF-kappa B. Again, you’re getting a different amount of signaling. I won’t belabor this. It also turns out there’s a whole bunch of other genes in the MS risk loci that are directly related to the NF-kappa B signaling.
And so, I suspect one of the things that’s happening here is you’re getting this global effect on NF-kappa B signaling. It’s not as simple as just a linear effect. But there are multiple things that feed into NF-kappa B signaling at least in CD4 cells, maybe in multiple other subunits, that are kind of really setting the rheostat of how the immune system responds. And maybe that’s how -- partly how risk is determined. And so -- oh great, these are chopped off as well. So, here’s the model, right? Sort of with external stimulus, you get phosphorylation of NF-kappa B. NF-kappa B translocates to the nucleus and it does what it does best. It activates a bunch of its targets. It activates transcription and that leads to activation, proliferation, and survival of these genes. Here’s what happens when you change this. If you increase phosphorylation, you’re going to get more NF-kappa B going into the nucleus. That’s probably going to activate its targets more easily. There’s probably spillover, right? NF-kappa B only activates a subset of its targets in any one given cell, or cell type. It’s got a bunch of other targets, which it doesn’t activate because the cofactors aren’t there, right? Transcriptional activation is a multi-cofactor process. If you’ve got enough excess, even though the kinetics of these promoters are bad, there’s going to be shuttling on those promoters, and you’re going to get leaky transcription. And so, this I believe. This is an assertion at this stage, right? But I think you’re going to get context-inappropriate gene activation as a result of just putting a lot more of NF-kappa B into the nucleus. I showed you before, or right at the beginning, that there’s also risk variation that localizes close to NF-kappa B binding sites in the genome. So at promoters where those variants exist, you’re going to get differential activation of those promoters in a way that’s probably unrelated to the total amount of NF-kappa B.
But that’s an additional modulation. And so, here’s what you can do. You can take a bunch of cells, take people who are risk variant homozygotes, and people who are non-risk homozygotes. So, people who will have different amounts of NF-kappa B in the nucleus. Hit them with TNF. In 15 minutes, you get signaling, so you measure how much phosphorylation you’re getting. In 30 minutes, you get translocation, so you can actually do ChIP-seq, and see where NF-kappa B is going in the nucleus. Within two hours, you get gene activation in CD4 cells, so you can do enhancer mapping and RNA-seq, and see what’s changing in the regulation between these two groups. And within three days, you can get cell phenotype by producing the full activation stimulus and you can measure that by flow and actually see what these cells are doing. So, these are the level of experiments that you really need to do. And this is what we’re doing right now to actually see what the differential risk states are in various T cells. And I’m way overtime, so I’m going to stop. And I will just acknowledge a bunch of my colleagues at the IMSGC, the International Multiple Sclerosis Genetics Consortium, a lot of partners including Brad and John, where we do a lot of these genomics things, and people in my lab. Most of the causal mapping is from Parisa, a post-doc in my lab. All of the immunology is from Will Housley, who is a fellow with David Hafler and with me. And I will stop and take a couple of questions if I’m allowed. Thank you very much. [applause]
Male Speaker: Great work. So quick question with regards to -- maybe you’ve seen Peter Schetury [spelled phonetically] describe MEVs a couple of years ago. So, multiple enhancer variants, where there’s multiple DHS or whatever --
Chris Cotsapas: Right, right.
Male Speaker: -- measurement. And there’s a wide range there. You can have MEVs with only two DHS, with three, with four, with five. So, I’m just wondering how you take that into account in your pipeline.
Are the genes that you find at the top of your list biased toward risk loci that have a high MEV nature as opposed to those that have a lower MEV nature? And the second thing, there’s a large collection of risk loci that actually are singletons. They will only have like one DHS site in them. How are you taking those into account?
Chris Cotsapas: Right, so this is not about a single peak. This is about whether the peak is consistent across cell types, right? If there is only one peak in the entire collection of samples, and you don’t see it in the replicate, we throw it out. It’s a singleton in one sample. If I have two CD4s, and I only see a peak in one CD4 and never in any other cell type, I’m throwing that one out.
Male Speaker: Right. So, that’s going to be true for one DHS site. But -- an MEV would have two of those or three of those.
Chris Cotsapas: Sure. So, what we’re not doing is a combinatorics of the clusters yet. We’re basically not thinking about MEVs. But the correlation should still be there. It should be multiple of them correlating. So that gets naturally taken into the correlation towards the gene, because all three of those DHSs should be correlated.
Male Speaker: Right. They might not go to the same genes. So, you might have one risk locus, multiple genes regulated differently by different subsets of their DHSs.
Chris Cotsapas: Right. But if they’re regulating different genes, then the risks that they’re imparting should only go to the genes that they regulate.
Male Speaker: Yes.
Chris Cotsapas: Does that make sense?
Male Speaker: Yes.
Chris Cotsapas: Because you’re trying to figure out which gene is being altered by whatever the risk effect is. And so, if DHS one is correlated to gene three, I don’t care about transmitting its risk quotient to gene two. Because it’s not -- there’s no evidence that it’s controlling it.
Male Speaker: Great. [applause] [end of transcript]
Identifying Dysregulated Genes in Autoimmune Disease - Chris Cotsapas (posted on 2015/11/04)