MALE SPEAKER: Welcome, everybody, to one more Authors at Google Talk. Today, our guest speaker is Pedro Domingos, whose new book is called "The Master Algorithm." We have it here and you can buy copies outside. So one definition of machine learning is "the automation of discovery." Our guest, Pedro Domingos, is at the very forefront of the search for the master algorithm, a universal learner capable of deriving all knowledge, past, present and future, from data. Pedro Domingos is a professor of Computer Science and Engineering at the University of Washington. He's the co-founder of the International Machine Learning Society. Pedro received his MS in Electrical Engineering and Computer Science from IST in Lisbon, and his Master of Science and PhD in Information and Computer Science from the University of California at Irvine. He spent two years as an assistant professor at IST before joining the faculty of the University of Washington in 1999. Pedro is the author or co-author of over 200 technical publications in machine learning, data mining, and other areas. He is the winner of the SIGKDD Innovation Award, the highest honor in data science. He's an AAAI Fellow and has received the Sloan Fellowship, the NSF CAREER Award, a Fulbright Scholarship, an IBM Faculty Award, several best paper awards, and other distinctions. He's a member of the editorial board of "The Machine Learning Journal." Please join me in welcoming Pedro, today, to Google. [APPLAUSE] PEDRO DOMINGOS: Thank you. Let me start with a very simple question-- where does knowledge come from? Until very recently, it came from just three sources: number one, evolution-- that's the knowledge that's encoded in your DNA-- number two, experience-- that's the knowledge that's encoded in your neurons-- and number three, culture, which is the knowledge you acquire by talking with other people, reading books, and so on. And everything that we do, right, everything that we are basically comes from these three sources of knowledge. Now what's quite extraordinary is that, only recently, there's a fourth source of knowledge on the planet. And that's computers. There's more and more knowledge now that comes from computers, that is discovered by computers. And this is as big a change as the emergence of each of the first three was. Like evolution, right, well, that's life on earth. It's the product of evolution. Experience is what distinguishes us mammals from insects. And culture is what makes humans what we are and as successful as we are. Notice, also, that each of these forms of knowledge discovery is orders of magnitude faster than the previous one and discovers orders of magnitude more knowledge. And indeed, the same thing is true of computers. Computers can discover knowledge orders of magnitude faster than any of these things that went before and that co-exist with them, and orders of magnitude more knowledge in the same amount of time. In fact, Yann LeCun says that "most of the knowledge in the world in the future is going to be extracted by machines and will reside in machines." So this is a major change that, I think, is not just for us computer scientists to know about and deal with, it's actually something that everybody needs to understand. So how do computers discover new knowledge? This is, of course, the province of machine learning. And in a way, what I'm going to try to do in this talk is try to give you a sense of what machine learning is and what it does.
If you're already familiar with machine learning, this will hopefully give you a different perspective on it. If you're not familiar with machine learning already, this should be quite fascinating and interesting. So there are five main paradigms in machine learning. And I will talk about each one of them in turn and then try to step back and see, what is the big picture and what is this idea of the master algorithm. The first way computers discover knowledge is by filling gaps in existing knowledge. Pretty much the same way that scientists work, right? You make observations, you hypothesize theories to explain them, and then you see where they fall short. And then you adapt them, or throw them away and try new ones, and so on. So this is one. Another one is to emulate the brain. Right? The greatest learning machine on earth is the one inside your skull, so let's reverse engineer it. Third one is to simulate evolution. Evolution, by some standards, is actually an even greater learning algorithm than your brain is, because, first of all, it made your brain. It also made your body. And it also made every other life form on Earth. So maybe that's something worth figuring out how it works and doing it with computers. Here's another one. And this is to realize that all the knowledge that you learn is necessarily uncertain. Right? When something is induced from data, you're never quite sure about it. So the way to learn is to quantify that uncertainty using probability. And then as you see more evidence, the probability of different hypotheses evolves. Right? And there's an optimal way to do this using Bayes' theorem. And that's what this approach is. Finally, the last approach, in some ways, is actually the simplest and maybe even the most intuitive. It's actually to just reason by analogy. There's a lot of evidence in psychology that humans do this all the time. You're faced with a new situation, you try to find a matching situation in your experience, and then you transfer the solution from the situation that you already know to the new situation that you're faced with. And connected with each of these approaches to learning, there is a school of thought in machine learning. So the five main ones are the Symbolists, Connectionists, Evolutionaries, Bayesians, and Analogizers. The Symbolists are the people who believe in discovering new knowledge by filling in the gaps in the knowledge that you already have. One of the things that's fascinating about machine learning is that the ideas in the algorithms come from all of these different fields. So for example, the Symbolists, they have their origins in logic, philosophy. And they're, in some sense, the most "computer-sciency" of the five tribes. The Connectionists, their origins are, of course, in neuroscience, because they're trying to take inspiration from how the brain works. The Evolutionaries, well, their origins are, of course, in evolutionary biology, in the algorithm of evolution. The Bayesians come from statistics. The Analogizers actually have influences from a lot of different fields, but probably the single most important one is psychology. So in addition to being very important for our lives, machine learning is also a fascinating thing, I think, to study, because in the process of studying machine learning, you can actually study all of these different things. 
Now each of these "tribes" of machine learning, if you will, has its own master algorithm, meaning its own general purpose learner that, in principle, can be used to learn anything. In fact, each of these master algorithms has a mathematical proof that says, if you give it enough data, it can learn anything. OK? For the Symbolists, the master algorithm is inverse deduction. And we'll see, in a second, what that is. For the Connectionists, it's backpropagation. For the Evolutionaries, it's genetic programming. For the Bayesians, it's probabilistic inference using Bayes' theorem. And for the Analogizers, it's kernel machines, also known as support vector machines. OK? So let's see what the key ideas in each one of these are. So the Symbolists-- here are some of the most prominent Symbolists in the world. There's Tom Mitchell at Carnegie Mellon, Steve Muggleton in the UK, and Ross Quinlan in Australia. And their idea is actually a very interesting one. It's to think of deduction-- sorry, it's to think of learning as being the inverse of deduction. Learning is induction of knowledge. Right? Deduction is going from general rules to specific facts. Induction is the opposite. It's going from specific facts to general rules. So in some sense, one is the inverse of the other. And so maybe we can figure out how to do induction in the same way that people in mathematics figure out how to do other inverse operations. Like, for example, subtraction is the inverse of addition, or integration is the inverse of differentiation, and so forth. So as a very, very simple example, addition gives you the answer to the question, if I add 2 and 2, what do I get-- and the answer, of course, is 4. And this is the deepest thing I'll say in this whole talk. And subtraction, of course, gives you the answer to the inverse question, which is, what do I need to add to 2 in order to get 4, the answer, of course, being 2. Now inverse deduction works in a very similar way. So here's a simple example of deduction. You know that Socrates is human and you know that humans are mortal. And the question is, what can you infer from that. Well, of course, the answer is that, from that, you can infer that Socrates, too, is mortal. Now the inverse of this-- and that's when it becomes induction-- is, if I know that Socrates is human, what else do I need to know in order to be able to infer that he's mortal. And of course, what I need to know is that humans are mortal. And so in this way, I have just induced a new general rule that humans are mortal. Of course, in general, I wouldn't just do it from Socrates, I would do it from Socrates and a bunch of other people. But that's the general way that this works. OK? And then once I've induced rules like this, I can now combine them in all sorts of different ways to answer questions that I may never even have thought of. And this kind of flexibility and composition is actually something that, of all the five tribes, only the Symbolists have. Now of course, these examples are in English and computers don't understand natural language yet. So what they use is something like first-order logic. So these things, both the facts and the rules that are discovered, are represented in first-order logic. And then questions are answered by chaining those rules, by reasoning with them. OK? But whether it's in logic or natural language, the principle is the same. OK? And as I said, of all the five paradigms, this is the one that is most like scientists at work. Right?
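To make the Socrates example concrete, here is a minimal sketch of induction as the inverse of deduction, in Python. This is an editorial illustration, not the actual inverse deduction algorithm: facts are toy (predicate, entity) pairs, and only single-premise rules of the form premise(X) => conclusion(X) are considered, whereas real systems such as inductive logic programming work over full first-order logic.

```python
# Toy illustration of induction as "inverse deduction":
# given human(socrates) and the observation mortal(socrates),
# propose the missing general rule human(X) => mortal(X).
facts = {("human", "socrates"), ("human", "plato")}
observations = {("mortal", "socrates"), ("mortal", "plato")}

def induce_rules(facts, observations):
    """Propose rules 'premise(X) => conclusion(X)' that, together with
    the known facts, would let us deduce every observation."""
    candidates = set()
    for concl_pred, entity in observations:
        for prem_pred, prem_entity in facts:
            if prem_entity == entity:
                candidates.add((prem_pred, concl_pred))
    # Keep only rules that hold for every entity the premise applies to.
    rules = []
    for prem, concl in candidates:
        entities = {e for p, e in facts if p == prem}
        if all((concl, e) in observations for e in entities):
            rules.append(f"{prem}(X) => {concl}(X)")
    return rules

print(induce_rules(facts, observations))  # ['human(X) => mortal(X)']
```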
They figure out, where are the gaps in my knowledge. Let me enunciate a general principle that will fill that gap. And then let me see what follows. Let me see if it's correct given the data. Let me see what gaps I identified and so on. And in fact, one of the most amazing applications of inverse deduction to date is actually a robot scientist. So if you look at this picture, the biologist is not the guy in the lab coat. The guy in the lab coat is actually a computer scientist and machine learning researcher by the name of Ross King. The biologist in this picture is this machine here. This machine is a complete, automated biologist. It starts out with basic knowledge of molecular biology, DNA, proteins, RNA, and all of that stuff. And then what it actually does is it formulates hypotheses using inverse deduction. It designs experiments to test this hypothesis using things like DNA sequences and whatnot. That's what's on there. It physically carries them out. So it does the whole process with no human help. And then, given the results, it refines the hypotheses, or comes up with new ones, and so on. OK? Now there's only one of these robots in the world today. Its name is Eve. There was actually a previous one called Adam. And Eve, last year, discovered a new malaria drug. And the thing that's amazing about this is that, well, once you've made one robot scientist like this, there's nothing stopping you from making a million. And I have a million scientists working on a problem where, before, maybe all we could afford was a few. So this can really speed up the progress of science. And I would say that in areas like molecular biology, there is no hope of really understanding, for example, very well, how cells work without this kind of thing. Right? There's just too much information for human beings to discover on their own. OK? Now this is one very, I think, attractive way to do machine learning, but of course, it's not the only one. In particular, the Connectionists are very skeptical of it. They say, well, this is too abstract. It's too clean. It's too rigid. Logic is not how the real world works. You have to do things more like human beings do. And of course, the way human beings do things is with their brains, so let's figure out how brains work, at least enough that we can take inspiration from them and build algorithms based on that. So this is what the Connectionists do. The most prominent Connectionist is Geoff Hinton. He actually started out as a psychologist 40 years ago and now he's really more of a computer scientist. And he's been incredibly determined in his goal of understanding how the brain learns. In fact, he tells the story that one day he came home from work very excited, saying, yea, I've done it. I've figured out how the brain works. And his daughter replied, oh dad, not again. And he's the first one to say that he's had some successes and failures. But I think the bottom line is that, these days-- for example, he's one of the co-inventors, of course, of backpropagation-- the successes far outweigh the failures. So definitely, the long-term research agenda has already given a lot of fruits. And it's going to give more. And two other prominent Connectionists are Yann LeCun and [INAUDIBLE]. Of course, Connectionism is also known as neural networks and, these days, as deep learning. But it's really all the same paradigm. OK? So how does it all work? Well, what we're going to do is we're going to build a model, a mathematical model of how a single neuron works. 
We're going to make it as simple as we can provided it's enough to learn and to do the inferences that we need it to do. OK? And then we're going to put these models of neurons together into big networks. And then we're going to train those networks. OK? And at the end of the day, what we have is, in some sense, a miniature brain, if you will-- of course, much simpler than the real one, but hopefully with some of the same properties. OK? So now a neuron is a very interesting kind of cell. All right? It's a cell that actually looks like a tree. There's the cell body, and then the trunk of the tree is what is called the axon, and then the branches are called dendrites. But where neurons get very different from trees is that the branches of one neuron actually connect back with the roots of another-- or in fact, with the roots of many others. And that's how you get a big network of neurons. And where the branches of one neuron join the roots of another, that's called a synapse. And to the best of our knowledge, the way humans learn is this: everything you know is encoded in the strengths of the synapses between your neurons. OK? So if the synapse is strong, then-- let me backtrack for just a second. The way this works is that the neurons communicate via electric discharges down their axons. Each is literally an electric discharge called an action potential. And what happens is that if you look at all the charge coming into a neuron through its various synapses, if it exceeds a certain threshold, then that neuron itself fires. And of course, then it sends currents to the neurons downstream. OK? And the synaptic process itself involves chemistry and whatnot, but those details are not important for us. OK? So what's going to happen is that the learning basically happens when a neuron helps to make another neuron fire. And then the strength of the connection goes up. This is called Hebb's Rule. And as far as we know, this is how all our knowledge is encoded-- in how strong the synapses are. If the neurons fire together a lot, then the synapses become stronger and it becomes easier for the first neuron to fire the second one. So this is the basic idea. Now what we have to do is, first of all, turn it into a model. That's not that hard. So the model is just this. What my neuron's going to do is it's going to do a weighted combination of the inputs. OK? Let's suppose, for example, that this was actually the retina. All right? So this is the input. Each one of these inputs is a pixel. And then each one of them gets multiplied by a weight that corresponds to the strength of the synapse. And if that weighted sum exceeds a threshold, then I get 1 as the output. Otherwise, I get 0. So for example, if this neuron is trying to recognize a cat, if this is a cat, then hopefully what happens is that this weighted sum will be high enough that this will fire and the neuron will say, yes, this is a cat. Now this is all easy. This goes back to the 1950s, and Frank Rosenblatt, and whatnot. The really hard, and interesting, and important question is, how do you train a whole network of these neurons. That's when things actually become very difficult. Because if you have a big network of neurons-- so here are my inputs. They go to one set of neurons. Right? These are the functions that they compute. Right? They're in purple here. And then, those go to another layer, and many, many layers, until finally, you get the output. Now the question is, well, if the output is correct-- right?
So let's say this was a cat and the network said that, yes, it is a cat, then life is good. Nothing needs to change. Right? Why go fixing what ain't broke? But what happens if there was an error? Right? This should have been firing, but wasn't. The key question is, what do I change in that whole big, messy network to try to make it give the right answer tomorrow. There is no obvious answer to this question. Because think of one neuron somewhere in the middle of the network. How is it responsible for the error at the output? All right? The error at the output could have come from an infinitude of different places. This is called the credit assignment problem. Or maybe it should be called the blame assignment problem, because it's deciding who's to blame for an error, and therefore, needs to change. And this is the problem that backpropagation solves. All right? And when people first came up with neural networks in the '60s, they didn't have this. It was when, finally, it was invented in the '80s by David Rumelhart and others that things really took off. And the basic idea in backprop is actually quite intuitive, I think. It's the following. Well, let's think of the difference between my actual output and my desired output. Let me call that delta. Right? This is the error. The output should have been 1, but it was 0.2. So it needs to go up. The next question is, what can I tweak in these weights to make it go up. All right? Well, think about the weights at this last layer. At that last layer, the neurons with the highest weights are the ones most responsible for the result. And if this neuron is saying no but the answer is yes, well, then its weight needs to go down, because it's preventing it from firing. And if this one was saying yes but this is not firing, then its weight needs to go up. So what I do is I compute the derivative of this error with respect to the weights in the last layer. And then with those, I now have an error signal coming from these neurons. It's like, oh, you should have been higher. All right? You should have been higher in order to make this guy fire, because you have a positive weight. You should have been lower, because you have a negative weight and you're preventing it from firing. So now I know an error signal at the neurons at this layer. And I can keep doing the same thing all the way back to the input. And this is why it's called backpropagation. Because what I'm doing is I'm propagating back the errors and then updating the weights, changing the weights in order to make that error as small as possible. OK? Well, this is the backpropagation algorithm. It's what's at the heart of deep learning. And these days, this is used for just about everything on Earth. Right? Very early on, people used it to do things like predict the stock market. These days, you use it for search, use it for ad placement, use it for video recognition, use it for speech recognition, use it for things like simultaneous translation and whatnot. But I think, at least for the public, the best known instance of this is still Google's own famous cat network. Unfortunately, people think it's-- it got called "the cat network" by the journalists, right, because it happened to recognize cats very well. Actually, it wasn't an accident, right? This was all learned from YouTube videos, right, as you probably know. And people really like to upload videos of their cats. So there was more data on cats than on anything else.
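For readers who want to see the weighted sums and the backpropagated error signals spelled out, here is a minimal sketch of a one-hidden-layer network trained by backpropagation. It is an editorial illustration only: the XOR task, the layer sizes, the learning rate, and the squared-error gradient are chosen just to keep the example small, and bear no relation to the networks discussed in the talk.

```python
import numpy as np

# Minimal one-hidden-layer network trained with backpropagation.
# The XOR task, sizes, and learning rate are illustrative only.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for step in range(10000):
    # Forward pass: each unit takes a weighted sum of its inputs and squashes it.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: the output error, propagated back layer by layer.
    delta_out = (out - y) * out * (1 - out)      # error signal at the output units
    delta_h = (delta_out @ W2.T) * h * (1 - h)   # blame assigned to the hidden units

    # Each weight moves in proportion to its share of the blame.
    W2 -= lr * h.T @ delta_out
    b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_h
    b1 -= lr * delta_h.sum(axis=0)

print(np.round(out.ravel(), 2))  # should move toward [0, 1, 1, 0]
```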
But maybe it should be called "the couch potato network," because it's based on watching a lot of video. And at the time, this was the biggest neural network ever built. It had, I think, on the order of a billion parameters. But of course, these days, there's much bigger ones and they're continuing to grow. So we'll see how far we can take this paradigm. Now the Evolutionaries, they say, well sure, but all you're doing is adjusting the weights on this model. You're just tweaking the strengths of the synapses. Where did that brain come from? That, in some sense, is the bigger question. That brain was produced by evolution. So maybe what we really should do if we want to learn really powerful things is figure out how evolution works. And we actually already have a good sense of how evolution works. There's a lot of details, of course, that are still to be understood. But at a high level, we understand how evolution works. So let us do the same thing on the computer. And the first guy who really ran with this idea was John Holland. He started in the '50s, '60s. And for a long time, this whole area of evolutionary computing, the joke was that it was John, his students, and their students. But then in the '80s, things took off and a lot of people started doing it. John Holland called what he did "genetic algorithms." And then John Koza developed a more powerful version called genetic programming, which we'll see in a second. And Hod Lipson is one of the people who are doing very interesting things with evolutionary learning these days, as we'll see in a bit. So what's the basic idea here? Well, how does evolution work? You have a population of individuals, each of which is described by its genome. OK? But in our case, in the computer, the genome, instead of being base pairs, instead of being DNA, it's just going to be bits, right, because 0 and 1, in some sense, are the DNA of computers. And then each of these individuals gets to go out in the world and be evaluated. It gets evaluated at the task that it's supposed to be doing. And the individuals that do better will have a higher fitness and will therefore have a higher chance of being the parents of the next generation. You get two very fit parents and you cross over their genomes just like people do. And so now you have a child genome that is partly the genome of one parent and partly the genome of the other parent, the mother and father genomes, if you will. And then you also have random mutation, right? Some bits just get randomly mutated because of copying errors or whatever. And then you have a new population. And what's remarkable is that you could actually start out with a population that is essentially random. And after you do some number of generations of this, you actually have things that are doing a lot of non-trivial functions. Like, for example, you can evolve circuits that-- you can develop radio receivers, and amplifiers, and things like that in just this way. And they often work better than the ones that are designed by humans. In fact, people like John Koza, they have a whole bunch of patents that were actually-- the patent designs were actually invented by the genetic algorithms, not by the humans. OK? Now Koza's idea was actually to take this one step beyond. And that's the following. Well you know, these strings are a very low level representation. And just cutting a string in the middle to cross over is probably going to muck things up. Right? We are trying to evolve programs, right? 
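Here is a minimal sketch of the bit-string scheme just described: a population, a fitness function, crossover, and mutation. Everything specific in it is invented for illustration; the fitness function simply counts 1-bits (the classic "OneMax" toy problem), and selection is done by truncation rather than the fitness-proportional selection the talk alludes to.

```python
import random

random.seed(0)
GENOME_LEN, POP_SIZE, GENERATIONS = 20, 30, 40

def fitness(genome):
    # Stand-in evaluation: reward genomes with more 1-bits ("OneMax").
    return sum(genome)

def crossover(mom, dad):
    # Cut both genomes at a random point and splice them together.
    cut = random.randrange(1, GENOME_LEN)
    return mom[:cut] + dad[cut:]

def mutate(genome, rate=0.01):
    # Flip each bit with a small probability, mimicking copying errors.
    return [1 - b if random.random() < rate else b for b in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]
for gen in range(GENERATIONS):
    # Fitter individuals are more likely to become parents (truncation selection).
    parents = sorted(population, key=fitness, reverse=True)[:POP_SIZE // 2]
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(POP_SIZE)]

print(max(fitness(g) for g in population))  # approaches GENOME_LEN
```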
At the end of the day, that's all we're trying to do is evolve programs. So why don't we actually work directly with the programs themselves? And a program is really a tree of subroutine calls all the way down to simple operations like additions, multiplications, and ands, and ors. So let's represent the programs as trees. And then in order to cross over two programs, what we're going to do is we're going to randomly pick a node in each of the trees and then we're going to swap the sub-trees at those nodes. And now we have two child programs. OK? So for example, here, if I do the crossover at this point, then one of the sub-trees that I will get is the one that's all in white. And this tree actually represents one of Kepler's laws. This is the law that gives the duration of a planet's year as a function of its average distance from the sun. It's actually a constant times the square root of the cube of the distance. And that's what this is representing. So genetic algorithms can actually discover a thing like this and much more complex ones as well. In fact, these days, what the genetic folks are having a lot of fun with is something that's exciting and scary at the same time. They're not just doing this with software anymore. They're not doing it as a simulation inside the computer, but actually doing it out in the physical, real world with robots. You literally start with robots that are random piles of parts. And then once those robots are good enough, they actually get printed by a 3-D printer. And then they start crawling, and walking, and doing things in the real world-- seeing how fast they can crawl, trying to recover from injury, and so on and so forth. This is actually a real robot from Hod Lipson's lab. OK? And then what happens is that in each generation, the fittest robots get to program the 3-D printer to produce the next generation of robots. So if "Terminator" comes to pass, this might be how we get there. Of course, these robots are not yet ready to take over the world, but they've already come remarkably far from the soup of parts that they began with. Right? And these days, if you see a little spider running around, take a good look. Because it could be one of these instead of one of the biological ones. All right. So the Bayesians-- so here are some famous Bayesians. Judea Pearl, within computer science, is probably the most famous one. He actually won the Turing Award, the Nobel Prize of computer science, a few years ago for inventing something called Bayesian networks, which is one very powerful type of Bayesian model: a big graph where each node is a variable, and the edges between the nodes represent dependencies between the variables. Two other famous Bayesians are David Heckerman and Mike Jordan. Bayesians are known, in machine learning, as the most fanatical of the five tribes. All right? They really have a near-religious devotion to their paradigm, and you know, they're the first ones to say so. And I think the reason for this is that Bayesians had their origins in statistics. And for 200 years, Bayesians were a persecuted minority in statistics. So they had to become very hard-core in order to survive. And it's a good thing that they did survive, because they have a lot to contribute. And these days, with computers, and algorithms like Markov chain Monte Carlo, and large data, and whatnot, they're actually on the ascendant in statistics. So what is the basic idea behind Bayesian learning?
Well, the basic idea is that, as I said, everything that you learn is uncertain. So what you have to do is compute the probability of each one of your hypotheses and then update it as new evidence comes in. And the way you do that is with Bayes' theorem. OK? And Bayesians love Bayes' theorem so much that there was this Bayesian machine learning startup that actually had a neon sign of Bayes' theorem made. And then they stuck it outside their office. So there's Bayes' theorem in big, neon letters. OK? So how does Bayes' theorem work? Bayes' theorem is actually incredibly simple. It's so simple it's barely worth being-- it would be barely worth calling a theorem if it wasn't so important. So the idea is this. So let's suppose that you have all your hypotheses. You define your space of hypotheses in some way. And it could be, for example, the set of all Bayesian networks, or the set of all neural networks, or all decision trees, or whatever. And now the first thing that you're going to have is the prior probability for each hypothesis. This is how much you believe in that hypothesis before you've even seen any data. And this is actually what makes Bayesian learning very controversial-- many statisticians say, well, you have no grounds on which to just make up a prior. And the Bayesians' answer to that is, you have to make that up, whether explicitly or implicitly. So let's just be explicit about it. So the prior is how much you believe in each hypothesis before you see the evidence. But then what happens is that as the evidence comes in, you update the probability of each hypothesis. A hypothesis that is consistent with the data will see its probability go up. A hypothesis that is inconsistent with the data will see its probability go down. OK? And the consistency of the hypothesis with the data is measured by what's called the likelihood function, which is the probability of seeing the data if the hypothesis is true. OK? And this is actually not that different from frequentist statistics and the maximum likelihood principle, which is basically saying that if your hypothesis makes what you're seeing likely, then conversely, what you're seeing makes your hypothesis likely. OK? And the Bayesians incorporate that in the likelihood. And the product of the likelihood and the prior is just the posterior, which is how much you believe the hypothesis after you've seen the evidence. So as you see more evidence, the probabilities evolve. And hopefully, at the end of the day, one hypothesis will come out as clearly better than the others. But that won't necessarily be the case. You might still be entertaining a lot of hypotheses even after you've seen a lot of data. There's also the marginal, which is just something that you have to divide by to make sure that the probabilities add up to 1, so let's not worry about it for now. And a lot of great things have been done with Bayesian learning. Like, for example, self-driving cars have Bayesian learning in their brains. So in some sense, Bayes' theorem is helping to drive that car or helping the car to learn how to drive. And without it, it would be much harder. But one application of Bayesian learning that everybody is probably familiar with is spam filtering. The first spam filter was actually designed by David Heckerman and his coworkers. And they just used this very simple Bayesian learner called the naive Bayes classifier. And the way it works there is the following.
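As a concrete rendering of the prior-times-likelihood update just described, here is a minimal naive Bayes sketch for spam scoring. The prior and the per-word probabilities below are invented purely for illustration; real filters estimate them from large collections of labeled mail.

```python
# Naive Bayes spam scoring: posterior is proportional to
# prior x product of per-word likelihoods. All numbers are made up.
prior_spam = 0.9

# P(word appears | spam) and P(word appears | not spam)
likelihood = {
    "viagra": (0.20, 0.001),
    "free":   (0.50, 0.05),
    "alice":  (0.01, 0.30),   # a hypothetical best friend's name
}

def spam_probability(words):
    p_spam, p_ham = prior_spam, 1.0 - prior_spam
    for word in words:
        if word in likelihood:
            l_spam, l_ham = likelihood[word]
            p_spam *= l_spam
            p_ham *= l_ham
    # Divide by the marginal so the two posteriors add up to 1.
    return p_spam / (p_spam + p_ham)

print(round(spam_probability(["free", "viagra"]), 3))  # close to 1.0
print(round(spam_probability(["free", "alice"]), 3))   # noticeably lower
```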
The hypothesis, before I've seen evidence, is that the email is a spam or that the email is not a spam. OK? And the prior probability is like your prior probability of an email being spam-- 90%, 99%, 99.999%-- take your pick. And then the evidence is the actual content of the email. So for example, if the email contains the word "Viagra," that probably makes it more likely to be spam. If it contains the word "free" in capitals, that makes it even more likely to be spam. And if that "FREE" is followed by four exclamation marks, that makes it even more likely to be spam. OK? On the other hand, if it contains the name of your best friend on the signature line, that actually makes it less likely to be spam. OK? And so what the naive Bayes classifier does is it incorporates that evidence. And at the end of the day, it computes a probability that the email is spam or not spam taking all of that into account. OK? And then based on that probability, you can decide whether to filter it out or actually show it to the user. And we're all grateful that spam filters exist. Otherwise our mailboxes would be unmanageable. These days, all sorts of different learning algorithms get used for spam filtering. But Bayesian learning was the first one and it's still used in many spam filters. OK. Finally, the Analogizers-- so as I mentioned, the basic idea of the Analogizers is that everything that we do, everything that we learn is reasoning by analogy. It's looking at similarities between the new situation that we need to make a decision in and the situations that we're already familiar with. OK? And one of the early pioneers in this area was Peter Hart. He proved some things related to the nearest neighbor algorithm, which we'll see shortly, which is kind of like the first similarity-based algorithm. Vladimir Vapnik is the inventor of support vector machines, a.k.a. kernel machines, which is the most widely used and most successful type of similarity-based learner. But these are both actually still fairly primitive forms of analogical reasoning. There are some much more sophisticated ones that people, like for example, Douglas Hofstadter, work on. And Douglas Hofstadter, of course, is also famous. He's not just a quantitative scientist and computer scientist, he's also famous as the author of "Godel, Escher, Bach," which, ironically, is actually, itself, an extended analogy between Godel's theorem and the music of Bach and the art of Escher. And in fact, his most recent book is 500 pages arguing that all of intelligence is just analogy. So he really does think that analogy is the master algorithm. And in fact, "the terminologizer" was coined by him. So let's see how this works. And I'm going to do it by proposing to you a very small puzzle. And the puzzle is this. Let's suppose I give you a map of two countries, which I fancifully called "Posistan" and "Negaland" because of positive and negative examples. And I don't tell you where the frontier between the two countries is. I just tell you where the major cities are. So the major cities in Posistan are the plus signs, Positiveville is the capital, and similarly for the negative cities. OK? And now my question to you is the following. If I only tell you where the major cities are, can you tell me where the frontier, where the border between the two countries is? Of course, you can't know for sure, right, because the cities don't determine the border. But that's what the machine learning problem is, right? You have to generalize. 
Now the nearest neighbor algorithm has a very simple answer to this question. It just says, I'm going to assume that a point on the map is in Posistan if it's closer to some positive city than to any negative city. OK? And the effect that this has is to divide the map into the neighborhood of each city. And then Posistan is just the union of the neighborhoods of the positive cities. The neighborhood of a city is the points that are closer to it than to any other city. OK? And then as a result, you get this jagged, straight-line frontier. And what's remarkable about this is that even though the nearest neighbor algorithm is extremely simple-- in fact, at learning time, it doesn't do anything, right? It's an O of 0 algorithm. It just sits there, doesn't have to do anything-- it can actually form very, very complicated frontiers. In fact, if you give it enough data, it will converge very closely. In fact, it will converge to the best possible hypothesis that you could ever have if you use more than one neighbor. But let's not go into those details. Now there's a couple of things here that are not ideal. One is that this line is probably not quite the right one, right? Because the real frontier is probably smoother. It's not jagged like that. And another one is that if you look at this map closely, you could actually throw away some of the cities, like this one, for example, and it wouldn't change anything. If you threw away this city, its area gets absorbed by the areas of these two other cities. And the frontier doesn't change at all. The only cities that you need to keep are the ones that actually take part in defining the frontier, the so-called "support vectors." In general, these are vectors in hyperspace. And they're called support vectors because they're the ones that keep the frontier where it is. So often, you can throw away the great majority of your examples and that doesn't change anything. And of course, in this example, this doesn't matter. But when you have a data set with millions or billions of examples, that does matter. And support vector machines, or kernel machines for short, solve both of these problems. So they have a learning procedure that lets them throw away all the examples that are not necessary to define the frontier, so they leave only these. And they also make the frontier smoother. And the way they draw the frontier is by saying, well, let me try to walk from south to north while keeping the positive cities on my left and the negative cities on my right. But I want to always stay as far away from them as I can. OK? Think of the cities as mines and think of this as a minefield. Right? If I told you to walk along a minefield, you would try to give each mine the widest possible berth. You would try to maximize your margin of safety. And that's exactly what support vector machines do: they maximize the margin between the frontier and the nearest examples. And in the days before deep learning took off, support vector machines were probably the most powerful type of learning that was commonly used. OK? All right, let's look at one application of this. Again, this type of analogy-based learning, people have been doing it since the '50s, so it's used for just about everything on Earth. There is, however, one application that I think everybody has experienced even if they don't know it's an application of analogy-based learning. And that is recommender systems. OK? So for example, let's say that I want to figure out what movies to recommend to you.
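Here is a minimal sketch of the nearest-neighbor rule on the Posistan/Negaland map described above; the city coordinates are invented for illustration. Note that all the work happens at query time, which is the sense in which the learner "just sits there"; kernel machines then add the margin-maximizing step on top of this raw idea.

```python
import math

# Nearest-neighbor classification of map points: a point is in Posistan
# if its closest "city" (training example) is positive. Coordinates invented.
cities = [
    ((1.0, 1.0), "+"), ((2.0, 3.0), "+"), ((1.5, 4.0), "+"),   # Posistan
    ((5.0, 1.5), "-"), ((6.0, 3.5), "-"), ((5.5, 4.5), "-"),   # Negaland
]

def classify(point):
    # "Learning" does no work at all; all the effort happens at query time.
    nearest = min(cities, key=lambda c: math.dist(point, c[0]))
    return nearest[1]

print(classify((2.0, 2.0)))  # '+'  -> inside Posistan
print(classify((5.0, 3.0)))  # '-'  -> inside Negaland
```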
The idea that folks had almost 20 years ago now was a very simple one. It is, let me not try to do that based on the properties of the movie, because that's hard. Right? You know, people's tastes are very complicated things. What I should do is what is called "collaborative filtering," is find people who have similar tastes to you, meaning they gave five stars to a movie that you gave five stars to, they gave one star to movies that you gave one star to. And now if they give five stars to a movie that you haven't seen, then I'm going to hypothesize that, by analogy, you will also like that movie. And so I will recommend it to you. OK? And this turns out to work spectacularly well. It works so well, in fact, that 3/4 of Netflix's business comes from its recommender system. And Amazon also has a recommender system. About 1/3 of its business comes from its recommender system. And every e-commerce site worth its salt has something like this. And of course, these days, people use all sorts of different learning algorithms for it. But the first one was this nearest neighbor type method. And it's still one of the best. OK? So stepping back, we've met the five tribes of machine learning. We've seen that each one of them has a problem that it can solve better than all the others. And it has a particular master algorithm that solves that problem. So for example, the problem that the Symbolists solve that none of the others know how to solve is the problem of learning knowledge that you can compose in many different ways. And they learn that knowledge with inverse deduction, as we saw. Connectionists solve the credit assignment problem using backprop. Evolutionaries solve the problem of learning structure, right? The Connectionists just start with a fixed structure and just adjust the weights. The Evolutionaries know how to come up with that structure in the first place using genetic programming. The Bayesians are all about uncertainty. They know how to deal with the fact that all the knowledge that you learn is uncertain. They know how to update the probabilities of hypotheses as they see more data. They use that using probabilistic inference, which is essentially computationally efficient ways to apply Bayes' theorem to very large sets of hypotheses. And finally, the Analogizers, they can reason by similarity. They can actually generalize from just one or two examples. All right? Think of Niels Bohr. He came up with the first theory of quantum mechanics by doing an analogy between the atom and the solar system, generalizing from one example. None of the others can do this. And probably the best known analogizer algorithm these days is kernel machines. But the thing that I want to point out is that, precisely because each of these problems is real and important, none of the individual algorithms are enough. What we really need is a single algorithm that solves all five problems at the same time. We need a "grand unified theory" of machine learning in the same sense that the standard model is a grand unified theory of physics or the central dogma is a grand unified theory of biology. And in fact, a bunch of us have been working on this problem for a while. And we've actually made a lot of progress. We're still far from the end of this, but let me just give you a sense of where we are at this point. So if you think for a moment about this problem of, so we have these five algorithms or these five types of learning, how can we unify them all into one, at first, this seems like a very hard problem. 
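Returning for a moment to the collaborative-filtering idea described above, here is a minimal sketch: find the user whose past ratings most resemble yours, and recommend what they loved and you have not seen. The names, movies, ratings, and similarity measure are all invented for illustration; production recommenders use far more data and more careful similarity and aggregation schemes.

```python
# User-based collaborative filtering in miniature: recommend to a target user
# the items liked by the user whose past ratings most resemble theirs.
# Ratings below are made up for illustration.
ratings = {
    "you": {"Alien": 5, "Heat": 4, "Amelie": 1},
    "ana": {"Alien": 5, "Heat": 5, "Amelie": 1, "Blade Runner": 5},
    "bob": {"Alien": 1, "Heat": 2, "Amelie": 5, "Notting Hill": 5},
}

def similarity(a, b):
    # Agreement on commonly rated movies (smaller rating gap = more similar).
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    return 1.0 / (1.0 + sum(abs(ratings[a][m] - ratings[b][m]) for m in common))

def recommend(user):
    # Take the most similar other user and suggest their unseen five-star movies.
    neighbor = max((u for u in ratings if u != user),
                   key=lambda u: similarity(user, u))
    return [m for m, r in ratings[neighbor].items()
            if r == 5 and m not in ratings[user]]

print(recommend("you"))  # ['Blade Runner']
```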
And in fact, some people have claimed that it's an impossible problem to solve. It seems very hard, because the algorithms all look very different. But if you look at them closely, actually, they're not that different. They all have the same three parts: representation, evaluation, and optimization. So let's look at what those parts are and then how we can do the unification. So representation is how the learner represents what it's learning, the model or the program that it's learning. Right? It's, in some sense, the programming language in which the learner is going to write the algorithm that it discovered. Typically, it's not going to be Java, or C++, or anything like that, it's going to be something like first-order logic. All right? But it could be differential equations. It could be a linear regression. It could be all sorts of things. So the first thing that we need to do is to unify the representations. And a natural thing to do here is to start with the representations that the Symbolists use, which are variations on first-order logic, and the representations that the Bayesians use, which are generally known as graphical models. Bayesian networks are one type of graphical model. Another type is Markov networks and so on. Each of these is already extremely general. If you can combine the two, you can pretty much represent anything you might want to represent. Any computer program can, for example, already be represented in first-order logic. Any way to deal with uncertainty and weighing evidence can be represented in graphical models. There's bazillions of different models that people have in statistics that all fit into that framework. So if we can combine the two, then we have a very good representation to start with. And indeed, we have done that. In essence, what we've developed is various forms of probabilistic logic. So this is a logic that also incorporates probability and uncertainty. And the most widely used is called Markov logic networks. It's essentially a combination of logic and Markov networks. And it's very simple. It just starts with formulas in first-order logic. Think of a rule in logic, like if this, then that, for example. And then what it does is it gives each rule a weight. So if you really believe the rule, you give it a high weight. If you're not sure, you give it a lower weight. And then the probability of a state of the world goes up with the number and the weight of the rules that are true in that world. OK? So with this, we can represent pretty much anything that we'd like. Now the next part of every learning algorithm is the evaluation. The evaluation is the scoring function that tells me how good a candidate model is. How well does it fit the data? How well does it fit my purposes? In essence, the learning problem is to find-- within the space defined by the representation-- the program that maximizes my evaluation function. OK? So what should our evaluation function be? Well, one obvious candidate is just the posterior probability that Bayesians use. And that, again, already has a lot of other things as special cases. But more generally, the evaluation shouldn't really be part of the algorithm. It should be provided by the user. It's for the user to decide what the learner should be optimizing. So if you're a company and your purpose is to maximize profits, then that's what the evaluation function should be.
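In symbols, the Markov logic probability described above ("the probability of a state of the world goes up with the number and the weight of the rules that are true in that world") takes the standard log-linear form, where $w_i$ is the weight of formula $i$, $n_i(x)$ counts how many groundings of formula $i$ are true in world $x$, and $Z$ sums over all possible worlds so the probabilities add up to 1:

$$
P(X = x) \;=\; \frac{1}{Z} \exp\Big(\sum_i w_i\, n_i(x)\Big),
\qquad
Z \;=\; \sum_{x'} \exp\Big(\sum_i w_i\, n_i(x')\Big).
$$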
If you're a consumer and your purpose is to maximize your happiness, then that's what should be maximized-- some measure of your happiness. OK? So what the master algorithm should be able to do is take anybody's objective function and then learn to optimize that. OK? Finally-- right? I just said the word "optimize"-- the third part of this is optimization, which is, how do we actually find the model that maximizes that function. OK? And here, there's a natural combination of ideas from genetic programming and backpropagation, namely, to discover formulas, we can use genetic programming. Each formula in first-order logic is a tree. Right? And now I can cross these trees over and apply the genetic process to come up with better formulas that better describe my domain. And then once I have those formulas, of course, if I'm using Markov logic, I need to come up with weights for those formulas. But of course, this is where backprop comes in. All right? I have my big chain of reasoning involving many different formulas, and facts, and different steps. And all of those have weights. And in order to learn those weights, I can naturally use backpropagation. OK? So we're pretty far along in this. We haven't succeeded yet. But some people think it's only a matter of time before we do. I'm actually a little less gung-ho. I think that even after we've successfully unified these five paradigms, there are still major ideas missing, ideas that we haven't had and without which we will not have a truly universal learner. And in fact, part of my goal in writing the book was to open it up to people who are not already machine learning researchers so they can think about the problem and maybe have ideas that people who are already thinking along the tracks of one of the tribes wouldn't have. So if you figure out how to solve this problem, let me know so I can publish your solution. OK. Let me conclude by talking a little bit about what I think that the master algorithm will enable that we cannot do today. I have four items here. There are, of course, more. But I think these four give a good sense of just how momentous a development this would be. The first one is home robots. We would all love to have robots that cook dinner for us, and do the dishes, and make the beds, and whatnot. But why don't we have them today? Well, first of all, it can't be done without machine learning, right? There's just no way to program a robot to do all the things that it might have to do. But second of all, the learning algorithms that we have today are not good enough. Because a robot, a home robot, in the course of a normal day, will run into all five of those problems multiple times. So it needs to be able to solve all of them. So with a master algorithm, we're on our way. But without it, I think it'll be much harder and much slower if we ever get there. Here's another one. Everybody, including, of course, Google, has a project to try to turn the web into a knowledge base. Right? Instead of issuing keyword queries and getting back pages, what I would like to do is ask questions and get answers. But for that, all the knowledge that's on the web has to be represented in a way that the computer can reason with, something like, for example, first-order logic. On the other hand, the web is full of contradictions, and noise, and gaps, and whatnot, so it's going to be very noisy. So I need probability. OK?
So again, I need to be able to combine those five types of learning in order to really be able to extract this knowledge base from the web. OK? Here's perhaps the most important one, cancer. Why haven't we cured cancer yet? The problem is that cancer is not one disease. Right? Everybody's cancer is different. And in fact, the same cancer mutates as it goes along. So it's very unlikely that there will ever be one drug that cures all cancers. The real cure for cancer-- or at least, that's what an increasing number of cancer researchers believe-- is something like a learning program that takes in the genome of the patient, the medical history, the mutations in the tumor cells, and predicts, for that tumor, what is the drug that's going to kill it without harming the patient's normal cells-- or maybe the sequence of drugs, or a combination of drugs, or perhaps even a drug that will be designed from scratch for that particular patient. OK? In some ways, this is not a very different problem from recommender systems that recommend a book or a movie to you, except that what it recommends here is a drug. The problem, however, is that this is orders of magnitude harder than recommending a movie or a book. You have to understand how the cell works. You have to understand the interactions between the genes and the proteins that they make. And then they go back and regulate the genes. And it's when this machinery gets out of whack that you get cancer. OK? The good news is we actually have a lot of data to do this, things like microarrays, and sequences, and whatnot. But again, with the learning algorithms that we have today, we're not going to be able to do it. With something like the master algorithm, we will be able to do it. OK. Finally, apropos of recommender systems, let me mention this one. What I would really like to have as a consumer is not 500 different recommender systems recommending 500 different things to me-- Netflix recommending movies and Amazon recommending books and Facebook selecting updates and Twitter selecting tweets. What I want is a complete 360-degree model of me. Learn from all the data that I generate. And then that model knows me much better than all these tiny, little models and, as a result, can make much better recommendations-- and not just recommendations of small things, but recommendations of jobs, recommendations of houses, recommendations of what to major in, or-- oops, I guess I skipped this slide here. As the foremost city of the federal government said, if we use these things, we can actually have a recommender system that is, in essence, your best friend throughout your life, recommending the things that you need at every step. OK? And again, in order to do that, we need not just the data, which, increasingly, we have, we need the algorithms that are powerful enough to learn that rich model of you. OK? More about all these things in the book, "The Master Algorithm." And thank you for listening and I'll take questions. [APPLAUSE] AUDIENCE: The representation that you proposed, the Markov logic network-- does it cover natural language? Like, is there a one-to-one mapping between the representation and the-- PEDRO DOMINGOS: Yeah. In fact, one of the biggest areas where Markov logic has been applied is natural language.
It's a very good match for natural language, because natural language is both very compositional-- right, so you need the logic for that-- and also very ambiguous, very uncertain, very noisy. So you need a probability for that. So people, at this point, have applied Markov logic networks to pretty much every major problem in natural language, and won competitions using it, and so forth. So in some ways, natural language is one of the killer apps for Markov logic networks. AUDIENCE: Is there a way for us to learn to use it in the genetic programming algorithms? PEDRO DOMINGOS: So the question is, are we able to learn Markov logic networks using genetic programming. People haven't done that yet. So we've learned-- there's a wide range of algorithms for learning the structure of Markov logic networks. They are similar to genetic programming, but without the crossover. All right? The Evolutionaries really believe in crossover. Everybody else in machine learning thinks something like hill climbing, greedy search, or beam search is probably enough. And there are many different search methods that have been used and that work pretty well. It's actually a good question. That part has not been done yet, whether actually doing the crossover between the formulas will help. AUDIENCE: Related question to genetic programming-- how is the solution space limited? Because it seems that combining various sub-trees would lead to a rapidly fast, exponentially growing solution space. And has there been any success with NP-hard problems? PEDRO DOMINGOS: So something that people have observed in genetic programming is that the trees tend to get bigger and bigger over time. People jokingly call it the "survival of the fattest." What you can do to combat that is to actually have a bias, a penalty on the size of the trees. So if the trees are big, they're less fit just because of that. You put that into your objective function. AUDIENCE: NP-hard problems-- any success in that area with genetic algorithms or programming? PEDRO DOMINGOS: Well, the short answer to that is yes and the longer answer is yes, but. A lot of the problems that people are approaching with genetic algorithms are NP hard problems. The question that the people who aren't Evolutionaries ask is, did you really solve-- so first of all, they're NP-hard problems, right? So you can't solve the worst, the hardest instances efficiently. But there are many instances that aren't the hardest. And so, can you solve them well enough? And there are examples of genetic algorithms solving these problems. But there are also counterexamples of people saying, look, I could have done that with hill climbing and it would have been just as fast or just as good. So the jury's still out on this. AUDIENCE: So with your 360-degree recommender systems, what's to keep that from being self-fulfilling or guiding a person into some particular path that's not, perhaps, [INAUDIBLE]? PEDRO DOMINGOS: What's going to keep that from being self-fulfilling is the way machine learning works, which is the system recommends something to you. Or for example, it recommends 10 alternatives to you. All right? Or let's say it recommends one, but then you say, no, this was the wrong one. Then it learns from that. So the recommender system is not something that you learn one day offline and then you use it. It's something that is continuously learning from what you do. 
So if it starts doing the wrong thing, you start being unhappy and you start displaying that unhappiness in one form or another. And then it learns, from that, to try something else. And also remember, these systems, they can talk with each other to the extent that you decide that they can. So it won't just be learning from you, it will be learning from a lot of people. All right? So it's always the question of, how far do you have to generalize, right? The more data that you have, the easier the problem becomes. And the more you have a continuous loop of feedback, the more robust the learning is. AUDIENCE: The 360 recommender, how do you square it with the need for privacy for people? PEDRO DOMINGOS: Yeah, so that's a very important question which I didn't allude to. I want to have a 360-degree model of me, but I want it to be under my control. Because if somebody else has a model of me that knows me better than my best friend, they have too much power. So I think what needs to happen is that-- right now, what happens is that the data that you generate is spread all over the place. It would be good to bring it all to one place, but that place has to be under your control. And I think one way that this could work is that you could have a company that is to your data like a bank is to your money. Right? You put your money in the bank, and then the bank invests it for you, and so on and so forth. But ultimately, the money is under your control. So I think the same thing should happen with your data. So this company aggregates the data. It learns the model. It uses the model on your behalf. But at the end of the day, it's for you to decide whether, maybe, for example, you want to take the data somewhere else or you want to do something else with it. I think if you don't do that, people won't trust this enough for it to happen. AUDIENCE: What are your thoughts on TensorFlow? PEDRO DOMINGOS: I haven't played with TensorFlow myself. I think, first of all, it's great that TensorFlow has been released. I think releasing open source software like this is a large part of how we make progress. And definitely, deep learning and doing it on a large scale are very important. There are a number of other alternatives out there. We'll see how TensorFlow compares with them. And we'll also see-- like just from my own point of view-- I have some of my students working on deep learning-- the question with each one of these systems is, what does it support well versus not. I think if the learning that you're trying to do fits within the paradigm of TensorFlow, then it's probably a good thing to use. But if it doesn't, then you may need something else. Or maybe what's going to happen is that there will be an extension of TensorFlow to do these things in the future. AUDIENCE: One of its goals, actually, is to become a universal system for expressing AI solutions. PEDRO DOMINGOS: Yeah, and I very much sympathize with that goal. I think that in order to be a universal system, you have to cover these five paradigms. But one way to cover them is to start from one-- let's say deep learning-- and then try to absorb another, and another, and another. And there are people doing this kind of thing. Like for example, we have the Alchemy system, which combines Symbolist and Bayesian learning. We've also developed, for example, combinations of symbolic learning and instance-based learning, and of these with neural networks.
So what I hope to see from a platform like TensorFlow is see it absorbing more and more of these capabilities. And absorbing them doesn't necessarily mean going and giving us some primitives to do what that school does. That's OK, but it also increases the complexity. The ideal thing would be you still have something that is simple, but yet with that simple thing, you can do all these combinations of things. And I do think that's possible. And I'm curious to see how this is all going to evolve in the next several years. MALE SPEAKER: And with that, please join me in thanking Pedro for this talk. PEDRO DOMINGOS: Thank you. [APPLAUSE]