MALE SPEAKER: This is my attempt to increase the sartorial quotient of Google, and it hasn't worked at all. On the other hand-- well, I noticed you have a coat on, that's true. Greg Chesson gets two points for showing up with a coat. It's a real pleasure to introduce Bruce Schatz to you. I've known Bruce for rather a long time. My first introduction to him came as we both began getting excited about digital libraries and the possibility of accumulating enormous amounts of information in digital form that could be worked on, manipulated, and processed through software that we hope would augment our brain power. So Bruce has been in the information game for longer than he's actually willing to admit, I suspect. He's currently at the University of Illinois at Urbana-Champaign. As you will remember, that's also the area where the National Center for Supercomputing Applications is located. Bruce was around at the time when Marc Andreessen was doing work on the first browsers, the Mosaic versions of the browsers derived from Tim Berners-Lee's work. Actually, the one thing that Bruce may not realize he gets credit for is teaching me how to pronounce Caenorhabditis elegans. I looked at it before and I couldn't figure it out, and maybe I didn't even say it right this time. But this is a tiny little worm that consists of about 1,000 cells. It was the first living organism whose genome we actually completely sequenced. Then we got interested in understanding how the genome actually expresses itself as this little worm develops from a single fertilized cell. So Bruce introduced me to the idea of collecting everything that was known about that particular organism and turning it into a database that one could manipulate and use in order to carry out research. Well, let me just explain a little bit more about his background and then turn this over to him, because you're here not to listen to his bio, but to listen to what he has to say. 
He's currently director of something called CANIS-- C-A-N-I-S. I thought it had to do with dogs until I re-read it. It stands for Community Architectures for Network Information Systems. BRUCE SCHATZ: That's why they let me in the building. MALE SPEAKER: I'm sorry? BRUCE SCHATZ: That's why they let me in the building. MALE SPEAKER: Because along with the other canines that are here. It's at the University of Illinois at Urbana-Champaign, and he's been working on federating all the world's knowledge, just like we are, by building pioneering research systems in industrial and academic settings. He's really done a lot of work over a period of 25 or 30 years in this domain. The title of the talk uses the term telesophy, which he introduced as a project at Bellcore in the 1980s. Later on, he worked at UIUC on something called DeLIver-- D-E-L-I-V-E-R-- and now more recently on semantics. That's the reason that I asked him to come here. He's working on something called BeeSpace, which is spelled B-E-E, as in the little buzzing organism. This is an attempt, as I understand it, but I'm going to learn more, to take a concept space and organize it in such a way that we can assist people in thinking through and understanding more deeply what we know about that particular organism. So this is a deep dive into a semantic problem. So I'm not going to bore you with any more biographical material, except to say that Bruce has about nine million slides to go through, so please set your modems at 50 gigabits per second, because he's going to have to go that fast to get through all of it. I've asked him to leave some time at the end for questions. I already have one queued up. So Bruce, with that rather quick introduction, let me thank you for coming out to join us at Google, and turn this over to you to teach us about semantics. BRUCE SCHATZ: Thank you. I have one here, so you can just turn yours off. Thank you. 
I was asked to give a talk about semantics, which I supposedly know something about. So this is going to be both a talk that's broad and deep at the same time, and it's going to try to do something big and grand, and also try to do something deep that you can take away with you. So that may mean that it fails completely and does none of those, or maybe it does all of those. I've actually been giving this talk for 25 years and-- now, of course, it doesn't work. Am I not pointing it in the right place? I'm pushing it but it's not going. Oh, there it goes. OK, sorry. Can you flip it back there? Sorry about that. Small technical difficulty, but the man behind the curtain is fixing it. So I gave this talk first more than 20 years ago in the hot Silicon Valley research lab that all the grad students wanted to go to, which was called Xerox PARC. I think a few people actually have heard of Xerox PARC. It sort of still exists now. We went down completely? There we go. Thank you very much. I was pushing this idea that you could federate and search through all the world's knowledge, and the uniform reaction was, boy, that would be great, but it's not possible. And I said, no, you're wrong. Here, I'll show you a system that searches across multiple sources and goes across networks, and does pictures and text and follows links, and I'll explain each piece about how it works. Then they said, that's great, but not in our lifetime. Well, 10 years later was Mosaic and the web. And 20 years later I'm delighted to be here, and all of you have actually done it. You've done all the world's knowledge to some degree. What I want to talk about is how far you are and what you need to do before you take over the rest of the world and I die, which is another 20 years. So what's going to happen in the next 20 years? The main thing I'm going to say is a lot's happened on tele, but not too much on sophy. 
So you're halfway to the hive mind, and since I'm working on honey bees, at the end you will see a picture of honey bees and hear something about hive minds, but it will be very short. Basically, if you look at Google's mission, the mission is doing a lot about access and organization of all the world's knowledge. Actually, to the degree that's possible, you do an excellent job at that. However, you do almost nothing about the next stages, which are usually called analysis and synthesis. Solving actual problems, looking at things in different places, combining stuff and sharing it. And that's because if you look at the graph of research over the years, we're sort of here, and you're doing commercially what was done in the research area about 10 years ago, but you're not doing this stuff yet. So the telesophy system was about here. Mosaic was about here. Those are the things-- searching across many sources, like what I showed-- that were really working pretty well in research labs with 1,000 people. They weren't working with 100 million. But if Google's going to survive 10 more years, you're going to have to do whatever research systems do here. So pay attention. This doesn't work with students. With students I have to say I'm going to fail you at the end. But you have a real reason, a monetary reason, and a moral reason to actually pay attention. So back to the outline. I'm going to talk about what are different ways to think about doing all the world's knowledge, and how to go through all the levels. I'm going to do all the levels and sort of say you are here, and then I'm going to concentrate on the next set of things that you haven't quite got to. The two particular things I'm going to talk about are scalable semantics and concept navigation, which probably don't mean anything to you now, but if I do my job right, 45 minutes-- actually, now 10 of them are up, so 35 minutes-- from now they will mean something. 
At the end I'm going to talk about, suppose you cared about this enough to do something, what kind of big thing would you actually do? I sort of do these big, one-of-a-kind pioneering projects with stuff that doesn't quite work, just to show it's really possible. So the overall goal-- you probably all grew up reading cyberspace novels-- is sort of plugging in your head and being one with all the world's knowledge. Trying to sort of get the concepts in your head to match whatever is actually out there in a way that you can get what you want. The problem is over time what the network can do has increased. So in the-- I can't say the old days, man-- in the good days, people worked on packets and tried to do data transmission. The era that I sort of worked mostly in was an object era where we try and give the information to people to do, [UNINTELLIGIBLE] to do pictures. All the action in big research labs now is on concepts, on trying to do deeper things, but still have it work like these two. They work everywhere. So you don't have a specialized AI program that only works for income taxes. That's not good enough. No Google person would ever do something that only works in one case, unless there was a huge amount of money behind it. I'll stop making money comments, but the food is great here. So this is one common layout, and there's four or five others, which in the absence of time, I will omit. But if you want to talk to me afterwards, there's lots of points of view about how to get from here to there, where there is always all the world's knowledge, and here is whatever you can do now. Depending on what point of view you take, it's possible to go to the next step differently because you have a different orientation. So the one that I'm going to do in this talk is the linguistic one, which usually goes syntax, structure, semantics, pragmatics. So syntax is what's actually there, like an actual set of bits in a file, a set of words in a document. 
Structure is the parts, not the wholes. So if you parse something into structure, you can tell that this particular thing is a person's name, this is the introduction to a paper, this is the methods part. You can tell what the parts are and you can search those differentially. Semantics is when you go inside and you try to get something about the meaning, and as you'll see, people have pretty much given up on doing real meaning, and rather than meaning, they pretty much try to do context. What's around it in a way that helps you understand it. Actually, when Google was a research project, and the people that started it were actually on the Stanford Digital Library Project-- I was running the Illinois Digital Library Project at the same time-- they said there's enough context in web links to be able to really do something. There were a lot of people that said no, web links are made for all sorts of things, and they don't have any semantics, and they're not useful at all. But obviously, they were wrong enough to make this building and employ all of you. The real goal is down here in doing actual reality, in doing so-called pragmatics. Pragmatics is sort of when you use something. So it's task dependent. The meaning of something is always the same. So if this is a gene that regulates cancer, it always does that. But lots of times, the task you're working on varies what you're interested in, what you know. I'm not going to say very much about pragmatics because people haven't gotten very far on it in terms of doing it at a big, grand scale. But I actually know quite a bit about it. If you really wanted to solve health care, for example, you'd have to go down the pragmatic route and try to measure people with as large a vector as you can possibly get. And again, if people are interested, that's a topic I'd be happy to talk about, but it's off this particular talk. This particular talk is about federation, as I said. 
So what does it mean to federate each one of those levels? To do syntax federation, which is what the telesophy system pioneered, and for the most part what Google does, in the sense of federating all the web sources that are crawled, you essentially send the same query into every different place. So true syntax federation, which is actually what telesophy did, but not really what Google does, is you start at your place and you go out to each one of the sources, and you have to remember where they are on the network. They might go up and down, and so you might have to retry them. And you have to know what syntax the queries need. And when the results come back, you have to know how to handle that. You have to do a lot about eliminating duplicates when the results come back. So a very common problem is you send out a query to try to get a certain Beatles song, and you get back 5,000 of them, but they're all slightly different, and they're in different languages and they have different syntax. Merging those all together is really complicated. So that's what syntax federation is. Structure federation is what DeLIver did-- the DLI, the Digital Library Initiative project that I ran at the University of Illinois. It was about engineering literature; it went out to 10 major scientific publisher sites on the fly and allowed you to do a structured query. So you could say find all the papers in physics journals within the last 10 years that mention nanostructures in the figure caption or in the conclusion. So you're using the parts of the papers, and scientists at least make a great deal of use of that. In order to do that, you have to figure out some way of making the mark-up uniform. So you have problems that you just started to see in the syntactic world, like who's an author? If you have a physics paper that has 100 authors, which one of them is the author? 
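The syntax federation loop described here can be sketched in a few lines. This is a hedged illustration, not the telesophy implementation: the source definitions, the per-source query translators, and the de-duplication key (normalized titles) are all invented for the example.

```python
# Sketch of syntax federation: send one query to every source, each in its
# own native syntax, then merge the results and eliminate duplicates.
# Sources, record fields, and the dedup rule are toy stand-ins.

def federate(query, sources):
    """Fan a query out to every source and merge de-duplicated results."""
    results = []
    for source in sources:
        native = source["translate"](query)       # rewrite into the source's syntax
        results.extend(source["search"](native))  # each source returns records
    # De-duplicate: treat records as the same if their normalized titles match.
    seen, merged = set(), []
    for rec in results:
        key = "".join(rec["title"].lower().split())
        if key not in seen:
            seen.add(key)
            merged.append(rec)
    return merged

# Two toy sources with different query syntaxes returning overlapping hits.
source_a = {
    "translate": lambda q: f"TITLE={q}",
    "search": lambda nq: [{"title": "Yesterday"}, {"title": "Help!"}],
}
source_b = {
    "translate": lambda q: f"ti:({q})",
    "search": lambda nq: [{"title": "YESTERDAY"}, {"title": "Let It Be"}],
}

hits = federate("beatles", [source_a, source_b])
```

In a real federator the hard parts are exactly the ones the talk lists: tracking which sources are up, retrying, and a far smarter duplicate merge than title matching.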
It might not be any of them actually, it might be the organization that did it. Or if you have a movie, who's the author of a movie? Is it the producer, the writer, the star, the director? So there's a lot of problems there in how you do the mark-up uniformly and how you make different values the same. For the most part, structure has not made it into mass systems yet, although there have been a lot of attempts to try to make languages for structure like the semantic web that Vint and I were talking about beforehand. But the amount of correctly marked-up structured text is very small right now. So if you were going to use it to search the 10 billion items that you can crawl on the web now, you wouldn't get very far. Semantics federation, which is what I'm going to talk about today mostly, is a completely different topic. It's about going inside and actually looking at the phrases and figuring out the meaning, as much of the meaning as you can. And then when you have many small pieces, trying to match something that's the same here to something the same there. And doing that uniformly is the job of semantics federation. So let me now go into the first of the two technical topics. The first topic I'm going to do is how do you actually represent the things, and that's going to be a little slow going. Then I'm going to give some examples of, if you're able to get this deeper level representation, this deeper level structuring, what kind of system you can build. It's in a somewhat specialized domain. It's in biology and medicine, because, well, if you're a professor and you work at a university, that's where you can get money to work on things. You can't get money to work on the kind of things that are arbitrarily on the web. So we're now into scalable semantics. 
I've been using this term for 10 years, and every once in a while someone will stand up and say that's an oxymoron, it doesn't make sense, because semantics means really deep, and scalable means really broad, and those pull in opposite directions. And I say, yes, you understood what the problem is. So in the old days, what semantics used to mean is you do deep meaning. So you had a deep structure parser that would go in and figure out, yes, this document is on operating systems that only work on this class of computers, and only solve this class of physics problem. So it's on a very narrow, detailed topic. There were many, many AI systems made that did that. What happened when the government started putting large amounts of money into it-- so most of this, the base technology, got developed in the DARPA TREC program, trying to read newspaper articles looking for what would now be called terrorists. What they found basically is the deep programs were very narrow. If you trained something to recognize income taxes, or you trained something to recognize high-powered rifles, it wouldn't help at all on the next one. And there were just too many individual topics to try to pick out the individual types of sentences and individual slots. So what happened is the broad ones beat out the deep ones when the machines got really fast. When it became clear-- and I'll show you some machine curves-- that you could parse noun phrases arbitrarily, then people began using noun phrases. When it became clear you could do what are called entities-- in other words, you could say this phrase is actually a person, this phrase is actually someone that lives in California-- then people started using them. 
Basically what happened is semantics changed from being, we know everything about this particular topic and this phrase means one thing, it's meaning type 869, to, we have 20 kinds of entities, and this is a gene, and it occurs with this other gene. So we'll say, if you search for this gene and it doesn't work, you should search for this other one. I'll show you lots of cases where that sort of guilt by association really helps. I'm not defending it necessarily as being real semantics, I'm defending it as something that you can do everywhere. So the upshot is this is an engineering problem. It's a question of, if you could do deep parsing and say, yes, it's true this person said they were interested in ice cream cones, but they really meant pine cones when they said cone, then you would do that. But it's generally not possible to do that, except in very isolated circumstances. So you end up thinking globally, thinking about all possible knowledge, but acting locally. I guess this is a green building so I'm allowed to make this kind of joke. So you look at a small, narrow collection, and analyze the context, what occurs with what, very precisely, and do something there. And that creates one good situation. In other words, it means now you're able to go much deeper, and I'll show you lots of examples of going much deeper. But it creates one bad situation, which is that traditionally information retrieval works like Dialog did in my era, or like Google does now. You take everything you can get, and pile it into one big huge server farm, and then you search it. You index it once in one big index and you search it. Well, the problem is if you want to go deeper in semantics that doesn't work, because you've mixed together too many things. You have to unmix them, and then you have to worry about how to get from here to there. So you change a central problem into a distributed problem, with all of the hard features that go with distribution. 
Here's what this means in terms of a physical analogy. For many years I taught at a library school. The way indexes work in the real world is, for really big topics like electrical engineering, there's a society that is big enough and well-defined enough to employ people to tag every topic. So they say, here's an article about Windows, this one is about operating systems. Here's an article about windows, this one is about heat conservation. A person is looking at that, and out of their selection of all the topics, they say which topics the things are on. That worked fine as long as most of the information in the world was in this large, but fairly small, number of well-defined databases. That's not the world we're living in now. We're mostly living in this world. So there still are a very large number of big formal databases that are done by hand, but nearly all the databases, nearly all the collections, are these informal ones from communities or groups or individuals. The advance of crawling technology that's been able to take all these and collect them all together into one big place has actually made the problem worse, because now there's not only apples and oranges and pears all together, but there's lots of things that aren't fruit at all and aren't really anything, but they're in there. So there's many different things that you don't know how to deal with, and you have to do something automatically with them. It's not the case that you can get-- my daughter keeps track of all the cats on the block and has a website with their pictures-- it's not the case that you can get her to employ a professional curator from the library school who will tag those correctly so that someone who's a cat fancier in the next town can see them. That's not true. You need some kind of automatic support. So I'm going to talk about the automatic support. I'm doing OK for time. There's two things. I'm going to talk about entities and I'm going to talk about concepts. 
So here are entities. What entities are is trying to figure out what type of thing something is. So one way is you have hand-tagged XML, like the mark-up in the semantic web. So they take a particular domain and they say there are 20 types here, and we'll mark up each document correctly. If we're in the humanities that might work pretty well. This is a person, this is a place, this is a type of vase, this is a time period in Roman history. If you're out on the web, in that situation where 90% of the stuff is informal, then even if there was a systematic set of types, the people aren't going to do it. So if you have well marked-up hand ones you're going to use them, but if you don't, then you have to do something automatic. The thing that tends to work automatically is to try to tag things by machine with training sets, and I'm going to say a little bit about what that means. First you go into the document and you pull out the phrases. So you don't do whole words. And in fact, over time the experimental systems I've built have gotten better the more you can get away from words and change them into whole phrases, the equivalent phrases that work in that particular domain. Right now search engines don't do that. That's a big part of the problem. Then you have to recognize the part of speech. Is it a noun or a verb or an adjective? Again, 10 years ago, you needed a specialized grammar and it only worked in a particular subject. Now there are machine learning algorithms trained on enough things that you can get very high accuracy, up in the high 90s, with parts of speech. And in fact, there are actually systems-- and you can tell this was secretly funded by the CIA under some other name-- that recognize persons, places, and things pretty accurately. So if you want to take newspaper articles and automatically tag these correctly, they actually do a pretty good job. Again, commercial search engines tend not to use those. 
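The phrase-then-tag pipeline described here can be sketched with a toy chunker. This is only an illustration of the idea: the part-of-speech table and the gene lexicon below are invented stand-ins for the trained models and training sets the talk refers to.

```python
# Toy pipeline: pull whole phrases out of text (not words), use parts of
# speech to find them, then label entities against a lexicon. POS table
# and GENE_LEXICON are hypothetical stand-ins for trained components.

POS = {"the": "DET", "a": "DET", "foraging": "ADJ", "cyclic": "ADJ",
       "gene": "NOUN", "protein": "NOUN", "kinase": "NOUN",
       "encodes": "VERB"}

GENE_LEXICON = {"the foraging gene"}

def noun_phrases(sentence):
    """Greedy chunker: collect runs of DET/ADJ/NOUN tokens that contain a noun."""
    phrases, current = [], []
    for tok in sentence.lower().rstrip(".").split():
        if POS.get(tok) in ("DET", "ADJ", "NOUN"):
            current.append(tok)
        else:
            if any(POS.get(t) == "NOUN" for t in current):
                phrases.append(" ".join(current))
            current = []
    if any(POS.get(t) == "NOUN" for t in current):
        phrases.append(" ".join(current))
    return phrases

def tag_entities(sentence):
    """Label each extracted phrase as GENE if the lexicon knows it."""
    return [(p, "GENE" if p in GENE_LEXICON else "PHRASE")
            for p in noun_phrases(sentence)]

ents = tag_entities("The foraging gene encodes a cyclic protein kinase.")
```

Real taggers replace the lookup tables with statistical models trained per domain, which is exactly why the talk stresses training sets.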
So here's an example of entities in biology. These won't mean very much, but they'll give you the feeling. Here's a kind of functional phrase. A gene is a type of an entity, and it encodes a chemical. So here's an example. The foraging gene encodes a cyclic GMP protein kinase. So this is one of the entities and this is the other entity. In scientific language things are very regularized, so there's lots of sentences that are actually that easy. Or here's another one. Chemical causes behaviors. Here's one that's a little harder. I tried to include one that's a little harder. This one says gene regulates behavior, but that's not in the sentence. What's actually in the sentence is this gene, which is an ortholog of this other gene-- so it doesn't say directly, it says indirectly that it's a gene-- is involved in the regulation, which is not the same phrase as regulates. So you have to do a little bit of parsing to get a phrase like gene regulates behaviors. But the natural language technology is now good enough to do that accurately. I did do a little bit of prep and looked at some of the commercial systems that were doing this. If you want to ask a question about those later I'll make a comment. But they're all competitors, they're not Google, so I didn't want to name them up front. The last comment I'm going to make about entities is they come in different varieties. That means that sometimes you'll do them and sometimes you won't. So there's some of them-- and again, these are biology examples-- that are just straight lists, so the names of organisms, like honey bee or fruit fly, are almost always exactly those same words. So those are easy entities to tag very accurately. Things like genes or parts of the body vary somewhat, but there often are tag phrases that say this is the part of a body and here it is. It's a wing. Or this is a gene and it's the foraging gene. So there are often tags there. If you get training sets you do pretty well. 
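Because scientific prose is this regularized, even surface patterns can recover relations like "gene encodes chemical", including the indirect "is involved in the regulation of" phrasing, which gets normalized to a canonical relation. The patterns below are invented illustrations, not the system's actual grammar.

```python
import re

# Sketch of pattern-based relation extraction over regularized sentences.
# Both patterns and the normalization to "regulates" are illustrative.

PATTERNS = [
    # "The X gene encodes (a/an) Y."  ->  (X, encodes, Y)
    (re.compile(r"The (\w+) gene encodes (?:a |an )?([\w\- ]+)\."), "encodes"),
    # "X, an ortholog of ..., is involved in the regulation of Y"
    # The indirect phrasing is normalized to the canonical relation.
    (re.compile(r"(\w+), .*ortholog.*, is involved in the regulation of ([\w\- ]+)"),
     "regulates"),
]

def extract_relations(sentence):
    """Return (entity, relation, entity) triples found by any pattern."""
    triples = []
    for pattern, relation in PATTERNS:
        m = pattern.search(sentence)
        if m:
            triples.append((m.group(1), relation, m.group(2).strip()))
    return triples

easy = extract_relations("The foraging gene encodes a cyclic GMP protein kinase.")
hard = extract_relations(
    "Amfor, an ortholog of the fly foraging gene, "
    "is involved in the regulation of foraging behavior.")
```

Modern extractors use parsers and learned models rather than regexes, but the principle of normalizing many surface forms into one relation is the same.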
Then there's really hard things-- these are sort of functional phrases-- like what kind of behavior is the honey bee doing? What kind of function does the computer operate with? Those ones are almost always different, so you need a really big training set to do those accurately. If you were going to try to do entities across all the world's knowledge, you would have two problems. I think that's the last thing I'm going to say on this, yes. The first is you would have to try to make a run at the hard ones, or at least say, well, we're only going to do these because that's all we can do uniformly. The second thing is you have to realize that the entities are different in each major subject area. So the biology ones are not the same as the medicine ones, which are more disease-like, and the medicine ones aren't the same as the physics ones, and the physics ones aren't the same as the grocery store ones. My guess is there's a relatively limited number of popular ones, if you're back in the style of trying to classify all the web's knowledge, like Yahoo!-- that used to be Yahoo!'s main strategy, for instance-- that there are a couple hundred really important ones and a couple thousand big ones. So if you had enough money and enough expert teams, and set each one up to making training sets, you could actually do entities all the way across. A research project can't muster that except in one small area. That's all I'm going to say about entities. Now, let me explain just a little bit about what you do with entities, and then give a big example. So with entities, you might think you're going to answer questions with them, and that's what the commercial systems are doing. You can sort of answer questions, so you can say this gene seems to affect this behavior in this organism. So you can say, what are all the things that affect foraging in insects, and get out lots of answers-- it's sort of like you have a relational table. 
You take a document and change it into a relational database. You can answer that kind of question, but there's lots of kinds of questions you can't answer. What you can do, after you extract these entities, these units, is compute these context graphs. You can see in this document how often these two things occur together. That one you get a lot of mileage from, because if you try to search for this one and you can't find it, you can search for this other one. Or if you're trying to search for this one and you can't find it, you can go down the list of the ones it commonly occurs with, and it's sort of a suggestion facility. People that watch search in libraries, what they typically comment on is people don't know what words to try. They'll try all the words they can think of, and then they'll start searching dictionaries or looking at other papers or asking the people next to them. So since you can automatically do suggestion by making this graph of all the entities that are related to all the other entities, in terms of how often they occur together in a collection, you can use it for suggestion. This is my computer engineering slide. The other unusual feature about Google that people didn't predict is, could you build a big enough supercomputer to handle 10 billion items? And Dialog would have said no, because IBM will not sell you that many platters and that big a thing. Well, what they didn't realize was what the rise of PCs would do, if you hook things together and you can partition the problem enough. The research people hit that same curve a decade earlier. So I was trying to do these relations-- this is my six or seven year history. These are all how big a collection you can do, finding these entities and these relations, basically on workstations. So this is like a Sun-2, and that's a Sun-3, and this is a network of Sun-3's, about 10 of them. This one is on the supercomputers at NCSA, where you could get 1,000 all at one time. 
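The co-occurrence-based suggestion facility is simple to sketch: count how often entity phrases appear together in the same document, then rank a term's most frequent neighbors. The toy "collection" below is invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Build the context graph: co-occurrence counts over entity pairs per
# document, then suggest the most frequent neighbors of a query term.

def cooccurrence(docs):
    """docs: iterable of per-document entity sets -> pair -> count."""
    counts = Counter()
    for entities in docs:
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

def suggest(term, counts, k=3):
    """Rank the terms that co-occur with `term` most often."""
    neighbors = Counter()
    for (a, b), n in counts.items():
        if a == term:
            neighbors[b] += n
        elif b == term:
            neighbors[a] += n
    return [t for t, _ in neighbors.most_common(k)]

docs = [  # toy collection of extracted entity sets
    {"foraging gene", "honey bee", "division of labor"},
    {"foraging gene", "honey bee"},
    {"division of labor", "honey bee"},
]
counts = cooccurrence(docs)
```

This is the "guilt by association" mechanism: no claim about meaning, just statistics of what occurs with what in a narrow collection.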
That made a big difference, and it meant-- in fact, this was a big hero experiment. It was the first supercomputer computation in information retrieval. For quite a while, it was the biggest computation that NCSA had ever done. They couldn't figure out why you'd want to integrate all the world's knowledge. Why would anybody want to do that? I think in 1998, Google was probably about 10 employees. So that question hadn't come up yet. The number of articles in Medline was still much greater than the number of articles on the web. So here's what that computation was like. It had about 280 million concepts, so that number was big then. It's now small. However, if you fast forward to today, the machines are a lot faster, so the server I just bought for $25,000 has more memory than that supercomputer eight years ago. These are big memory computations. You can guess it's got a big matrix inside that has all the phrases versus all the phrases and how often they occur together. So the more physical RAM you have the better. What it turns out is, you're able to put a connection graph-- a graph of which terms are related to which other terms-- all in memory. And it's a nice graph, like a small-world graph, which looks kind of like this. So there's a group here that's all sort of connected, and another group here. So it comes in groups. That tends to be true of just about any kind of text that people have seen. Then you can find all the inter-relations really fast, because you don't have to look at this one versus this one; you know that you can stop here. So there's a way of ending the propagation. What that means is you can now do things on the fly while you wait. So you don't have to pre-compute the collections anymore, which is what we had to do before. You can do a search, make a new collection, and then make it semantic. Then make it deep. You can cluster it on the fly into little chunks, you can find these inter-related graphs while you wait. 
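The "ending the propagation" idea can be illustrated with a bounded breadth-first walk over an in-memory term graph: because related terms cluster, a walk of a few hops finds the relevant neighborhood without touching the rest of the graph. The adjacency structure below is a toy example, not the system's data.

```python
from collections import deque

# Bounded propagation on an in-memory term graph: explore at most
# `max_hops` hops from the start term, so distant clusters are never visited.

def related(graph, start, max_hops=2):
    seen = {start}
    frontier = deque([(start, 0)])
    found = []
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # stop propagating past the local cluster
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                frontier.append((nxt, hops + 1))
    return found

graph = {  # two loosely connected clusters joined through "c"-"d"
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d", "f"], "f": ["e"],
}
nearby = related(graph, "a")
```

With the whole graph in RAM, this kind of bounded walk is what makes on-the-fly clustering and suggestion fast enough to do "while you wait."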
And that's while you wait with a $25,000 server. If you had something the size of Google, you could not only do all the world's knowledge, which isn't that much proportionally bigger, but you could also go deeper. So now I want to show you what a real concepts-based system looks like, so that you get some feeling as to how different the interaction is. Generally, there's two things in a space system. One of them is called federation-- I've been talking about that before. It's how do you go from one collection to another. The other is called integration. It's, if you have an entity, what can you go out to. We're talking about going across collections, and I didn't mean to say this was replacing IP. IP is under everything. But what I meant is it's replacing words. This was the first interspace system, the one DARPA paid for. There aren't words in this anymore. When you point to this, you get that whole phrase, simple analgesics, and all the things that are equivalent to it phrase-wise after you do all the linguistic parsing. So it looks like a bunch of words, but it isn't. It's a bunch of inter-related concepts, and those are uniformly indexed across all the sources. So you can go from simple analgesics here to all the concepts, all the phrases that it's nearby, to the ones that are nearby there, to the documents, to which little cluster it's in. You can sort of go from concept to concept, from concept across all the different sources. The main reason I showed this was just to show that words don't exist anymore. You've got these deeper level things, which I tried to convince you earlier was possible to do. Also because this DARPA project broke up in 2000, just before 9/11; DARPA yanked the plug, and they didn't want to help analysts anymore. Every person on the project went to work at Microsoft. So it's entirely possible that Windows 2010 is going to have all this stuff in it. Yes, question? AUDIENCE: That's the [INAUDIBLE]. 
BRUCE SCHATZ: Which one, this one? AUDIENCE: The hexagon, [INAUDIBLE]. BRUCE SCHATZ: It is actually, and let me postpone that because I'm going to answer it better later. But basically, it's taking the document collection and grouping it into individual groups of documents which have a similar set of phrases in them. This is just a bad graphical representation of it. But I'll give a good example of it later. So yes, it's quite meaningful, and you'll see in the session what its utility is. So what I'm actually going to talk about now, in the last five minutes of my talk, is this BeeSpace system. It is about honey bees, so you're allowed to have cute pictures and cute puns. Yeah, see? The college students don't ever laugh at this, but I always thought it was funny, bee-havior. So much for that. It must be an age thing. So the point of this system is you make many, many small collections, and you know something, and you want to use your terminology and your knowledge to go somewhere else. So you want to go from molecular biology into bees, into flies, into neuroscience. I'm working with the person who is actually the national lead on the honey bee genome. What it does inside is basically use this scalable semantics technology to create and merge spaces-- you'll hear a lot about spaces in the next five minutes, so I won't explain right now-- to try to find stuff. So it's complete navigation, complete abstraction, but finding things when you don't know what you started with. Space is a paradigm, not a metaphor. I hope I'm not offending any user interface people. I'm not sure if Dan is still sitting in the back. In other words, there really are spaces in there. You take a collection, you make it into a space. You can then merge two of them, you can pull out part of it, you can break it into parts and make one part of that the whole space. So it's like you have all the world's knowledge and you're breaking it into conceptual spaces which you can manipulate.
You personally, plus you can share them with other people. So it has quite a different character than doing a search and getting a set of results back. This particular one does do entities uniformly, but it only does concepts and genes, because that's all this subject area needed. So please don't criticize that particular one. It was chosen narrowly because we wanted to at least have one that did those uniformly. These are the main operations I'm now going to show you very quickly through a session. If you go to the BeeSpace site, which was on that bag-- it's beespace.uiuc.edu-- you can use the system and beat it to death, assuming you can read Medline articles, which you may or may not be able to. So extract is going to take a space, figure out all the special terms that distinguish that space, and give you a way of searching. Mapping is going to go back the other way. It's going to take a space and break it into parts, and then you can turn each one into a space itself. This is space algebra, and this is the summarization. If you find an entity, it does something with it. You probably don't care about this example, but it's looking at behavioral maturation. It's looking at a honey bee as it grows up-- it takes on different societal roles. It takes care of the babies, it goes out and forages for food-- and looking at that across different species. So it's a complicated question. It's not one that there's a well-defined answer to. So now we're into the BeeSpace system, which is running right now. You type behavioral maturation, you choose a particular space that was already made-- it's insects, it's about 100,000 articles-- and you do browse. That gets about 7,000 articles, which are here, which is too much to look at. The problem was behavioral maturation wasn't the right term. The first thing the system's doing is extracting. It tries to go in and analyze the terms, the phrases, and get out a more detailed set. So that's issuing extract.
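As a rough illustration of what an extract operation could do under the hood, here is a minimal sketch that scores terms by how strongly they distinguish a space from a background collection. The corpora, the smoothing, and the log-likelihood-style score are my assumptions for the sketch, not the system's actual algorithm:

```python
import math
from collections import Counter

def discriminating_terms(space_docs, background_docs, k=3):
    """Rank terms that distinguish a space from a background corpus:
    a rough stand-in for the system's 'extract' operation."""
    space = Counter(w for d in space_docs for w in d.split())
    back = Counter(w for d in background_docs for w in d.split())
    n_s, n_b = sum(space.values()), sum(back.values())
    scores = {}
    for w, f in space.items():
        p_s = f / n_s                            # rate inside the space
        p_b = (back[w] + 1) / (n_b + len(back))  # smoothed background rate
        scores[w] = f * math.log(p_s / p_b)      # log-likelihood-style score
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])][:k]

# Hypothetical titles standing in for a small space and its background.
space = ["forager bee behavior", "bee forager juvenile hormone",
         "forager bee colony behavior"]
background = ["gene expression in yeast", "yeast cell cycle",
              "protein folding pathways", "bee sting allergy"]
print(discriminating_terms(space, background))
```

As in the demo, the raw output usually needs a little human editing before you browse with it again.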
It automatically pulls out the most discriminating terms in that collection, and you usually have to edit it a little. That's what I did here. Then you can take those back and browse again. It's not working. Oh, did it go? Yeah, I'm sorry. There it is. You got more items, 22,000. AUDIENCE: That's not necessarily good if you were trying to narrow it down. BRUCE SCHATZ: The problem was you narrowed it down too much. You didn't actually get the articles about behavioral maturation, because a lot of them didn't say it. What you want to get is all of the things that might be interesting and then narrow it down. So that first one was trying to expand it a little bigger. It was doing sort of a semantic version of query expansion. Now the problem is this one is too many to actually look through, and now I'm going to go back the other way and sort of answer the question that was asked before. So this is automatically taking that collection, and while you wait, it's breaking it into a number of different regions. Here it's about 20-- some of them are off the page. And the regions tend to be sort of these small-worlds regions that tend to be tightly on the same topic. It's kind of hard to describe what the topics are, because they're automatic-- they're not based on some well-defined term-- but they tend to cluster together well. The thing to notice is this was done while you wait, even with this small server. So the collection was made on the fly, and this mapping was done on the fly. This is an interactive operation. The pre-computation wasn't about this. You didn't have to can this. So we take this particular region, and now we're going to operate on this-- that's just that one cluster. Now we're going to save it. And see, now it's a fully fledged space just like all the previous ones. So we were sort of navigating through space, we found this little collection we want, and now we're making it into a space ourselves.
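A minimal sketch of breaking a collection into regions on the fly, assuming a simple greedy grouping by term overlap. The documents, the Jaccard measure, and the threshold are made up for illustration; the real system's clustering is certainly more sophisticated:

```python
def jaccard(a, b):
    """Overlap of two term sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def map_into_regions(docs, threshold=0.25):
    """Greedy grouping of documents whose term sets overlap:
    a toy stand-in for the on-the-fly 'map' operation."""
    regions = []  # each region: (accumulated term set, list of doc indices)
    for i, doc in enumerate(docs):
        terms = set(doc.split())
        for rep, members in regions:
            if jaccard(terms, rep) >= threshold:
                members.append(i)
                rep |= terms  # grow the region's representative term set
                break
        else:
            regions.append((terms, [i]))
    return [members for _, members in regions]

# Hypothetical titles: two about bee foraging, two about fly wings.
docs = ["bee forager behavior", "forager bee hormone",
        "fly wing development", "wing development gene"]
print(map_into_regions(docs))
```

A single pass like this is cheap enough to run interactively, which is the property the demo is emphasizing.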
This one is now well-defined-- it's about behavioral maturation in a large space about insects-- and we wanted to look at multiple organisms. So now we're going to start doing space algebra. We're going to start taking this and merging it with other things. So here I took the new space I just made and I'm intersecting it with an old space. Currently, that's just finding the documents in common, but we're now working on fancier data mining to try to find other patterns. So here are the 21 that have that feature. If you look at this article, this article is, in fact, about some basic receptor in Drosophilidae, which is the fruit fly, an insect, but it's about-- well, I'm sorry, it's not fishes. Marine crustaceans are like lobsters. But it's something that lives in the sea. Since you now found something at the intersection of those two, what you really wanted to do was describe the genes. Here you can point to this gene. This was entity-recognized automatically, in green, and we tried to summarize it. So here it's summarized. You can see the summary parts. However, the problem is this particular intersected space has hardly any documents in it. So there's not very much to summarize. You did get the right gene, but you didn't summarize it against the useful space. What you want to do is switch this term over into this other space, into the Drosophilidae space, which has like 50,000 articles, and then summarize it again. So here's an article that has it in it. This one, you can see, has more entities automatically selected. Then here's the gene summary against that space, again done on the fly. So this is a general summary facility: if you have an entity and you have a space-- a specific term and a collection-- you want to see what's known about it in that collection. This is a type of summary you can do while you wait. It's a scalable one. You can break it into well-known categories. You can rank-order the sentences in those particular categories.
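The space algebra being demonstrated can be sketched as set operations over document identifiers: intersection is just the documents in common, exactly as the talk says the current system does it. The class name and the PMID-style identifiers are illustrative assumptions:

```python
class Space:
    """A named collection of documents supporting simple space algebra."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = set(docs)

    def intersect(self, other):
        # Currently just the documents in common, as in the talk.
        return Space(f"{self.name} & {other.name}", self.docs & other.docs)

    def merge(self, other):
        return Space(f"{self.name} | {other.name}", self.docs | other.docs)

# Hypothetical spaces with made-up PubMed-style identifiers.
maturation = Space("behavioral maturation", {"pmid:101", "pmid:102", "pmid:103"})
drosophila = Space("Drosophilidae", {"pmid:102", "pmid:103", "pmid:500"})

common = maturation.intersect(drosophila)
print(common.name, sorted(common.docs))
```

Because the result is itself a Space, it can be saved, re-mapped, or intersected again, which is what makes the algebra composable.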
This is kind of like a news summary, but it's a semantic type of news summary. Then if you went into that, you would see that there are lots of entities recognized here. All the things in green were done automatically, and if you pointed to these, you would then summarize those in this space, or you can go off to another one and summarize it there. So I want to just say each one of these main features was done dynamically, on new collections, while you wait. You can basically expand the searches, you can take a search that's too big and break it into pieces, you can make a new space and do algebra-- do intersection-- on it, or if you find a particular entity, you can summarize it in different ways. Those are examples of the kinds of things that you can all do automatically. So the message is these are all general, and if you have to do biology, you sort of work up towards the interspace, where you're intersecting all the spaces using these sets of operations, by doing birds and bees and pigs and cows and brains and behavior. These are actually all projects I'm working on. I work in the genome center where this project is going on. So it is a birds-and-bees and pigs-and-cows project in some respect. Let me now conclude, to allow some time for questions, by just saying this is actually quite a different world. It's not pile all the world's knowledge in one big place. It's have many small little ones, including ones that are sort of dynamic communities that are made on the fly. And because of that, every person that's doing it is actually doing just about everything there-- indexing it, using the system, making new collections, authoring materials themselves. And the system itself could be occurring all in one big server. And ours, of course, does. But it could also occur in many small places. It's a very small, localized kind of system.
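The summarize-an-entity-against-a-space operation can be sketched as grouping the sentences that mention the entity into well-known categories and rank-ordering them within each category. The category keywords, the overlap scoring, and the sentences are all invented for this sketch; the real system's categories are learned from the literature, not hard-coded:

```python
# Hypothetical category keywords; the real system's categories are trained.
CATEGORIES = {
    "expression": {"expressed", "expression", "mrna"},
    "behavior": {"forager", "behavior", "maturation"},
}

def summarize_entity(entity, space_sentences, per_category=1):
    """Group sentences mentioning an entity by category, then keep the
    top-ranked sentence(s) per category, scored by keyword overlap."""
    summary = {}
    for cat, keys in CATEGORIES.items():
        scored = []
        for s in space_sentences:
            words = set(s.lower().split())
            if entity in words:
                scored.append((len(words & keys), s))
        scored.sort(reverse=True)
        summary[cat] = [s for score, s in scored[:per_category] if score > 0]
    return summary

# Made-up sentences standing in for a space's documents.
sentences = [
    "Amfor expression increases in forager bees",
    "Amfor is expressed in the mushroom bodies",
    "Colony size affects forager recruitment",
]
print(summarize_entity("amfor", sentences))
```

Summarizing the same entity against a different space just means passing a different sentence collection, which mirrors switching the gene into the Drosophilidae space in the demo.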
My guess is if you had to do this on 10 trillion, which is what's going to be true in a decade on the web, then you wouldn't have four or five big servers that cover the world. What you'd have is a server at the end of every block, or you'd have a hierarchy like the telephone network used to have, where you'd have servers that actually handled each set of spaces that they were doing live manipulation against. It's quite a different world. It's much more like the virtual worlds that the kids today wander around. Maybe you all are a little bit too old to spend all your time on Neopets or even on Second Life. So I promised I would end with what's a grand project you could do. One grand project you could do is take some set of people-- a university is very convenient because you can force undergraduates to do just about anything. If you're at the University of Illinois, there are 35,000 of them. There's quite a few of them. There are a few less because some of them came here. And you capture all the text-- the library and the courses, actually, where our library has just gone into the Google Books program-- and all the context, which tries to do the relationships, partially live with this kind of system, and partially by-- well, I guess this is actually OK to say, but if you gave everyone at the University of Illinois free Gmail and a free Gphone-- I guess the Gphone isn't announced yet, but there's lots of rumors on the web that there will be one. Anyway, if you gave everybody a free email and phone and said, with the proviso that we're going to capture everything you ever do and we're going to use it for good purposes-- not selling you ads, but trying to relate things together to help you understand the context-- then the university would be delighted, because they'd like to educate not just the people on campus but people all over the world, and make money charging them tuition.
People at Google might be delighted because normally you couldn't do this experiment-- you would get sued out of existence, even with your lawyers, I would guess, if you tried to surreptitiously capture all the voice that was coming out of the Gphone. That's not proposed, is it? I've had people tell me the University of Illinois might refuse to have it done. But if the undergrad takes it, that's the deal, right? They take it as long as we're going to record everything. You might really be able to build a semantically-based social network, so you're not sharing a YouTube video because it's got the same little tag on top of it, but by some real, deep, scalable semantics underneath. So that's all I have to say, and I did promise I would put some bees at the end. So someday we will do hive mind, and it probably will be in your guys' lifetime, but not in mine. That's all I have to say. Thank you. [APPLAUSE] Question, yes? AUDIENCE: I was wondering-- [SIDE CONVERSATION] AUDIENCE: I was wondering, could you use the semantic relationships that you've built up to debug the language itself? In other words, create some kind of metric that detects whether the description or the expression of a particular concept is coherent or incoherent, and essentially flag places where the terminology is insufficiently expressive. BRUCE SCHATZ: Could you hear the question, or should I repeat it? OK. The question was, can you regularize the language, since you're now detecting all these patterns? That's actually been done quite a bit with tagging, to quite a large degree of success. The reason that our digital library project succeeded and the one at Elsevier, which is a big publisher, failed, is we had a set of programs that went through and automatically cleaned up the tagging-- the structure tagging-- that was coming back from the publishers that the authors had provided, and then sent corrective information to the authors telling them what they should have done.
But the things that went into our system were the cleaned-up ones. It's what data mining people call cleaning the data. It is true that the more regular things are, the better they work, so if you tried to do a chat session, like IM text messaging, it would work much worse than it did with the biology literature, which is much more regularized. The general experience with these kinds of systems is that people are much better at hitting the mark than computers are at handling variability. So it's kind of like those handwriting recognizers that you learned how to write [UNINTELLIGIBLE]. So my guess is that yes, the users are trainable. And if I tried to do this with undergrads, I would certainly do things like fail people that got in too many-- you know, it's like if your programs don't parse correctly then you don't get a passing grade. It's a problem, though. The more regular the world is, the better this brand of semantics does. Is there another question? Yes. AUDIENCE: I will start with a simple practical question. When I go to PubMed and ask for references including phytic acid, it knows that phytic acid is inositol hexakisphosphate. Is there any automation in that process, or is that just a laborious transcription process on the part of a human being? BRUCE SCHATZ: OK, if you're asking what PubMed does, the answer is they have a big translation table with all those wired in. It's because they're a large organization with a lot of librarians. They're actually able to provide a large set of common synonyms for things. If you have an automatic system, it can't do that. Well, actually, ours is sort of a hybrid system. Ours actually uses synonyms like that, that PubMed has, as a boost to finding equivalent ones. If you're not able to do that, there's a whole set of linguistic processing that tries to find things that are synonyms, to different degrees of success. It looks for things that are in the same slots in sentences.
It looks for equivalent sentences that had different subjects used the same way. It uses the ways that acronym expansion is commonly done. There's a set of heuristics that work some of the time-- maybe two-thirds of the time in regularized text like this. But they're not perfect in the way-- the ones you're seeing are all human-generated, and that's why they're so good. You would always use human-generated ones if you could, and in fact, it's very likely-- when I give a more popular version of this kind of talk, what people point out is, even though the kids on the block that maintain the cat-- the one about cats, you know, the small, specialized collections-- even though they're not willing or probably not able to do semantic mark-up, they are able to do lots of other creation. They are able to show typical sentences, they are able to do synonyms. And there may be a lot of value added that comes in at the bottom that improves each one of these community collections. I expect that will be a big, big area when it becomes a big commercial thing. You'll need to have the users helping you to provide better information, better context. Yes, Greg? AUDIENCE: Remember-- don't go all the way back-- I remember the slide about functional phrases, and it seemed that the three examples that were on that slide were of the form, something I might call a template predicate. In other words, A template relates to B. You seem to be saying that the system automatically derives those templates from analyzing the text. Is that correct? BRUCE SCHATZ: That is correct. AUDIENCE: So my question then is this. Can you compare and contrast that technique of producing templates to two other things. Number one, the system that the Cyc guys did to make-- BRUCE SCHATZ: [INAUDIBLE]. AUDIENCE: --to make predicates, but starting from a different point and ending in a different point, although they have predicates. That's comparison number one.
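One of the heuristics just mentioned-- finding terms that occur in the same slots in sentences-- can be sketched like this. The sentences and the simple (previous-word, next-word) slot definition are my assumptions; real distributional methods use richer contexts and statistics:

```python
from collections import defaultdict, Counter

def slot_contexts(sentences):
    """Collect the (previous word, next word) slots each term appears in.
    Terms that share many slots are synonym candidates."""
    contexts = defaultdict(Counter)
    for s in sentences:
        words = s.lower().split()
        for i, w in enumerate(words):
            prev = words[i - 1] if i > 0 else "<s>"
            nxt = words[i + 1] if i < len(words) - 1 else "</s>"
            contexts[w][(prev, nxt)] += 1
    return contexts

def synonym_candidates(term, contexts):
    """Rank other terms by how many slot contexts they share with `term`."""
    target = contexts[term]
    scores = {w: sum((c & target).values())
              for w, c in contexts.items() if w != term}
    return [w for w, s in sorted(scores.items(), key=lambda x: -x[1]) if s > 0]

# Made-up sentences: aspirin and ibuprofen fill the same slot.
sentences = [
    "patients received aspirin for pain",
    "patients received ibuprofen for pain",
    "the trial enrolled healthy volunteers",
]
ctx = slot_contexts(sentences)
print(synonym_candidates("aspirin", ctx))
```

As the talk notes, heuristics like this work only some of the time, which is why curated synonym tables like PubMed's are used as a boost when available.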
Comparison number two is with respect to-- let me just call them template predicates, for lack of a better word. If you have those and you created them solely by deriving them from text, then you don't have world knowledge. You basically have knowledge that just came from the documents. It seems to me that getting from the one to the other is what Cyc was trying to do, but I understand that since they were doing it by hand, they abandoned that and they're now trying to do automatic techniques. So that thread of thought seems to be in the same ballpark as what you're trying to do here, but with a different approach. I was wondering if you can compare and contrast, and maybe there's a third area of endeavor trying to get to that next step up that maybe you could educate us about. BRUCE SCHATZ: Yeah. That is a very, very good comment. For those of you that don't know what Cyc is-- C-Y-C-- it was a very ambitious attempt at MCC to try to encode enough common-sense knowledge about all of the world so that it could automatically do this kind of thing. As Greg said, it was largely a failure. So let me sort of say what the spectrum of possible things is, as a longer answer. Am I running over my time? Is it OK? MALE SPEAKER: It's lunchtime for a lot of these people. Let's say another five minutes and then we'll formally break, and then people who want to hang out, it's OK, we got it. [SIDE CONVERSATION] BRUCE SCHATZ: I usually get lunch people by saying there's free food, but that doesn't work here. AUDIENCE: We all work for food. BRUCE SCHATZ: You all work for food. So Greg asked a very good question about where's the line in automaticness. Well, the old way of solving this problem used to be you had a fixed set of templates. Where that essentially hit a wall is that each small subject area needed a different set of templates, and it was a lot of work to make the templates.
So then there were a set of people who said, if you had a small amount of basic world knowledge, you wouldn't need the templates-- you could automatically make training examples. The problem is that that could only rarely, in very isolated cases, do even as good a tagging as what I am showing. What most of the people do now, and what most of the examples that I was showing are, is a human comes up with a set of training examples of what are typical sentences with genes in them in this particular subject domain. Then the system infers, exactly as you said-- the system infers what the grammar is, what the slots are going to be. There are a few people experimenting with two automatic things, and they don't work at present, but my belief is in the next year or two you'll see research systems with them. If you had a concerted commercial effort after it, you could probably do it and get away with it-- it just wouldn't work all the time. They're essentially either trying to automatically make training sets, so you start out with the collection and you try to pull out sentences that clearly have some slots in them and then just infer things from that. Or they try to automatically infer tags, infer grammar. So you know some things-- like you know body parts and you know genes-- and the question is, can you infer behavior, because you already have slots in the particular subject domain. My feeling is one of those two will work well enough so that you can use it automatically and it will always do some kind of tagging. It won't be as accurate as these, which are generally correct. And it could either just be left as is, so it's like a baseline where everything is tagged and 60% of them are correct and 30% of them are ridiculous, but 60% buys you a lot. Or they could be the input to humans generating it. So the curators I'm working with in biology-- we already have a couple pieces of software that do this.
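The training-example approach described here-- a curator marks typical sentences, and the system infers the slots-- might look like this in miniature. The `<GENE>` markup, the single-word slot, and the example sentences are all hypothetical:

```python
import re

# Hypothetical curator-marked training sentences: the gene slot is tagged.
training = [
    "the <GENE>period</GENE> gene regulates circadian rhythm",
    "expression of <GENE>amfor</GENE> increases in foragers",
]

def infer_templates(examples):
    """Learn (word-before, word-after) slot templates from marked examples:
    a toy version of inferring extraction patterns from training sentences."""
    templates = set()
    for ex in examples:
        m = re.search(r"(\w+) <GENE>\w+</GENE> (\w+)", ex)
        if m:
            templates.add((m.group(1), m.group(2)))
    return templates

def tag_genes(sentence, templates):
    """Apply the learned templates to untagged text. Like the baseline the
    speaker describes, this tags some mentions and misses others."""
    words = sentence.split()
    hits = []
    for i in range(1, len(words) - 1):
        if (words[i - 1], words[i + 1]) in templates:
            hits.append(words[i])
    return hits

templates = infer_templates(training)
print(tag_genes("the vitellogenin gene affects foraging onset", templates))
```

Even this crude tagger illustrates the point that an automatic baseline with imperfect accuracy still "buys you a lot", and its output could feed a human curation step.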
They don't have to look at all the sentences in all the documents. We give them a fixed set of sentences-- these are typical sentences that might be ones you'd want to look at-- and then they extract things out. So there's a human step afterwards that does a selection. Almost all the statistical programs-- I went kind of fast through it-- almost all the statistical programs don't produce correct answers. They produce ranked answers, where the top ones are sort of in the right band. My expectation is the tagging will be like that. So the practical question is which things are good for which kind of text. So I guess we have time for another question if anyone has one. MALE SPEAKER: You actually had another one because you started with your easy one. BRUCE SCHATZ: Should we take someone else? MALE SPEAKER: Let me just suggest, because it is getting close to lunchtime, one last basic question. Given all the information about bees, have you been able to figure out why they're disappearing? BRUCE SCHATZ: It turns out we actually have a summer workshop on exactly that topic. And the answer is, like most things about bees, nobody knows. MALE SPEAKER: So much for that idea. OK, well, thank you very much, Bruce. We appreciate the time. Those of you who want to hang out, Bruce has time to stay this afternoon. We can all have lunch together. BRUCE SCHATZ: I'm generally hanging out today and tomorrow morning, and there's a lot of stuff about the system up on the BeeSpace site, which you're welcome to look at. And the slides are also going to be made available if you want to flip through them. Thank you everyone for staying through the whole thing.
Towards Telesophy: Federating All the World's Knowledge (posted 2014/06/12)