
  • MALE SPEAKER: This is my attempt to increase the

  • sartorial quotient of Google, and it hasn't worked at all.

  • On the other hand--

  • well, I noticed you have a coat on, that's true.

  • Greg Chesson gets two points for showing up with a coat.

  • It's a real pleasure to introduce Bruce Schatz to you.

  • I've known Bruce for rather a long time.

  • My first introduction to him came as we both began getting

  • excited about digital libraries and the possibility

  • of accumulating enormous amounts of information in

  • digital form that could be worked on, manipulated by,

  • processed through software that we hope would augment our

  • brain power.

  • So Bruce has been in the information game for longer

  • than he's actually willing to admit I suspect.

  • He's currently at the University of Illinois,

  • Champaign-Urbana.

  • As you will remember, that's also the area where the

  • National Center for Supercomputing

  • Applications is located.

  • Bruce was around at the time when Marc and Jason were doing

  • work on the first browsers, the Mosaic versions of the

  • browsers derived from Tim Berners-Lee's work.

  • Actually, the one thing that Bruce may not realize he gets

  • credit for is teaching me how to pronounce

  • Caenorhabditis elegans.

  • I looked at it before and I couldn't figure out, and maybe

  • I didn't even say it right this time.

  • But this is a tiny little worm that consists of 50 cells.

  • It was the first living organism that we actually

  • completely sequenced the genome for.

  • Then we got interested in understanding how does the

  • genome actually reflect itself as this little worm develops

  • from a single fertilized cell.

  • So Bruce introduced me to the idea of collecting everything

  • that was known about that particular organism, and to

  • turn it into a database that one could manipulate and use

  • in order to carry out research.

  • Well, let me just explain a little bit more about his

  • background and then turn this over to him, because you're

  • here not to listen to his bio, but to listen to

  • what he has to say.

  • He's currently director of something called CANIS--

  • C-A-N-I-S. I thought it had to do with dogs

  • until I re-read it.

  • It stands for Community Architectures for Network Information

  • Systems.

  • BRUCE SCHATZ: That's why they let me in the building.

  • MALE SPEAKER: I'm sorry.

  • BRUCE SCHATZ: That's why they let me in the building.

  • MALE SPEAKER: Because along with the other

  • canines that are here.

  • It's at the University of Illinois, Champaign-Urbana,

  • and he's been working on federating all the world's

  • knowledge, just like we are, by building pioneer research

  • systems in industrial and academic settings.

  • He's really done a lot of work over a period of 25 or 30

  • years in this domain.

  • The title of the talk uses the term telesophy, which he

  • introduced as a project at Bellcore in the 1980s.

  • Later on, he worked at UIUC on something called DeLIver

  • D-E-L-I-V-E-R, and now more recently on semantics.

  • That's the reason that I asked him to come here.

  • He's working on something called BeeSpace, which is

  • spelled B-E-E, as in the little buzzing organism.

  • This is an attempt as I understand it, but I'm going

  • to learn more, an attempt to take a concept space and

  • organize it in such a way that we can assist people thinking

  • through and understanding more deeply what we know about that

  • particular organism.

  • So this is a deep dive into a semantic problem.

  • So I'm not going to bore you with any more biographical

  • material, except to say that Bruce has about nine million

  • slides to go through, so please set your modems at 50

  • gigabits per second because he's going to have to go that

  • fast to get through all of it.

  • I've asked him to leave some time at the end for questions.

  • I already have one queued up.

  • So Bruce, with that rather quick introduction, let me

  • thank you for coming out to join us at Google and turn

  • this over to you to teach us about semantics.

  • BRUCE SCHATZ: Thank you.

  • I have one here, so you can just turn yours off.

  • Thank you.

  • I was asked to give a talk about semantics, which I

  • supposedly know something about.

  • So this is going to be both a talk that's broad and deep at

  • the same time, and it's going to try to do something big and

  • grand, and also try to do something deep that you can

  • take away with you.

  • So that may mean that it fails completely and does none of

  • those, or maybe it does all of those.

  • I've actually been giving this talk for 25 years and--

  • now, of course, it doesn't work.

  • Am I not pointing it in the right place?

  • I'm pushing it but it's not going.

  • Oh, there it goes.

  • OK, sorry.

  • Can you flip it back there?

  • Sorry about that.

  • Small technical difficulty, but the man behind the curtain

  • is fixing it.

  • So I gave this talk first more than 20 years ago in the hot

  • Silicon Valley research lab that all the grad students

  • wanted to go to, which was called Xerox PARC.

  • I think a few people actually have heard of Xerox PARC.

  • It sort of still exists now.

  • We went down completely?

  • There we go.

  • Thank you very much.

  • I was pushing this idea that you could federate and search

  • through all the world's knowledge, and the uniform

  • reaction was, boy, that would be great,

  • but it's not possible.

  • And I said, no, you're wrong.

  • Here, I'll show you a system that searches across multiple

  • sources and goes across networks, and does pictures

  • and text and follows links, and I'll explain each piece

  • about how it works.

  • Then they said, that's great, but not in our lifetime.

  • Well, 10 years later was Mosaic and the web.

  • And 20 years later I'm delighted to be here, and all

  • of you have actually done it.

  • You've done all the world's knowledge to some degree.

  • What I want to talk about is how far you are and what you

  • need to do before you take over the rest of the world and

  • I die, which is another 20 years.

  • So what's going to happen in the next 20 years.

  • The main thing I'm going to say is a lot's happened on

  • tele, but not too much on sophy.

  • So you're halfway to the hive mind, and since I'm working on

  • honey bees, at the end you will see a picture of honey

  • bees and hear something about hive minds, but it will be

  • very short.

  • Basically, if you look at Google's mission, the mission

  • is doing a lot about access and organization of all the

  • world's knowledge.

  • Actually, to the degree that's possible, you do an excellent

  • job of that.

  • However, you do almost nothing about the next stages, which

  • are usually called analysis and synthesis.

  • Solving actual problems, looking at things in different

  • places, combining stuff and sharing it.

  • And that's because if you look at the graph of research over

  • the years, we're sort of here, and you're doing commercially

  • what was done in the research area about 10 years ago, but

  • you're not doing this stuff yet.

  • So the telesophy system was about here.

  • Mosaic was up to about here.

  • Those are the things that

  • searching across many sources--

  • like what I showed, we're really working pretty well in

  • research labs with 1,000 people.

  • They weren't working with 100 million.

  • But if Google's going to survive 10 more years, you're

  • going to have to do whatever research systems do here.

  • So pay attention.

  • This doesn't work with students.

  • With students I have to say I'm going to

  • fail you at the end.

  • But you have a real reason, a monetary reason, and a moral

  • reason to actually pay attention.

  • So back to the outline.

  • I'm going to talk about what are different ways to think

  • about doing all the world's knowledge, and how to go

  • through all the levels.

  • I'm going to do all the levels and sort of say you are here,

  • and then I'm going to concentrate on the next set of

  • things that you haven't quite got to.

  • The two particular things I'm going to talk about are

  • scalable semantics and concept navigation, which probably

  • don't mean anything to you now, but if I do my job right,

  • 45 minutes, actually now 10 of them are up, so 35 minutes

  • from now they will mean something.

  • At the end I'm going to talk about suppose you cared about

  • this enough to do something, what kind of big thing would

  • you actually do?

  • I sort of do these big, one-of-a-kind pioneering projects

  • with stuff that doesn't quite work just to

  • show it's really possible.

  • So the overall goal is-- you probably all grew up on

  • reading cyberspace novels-- sort of plugging in your head and

  • being one with all the world's knowledge.

  • Trying to sort of get the concepts in your head to match

  • whatever is actually out there in a way that you can

  • get what you want.

  • The problem is over time what the

  • network can do has increased.

  • So in the--

  • I can't say the old days, man--

  • in the good days, people worked on packets and tried to

  • do data transmission.

  • The era that I sort of worked mostly in was an object era

  • where we try and give the information to people to do,

  • [UNINTELLIGIBLE] to do pictures.

  • All the action in big research labs now is on concepts, is on

  • trying to do deeper things, but it still has to

  • work like these do.

  • They work everywhere.

  • So you don't have a specialized AI program that

  • only works for income taxes.

  • That's not good enough.

  • No Google person would ever do something that only works in

  • one case, unless there was a huge amount of

  • money behind it.

  • I'll stop making money comments, but the food is

  • great here.

  • So this is one common layout, and there's four or five

  • others, which in the absence of time, I will omit.

  • But if you want to talk to me afterwards, there's lots of

  • points of view about how to get from here to there, where

  • there is always all the world's knowledge, and here is

  • whatever you can do now.

  • Depending on what point of view you take, it's possible

  • to go to the next step differently because you have a

  • different orientation.

  • So the one that I'm going to do in this talk is the

  • linguistic one, which usually goes syntax, structure,

  • semantics, pragmatics.

  • So syntax is what's actually there, like an actual set of

  • bits in a file, a set of words in a document.

  • Structure is the parts, not the wholes.

  • So if you parse something into structure, you can tell that

  • this particular thing is a person's name, this is the

  • introduction to a paper, this is the methods part.

  • You can tell what the parts are and you can search those

  • differentially.

  • Semantics is when you go inside and you try to get

  • something about the meaning, and as you'll see, people have

  • pretty much given up on doing real meaning, and they pretty

  • much try to do, rather than meaning,

  • they try to do context.

  • What's around it in a way that helps you understand it.

  • Actually, when Google was a research project, and the

  • people that started it were actually on the Stanford

  • Digital Library Project, I was running the Illinois Digital

  • Library Project at the same time, they said there's enough

  • context in web links to be able to really do something.

  • There were a lot of people that said no, web links are

  • made for all sorts of things, and they don't have any

  • semantics, and they're not useful at all.

  • But obviously, they were wrong enough to make this building

  • and employ all of you.

  • The real goal is down here in doing actual reality, in doing

  • with so-called pragmatics.

  • Pragmatics is sort of when you use something.

  • So it's task-dependent.

  • The meaning of something is always the same.

  • So if this is a gene that regulates cancer,

  • it always does that.

  • But lots of time, the task you're working on varies what

  • you're interested in, what you know.

  • I'm not going to say very much about pragmatics because

  • people haven't gotten very far on it in terms of doing it at a

  • big, grand scale.

  • But I actually know quite a bit about it.

  • If you really wanted to solve health care, for example,

  • you'd have to go down the pragmatic route and try to

  • measure people with as large a vector as you

  • can possibly get.

  • And again, if people are interested, that's a topic I'd

  • be happy to talk about, but it's off this particular talk.

  • This particular talk is about federation, as I said.

  • So what does it mean to federate each

  • one of those levels?

  • So to do syntax federation, which is what the telesophy

  • system pioneered, and for the most part, what Google does in

  • the sense of federating all the web sources that are

  • crawled, is essentially to send the same

  • query to every different place.

  • So true syntax federation, which is actually what

  • telesophy did, but not really what Google does, is you start

  • at your place and you go out to each one of the sources and

  • they have to remember where they are on the network.

  • They might go up and down, and so you might

  • have to retry them.

  • And you have to know what syntax the queries need.

  • And when the results come back, you have to know how to

  • handle that.

  • You have to do a lot about eliminating duplicates when

  • the results come back.

  • So a very common problem is you send out a query to try to

  • get a certain Beatles song, and you get back 5,000 of

  • them, but they're all slightly different, and they're in

  • different languages and they have different syntax.

  • Merging those all together is really complicated.

  • So that's what syntax federation is.
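The fan-out, retry, and duplicate-merging loop described above can be sketched as follows. This is a minimal illustration: the in-memory "sources" and the string-similarity threshold are stand-ins for real remote sources and real duplicate detection.

```python
from difflib import SequenceMatcher

def federated_search(query, sources, retries=2):
    """Send the same query to every source, retrying ones that are down."""
    results = []
    for search in sources:
        for _ in range(retries + 1):
            try:
                results.extend(search(query))
                break
            except ConnectionError:
                continue  # source may be up and down; retry it
    return merge_duplicates(results)

def merge_duplicates(titles, threshold=0.85):
    """Collapse near-duplicate results (the 5,000 slightly different songs)."""
    merged = []
    for t in titles:
        if not any(SequenceMatcher(None, t.lower(), m.lower()).ratio() >= threshold
                   for m in merged):
            merged.append(t)
    return merged

# Two toy "sources" returning the same record in slightly different syntax
src_a = lambda q: ["Let It Be - The Beatles"]
src_b = lambda q: ["Let it be -- the Beatles", "Yesterday - The Beatles"]
print(federated_search("beatles", [src_a, src_b]))
```

The quadratic pairwise comparison in `merge_duplicates` is exactly why merging at scale is complicated; real systems use normalization and hashing instead.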

  • Structure federation, which is what this--

  • DELIVER was the DLI, the Digital Library Initiative

  • project that I ran at the University of Illinois.

  • It was about engineering literature; it went out to 10

  • major scientific publisher sites on the fly and allowed

  • you to do a structured query.

  • So you could say find all the papers in physics journals

  • that are within the last 10 years that mention

  • nanostructures in the figure caption or in the conclusion.

  • So you're making use of the parts of the papers.

  • And scientists, at least, put a great deal of

  • effort into doing that.
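The structured query described above can be sketched against documents whose parts have already been made uniform. This is a minimal illustration; the field names and records are invented for the example, and a real system would query remote publisher sites rather than an in-memory list.

```python
# Toy documents with uniform part mark-up (fields are illustrative)
papers = [
    {"journal": "Physical Review B", "year": 2004,
     "figure_captions": ["TEM image of nanostructures"], "conclusion": "..."},
    {"journal": "Physical Review B", "year": 1988,
     "figure_captions": ["band diagram"], "conclusion": "nanostructures remain open"},
    {"journal": "Cell", "year": 2004,
     "figure_captions": ["nanostructures in membranes"], "conclusion": "..."},
]

def structured_query(docs, journal_word, since, term):
    """Match a term only in specific parts (caption or conclusion), not whole text."""
    hits = []
    for d in docs:
        in_parts = (any(term in c for c in d["figure_captions"])
                    or term in d["conclusion"])
        if journal_word in d["journal"] and d["year"] >= since and in_parts:
            hits.append(d)
    return hits

print(len(structured_query(papers, "Physical", 1996, "nanostructures")))  # 1
```

Note how the second paper mentions the term in its conclusion but fails the date restriction, and the third matches a caption but is not a physics journal; the parts and the metadata filter together, which flat full-text search cannot express.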

  • In order to do that, you have to figure out some way of

  • making the mark-up uniform.

  • So you have problems that you just started to see in the

  • syntactic world, like who's an author?

  • If you have a physics paper that has 100 authors, which

  • one of them is the author?

  • It might not be any of them actually, it might be the

  • organization that did it.

  • Or if you have a movie, who's the author of a movie?

  • Is it the producer, the writer,

  • the star, the director?

  • So there's a lot of problems there in how you do the

  • mark-up uniformly and how you make

  • different values the same.

  • For the most part, structure has not made it into mass

  • systems yet, although there have been a lot of attempts to

  • try to make languages for structure like the semantic

  • web that Vint and I were talking about beforehand.

  • But the amount of correctly marked-up structured text is

  • very small right now.

  • So if you were going to use it to search the 10 billion items

  • that you can crawl on the web now, you

  • wouldn't get very far.

  • Semantics federation, which is what I'm going to talk about

  • today mostly, is about a completely different topic.

  • It's about going inside and actually looking at the

  • phrases and figuring out the meaning, as much of the

  • meaning as you can.

  • And then when you have many small pieces, trying to match

  • something that's the same here to something the same here.

  • And doing that uniformly is the job of semantics

  • federation.

  • So let me now go into the first of the

  • two technical topics.

  • So the first topic I'm going to do is how do you actually

  • represent the things, and that's going to be a little

  • slow going.

  • Then I'm going to give some examples of if you're able to

  • get this deeper level representation, this deeper

  • level structuring, what kind of system you can build.

  • It's in a somewhat specialized domain.

  • It's in biology and medicine, because well, if you're a

  • professor and you work at a university, that's where you

  • can get money to work on things.

  • You can't get money to work on the kind of things that are

  • arbitrarily on the web.

  • So scalable, so we're now into scalable semantics.

  • I've been using this for 10 years, and every once in while

  • someone will stand up and say that's an oxymoron, it doesn't

  • make sense because semantics means really deep, and

  • scalable means really broad, and those pull in opposite

  • directions.

  • And I said yes, you understood what the problem is.

  • So in the old days, what it used to mean is--

  • what semantics used to mean is you do deep meaning.

  • So you had a deep structure parser that would go in and

  • figure out yes, this document was on operating systems that

  • only work on this class of computers, and only solved

  • this class of physics problem.

  • So it's on a very narrow, detailed topic.

  • There were many, many AI systems made that did that.

  • What happened when the government started putting

  • large amounts of money into it-- so most of this got

  • developed in the--

  • the base technology got developed in the DARPA TREC

  • program trying to read newspaper articles looking for

  • what would now be called terrorists.

  • What they found basically is the deep

  • programs were very narrow.

  • If you trained something to recognize income taxes, or you

  • trained something to recognize high-powered rifles, it

  • wouldn't help at all in the next one.

  • And there were just too many individual topics to try to

  • pick out the individual types of sentences

  • and individual slots.

  • So what happened is the broad ones beat out the deep ones

  • when the machines got really fast. When it became clear,

  • and I'll show you some machine curves, when it became clear

  • that you could actually parse noun phrases arbitrarily out,

  • then people begin using noun phrases.

  • When it became clear you could do what are called entities,

  • in other words, you could say this phrase

  • is actually a person.

  • This phrase is actually someone that lives in

  • California.

  • Then people started using it.

  • Basically what happened is semantics changed from being

  • we know everything about this particular topic and this

  • phrase means one thing, it's meaning type 869, to we have 20 kinds

  • of entities, and this is a gene, and it occurs with this

  • other gene.

  • So we'll say if you search for this gene and it doesn't work,

  • you should search for this other one.

  • I'll show you lots of cases where that sort of guilt by

  • association really helps.

  • I'm not defending it necessarily as being real

  • semantics, I'm defending it as something that you can do

  • everywhere.

  • So the upshot is this is an engineering problem.

  • It's a question of if you could do deep parsing and say

  • yes, this person wasn't--

  • it's true they said they were interested in ice cream cones,

  • but they really meant pine cones when they said cone,

  • then you would do that.

  • But it's generally not possible to do that, except in

  • very isolated circumstances.

  • So you end up thinking globally, thinking about all

  • possible knowledge, but acting locally.

  • I guess this is a green building so I'm allowed to

  • make this kind of joke.

  • So you look at a small, narrow collection, and analyze the

  • context, what occurs with each other very precisely, and do

  • something there.

  • And that creates one good situation.

  • In other words, it means now you're able to go much deeper,

  • and I'll show you lots of examples of going much deeper.

  • But it creates one bad situation, which is

  • traditionally information retrieval works like Dialog

  • did in my era, or like Google does now.

  • You take everything you can get, and pile it into one big

  • huge server farm, and then you search it.

  • You index it once in one big index and you search it.

  • Well, the problem is if you want to go deeper in semantics

  • that doesn't work, because you mixed

  • together too many things.

  • You have to unmix them, and then you have to worry about

  • how to get from here to there.

  • So you change a central problem into a distributed

  • problem with all of the hard features that go with

  • distribution.

  • What this is doing in terms of-- if you

  • want a physical analogy.

  • For many years I taught at a library school.

  • The way indexes work in the real world is for really big

  • topics like if you have electrical engineering,

  • there's a society that is big enough and well-defined enough

  • to employ people to tag every topic.

  • So they say here's an article about Windows, this one is

  • about operating systems. Here's an article about

  • Windows, this one is about heat conservation.

  • A person who's looking at that, and out of their

  • selection of all the topics, they say which topics the

  • things are on.

  • That worked fine as long as most of the information in the

  • world was in these large, but fairly small number of

  • well-defined databases.

  • That's not the world we're living in now.

  • We're mostly living in this world.

  • So there still are a very large number of big formal

  • databases that are done by hand, but nearly all the

  • databases, nearly all the collections, are these

  • informal ones with communities or groups or individuals.

  • The advance of crawling technology that's been able to

  • take all these and collect them all together into one big

  • place has actually made the problem worse because now

  • there's not only apples and oranges and pears altogether,

  • but there's lots of things that aren't fruit at all and

  • aren't really anything, but they're in there.

  • So there's many different things that you don't know how

  • to deal with, and you have to do something

  • automatically with them.

  • It's not the case that you can get--

  • my daughter who keeps track of all the cats on the block and

  • has a website with their pictures, it's not the case

  • that you can get her to employ a professional curator from

  • the library school who will tag those correctly so that

  • someone who's a cat fancier in the next town can see them.

  • That's not true.

  • You need some kind of automatic support.

  • So I'm going to talk about the automatic support.

  • I'm doing OK for time.

  • There's two things.

  • I'm going to talk about entities and I'm going to talk

  • about concepts.

  • So here are entities.

  • What entities are is trying to figure out what type of thing

  • something is.

  • So one way is you have hand-tagged XML, like the

  • mark-up, like the semantic web.

  • So they take a particular domain and they say there are

  • 20 types here, and we'll mark-up

  • each document correctly.

  • So if we're in humanities that might work pretty well.

  • This is a person, this is a place, this is a type of vase,

  • this is a time period in Roman history.

  • If you're out on the web in that situation where 90% of

  • the stuff's informal, then even if there was a systematic

  • set of types, the people aren't going to do it.

  • So if you have well marked-up hand ones you're going to use

  • them, but if you don't then you have to

  • do something automatic.

  • The thing that tends to work automatically is to try to tag

  • things by machine with training sets, and I'm going

  • to say a little bit about what that means.

  • First you go into the document and you pull out the phrases.

  • So you don't do whole words.

  • And in fact, over time the experimental systems I've built

  • have gotten better the more you can get away from words

  • and change them into whole phrases that are the

  • equivalent phrase that works in that particular domain.

  • Right now search engines don't do that.

  • That's a big part of the problem.

  • Then you have to recognize the part of speech.

  • Is it a noun or a verb or an object?

  • Again, 10 years ago, you needed a specialized grammar

  • and it only worked in a particular subject.

  • Now there are machine learning algorithms,

  • trained on enough things, that you can get very

  • high accuracy, up in the high 90s,

  • with parts of speech.
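The phrase-extraction step can be sketched as a toy noun-phrase chunker over pre-tagged tokens. Real systems use trained statistical taggers (the high-90s accuracy just mentioned); here the Penn-Treebank-style part-of-speech tags are hand-supplied purely for illustration.

```python
def noun_phrases(tagged):
    """Greedily collect runs of adjectives/nouns as candidate phrases."""
    phrases, current = [], []
    for word, tag in tagged:
        if tag in ("JJ", "NN", "NNS", "NNP"):  # adjective or noun tags
            current.append(word)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

# Hand-tagged example sentence from the biology discussion below
sentence = [("the", "DT"), ("foraging", "JJ"), ("gene", "NN"),
            ("encodes", "VBZ"), ("a", "DT"), ("cyclic", "JJ"),
            ("GMP", "NNP"), ("protein", "NN"), ("kinase", "NN")]
print(noun_phrases(sentence))  # ['foraging gene', 'cyclic GMP protein kinase']
```

The point is that the indexable units become whole phrases like "foraging gene", not the individual words "foraging" and "gene".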

  • And in fact, there are actually systems-- and you can tell this

  • was secretly funded by the CIA under some other name-- that

  • recognize people, places, and things pretty accurately.

  • So if you want to recognize newspaper articles and

  • automatically tag these correctly, it actually does a

  • pretty good job.

  • Again, commercial search engines tend not to use those.

  • So here's an example of entities in biology.

  • These won't mean very much, but they'll

  • give you the feeling.

  • Here's a kind of functional phrase.

  • A gene is a type of an entity and encodes a chemical.

  • So here's an example.

  • The foraging gene encodes a cyclic GMP protein kinase.

  • So this is one of the entities and this is the other entity.

  • In scientific language things are very regularized, so

  • there's lots of sentences that are actually that easy.

  • Or here's another one.

  • Chemical causes behaviors.

  • Here's one that's a little harder.

  • I tried to put one a little harder.

  • This one says gene regulates behavior, but that's not in

  • the sentence.

  • What's actually in the sentence is this gene, which

  • is an ortholog of this other gene-- so it doesn't say

  • directly, it says indirectly it's a gene--

  • is involved in the regulation, which is not the same phrase

  • as regulates.

  • So you have to do a little bit of parsing to get a phrase

  • like gene regulates behaviors.

  • But the natural language technology is now good enough

  • to do that accurately.
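The light parsing described above can be suggested with hand-written surface patterns. These regexes are purely illustrative stand-ins: real extractors are trained on large corpora rather than written by hand, and the relation names are invented for the example.

```python
import re

# Illustrative surface patterns for the two relations discussed above
PATTERNS = [
    (re.compile(r"(?:the\s+)?(\w+) gene encodes (?:an?\s+)?(.+?)\.", re.I),
     "gene-encodes-chemical"),
    (re.compile(r"(\w+), an? ortholog of (\w+), is involved in the regulation of (\w+)",
                re.I),
     "gene-regulates-behavior"),
]

def extract(sentence):
    """Return the first matching relation and its entity slots, if any."""
    for pattern, relation in PATTERNS:
        m = pattern.search(sentence)
        if m:
            return relation, m.groups()
    return None

print(extract("The foraging gene encodes a cyclic GMP protein kinase."))
```

The regularized style of scientific prose is what makes even shallow patterns like these productive; the ortholog example shows why indirect phrasings need extra patterns (or real parsing).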

  • I did do a little bit of prep and look at some of the

  • commercial systems that were doing this.

  • If you want to ask a question about those

  • later I'll make a comment.

  • But they're all competitors, they're not Google, so I

  • didn't want to say up front.

  • The last comment I'm going to make about entities is they

  • come in different varieties.

  • That means that sometimes you'll do them and

  • sometimes you won't.

  • So there's some of them, and again,

  • these are biology examples.

  • There's some of them that are just straight lists, so the

  • names of organisms, like honey bee or fruit fly are almost

  • always exactly those same words.

  • So those are easy entities to tag very accurately.

  • Things like genes or parts of the body vary somewhat, but

  • there often are tag phrases that say this is the part of a

  • body and here it is.

  • It's a wing.

  • Or this is a gene and it's the foraging gene.

  • So there are often tags there.

  • If you get training sets you do pretty well.

  • Then there's really hard things like what kind of--

  • these are sort of functional phrases--

  • what kind of behavior is the honey bee doing?

  • What kind of function does the computer operate with?

  • Those ones are almost always different, so you need a

  • really big training set to do those accurately.

  • If you were going to try to do entities across all the

  • world's knowledge, you would have two problems. I think

  • that's the last thing I'm going to say on this, yes.

  • The first is you would have to try to make a run at the hard

  • ones, or at least say well, we're only going to do these

  • because that's all we can do uniformly.

  • The second thing is you have to realize that the entities

  • are different in each major subject area.

  • So the biology ones are not the same as the medicine ones,

  • which are more disease-like, and the medicine ones aren't

  • the same as the physics ones, and the physics ones aren't

  • the same as the grocery store ones.

  • My guess is there's a relatively limited number of

  • popular ones if you're back in the same style of

  • trying to classify all the web's knowledge, like Yahoo!--

  • that used to be Yahoo!'s main strategy, for instance--

  • that there are a couple hundred really important ones

  • and a couple thousand big ones.

  • So if you had enough money and enough expert teams and set

  • each one up to making training sets, you could actually do

  • entities all the way across.

  • A research project can't muster that except in one

  • small area.

  • That's all I'm going to say about entities.

  • Now, let me explain just a little bit about what you do

  • with entities, and then give a big example.

  • So what you do with entities-- you might think

  • you're going to answer questions with them, and

  • that's what the commercial systems are doing.

  • You can sort of answer questions, so you can say this

  • gene seems to affect this behavior in this organism.

  • So you can say what are all the things that affect

  • foraging in insects and get out lots of--

  • this is sort of you have a relational table.

  • You take a document and change it into a relational database.

  • You can answer that kind of question, but there's lots of

  • kinds of questions you can't answer.

  • What you can do is after you extract these entities, these

  • units, is you can compute these context graphs.

  • You can see in this document how often do these two things

  • occur together.

  • That one you get a lot of mileage from, because if you

  • try to search for this one and you can't find it, you can

  • search for this other one.

  • Or if you're trying to search for this one and you can't

  • find it, you can go down the list of the ones it commonly

  • occurs with and it's sort of a suggestion facility.

  • People that watch search in libraries, what they typically

  • comment on is people don't know what words to try.

  • They'll try all the words they can think of and then they'll

  • start searching dictionaries or looking at other papers or

  • asking the people next to them.

  • So since you can automatically do suggestion by making this

  • graph of all entities that are related to all the other

  • entities in terms of how often they occur together in a

  • collection, then you can use it for suggestion.
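The co-occurrence graph and suggestion facility can be sketched as follows, assuming entities have already been extracted per document; the entity names here are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(documents):
    """Count how often each pair of entities occurs in the same document."""
    graph = defaultdict(lambda: defaultdict(int))
    for entities in documents:
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

def suggest(graph, term, k=3):
    """Suggest the terms that most often co-occur with the query term."""
    related = graph.get(term, {})
    return sorted(related, key=related.get, reverse=True)[:k]

# Each toy "document" is the list of entities extracted from it
docs = [["foraging", "PKG", "honey bee"],
        ["foraging", "PKG"],
        ["foraging", "octopamine"]]
g = cooccurrence_graph(docs)
print(suggest(g, "foraging"))  # 'PKG' ranks first: it co-occurs twice
```

This is the guilt-by-association idea in miniature: when a search for one term fails, the ranked neighbors give the user other words to try.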

  • This is my computer engineering slide.

  • The other unusual feature about Google that people

  • didn't predict is could you build a big enough

  • supercomputer to handle 10 billion items. And Dialog

  • would have said no because IBM will not sell you that many

  • platters and that big a thing.

  • Well, what they didn't realize was what the rise of PCs would

  • do if you hook things together and you could partition the

  • problem enough.

  • The research people hit that same curve a decade earlier.

  • So I was trying to do these relations--

  • this is my six or seven year history.

  • These are all how big a collection you can do and find

  • these entities and these relations basically on

  • workstations.

  • So this is like a Sun-2, and that's a Sun-3, and this is a

  • network of Sun-3's, about 10 of them.

  • This one is discovering supercomputers at NCSA and you

  • could get 1,000 all at one time.

  • That made a big difference, and it meant-- in fact, this

  • was a big hero experiment.

  • It was the first supercomputer computation

  • in information retrieval.

  • For quite a while, it was the biggest computation that NCSA

  • had ever done.

  • They couldn't figure out why you'd want to integrate all

  • the world's knowledge.

  • Why would anybody want to do that?

  • I think in 1998, Google was probably about 10 employees.

  • So that question hadn't come up yet.

  • The number of articles in Medline was still much greater

  • than the number of articles on the web.

  • So here's what that computation was like.

  • It had about 280 million concepts, so that

  • number was big then.

  • It's now small.

  • However, if you fast forward to today, the machines are a lot

  • faster, so the server I just bought for $25,000 has more

  • memory than that supercomputer eight years ago.

  • These are big memory computations.

  • You can guess it's got a big matrix inside that has all the

  • phrases versus all the phrases and how often they occur.

  • So the more physical RAM you have the better.

  • What it turns out is if you're able to put a connection

  • graph, so this is a graph of which terms are related to

  • which other terms all in memory.

  • And it's a nice graph, like a small-world graph, which

  • looks kind of like this.

  • So there's a group here that's all sort of connected and

  • another group here.

  • So it comes in groups.

  • That tends to be true of just about any kind of text that

  • people have seen.

  • Then you can find all the inter-relations really fast

  • because you don't have to look at this one versus this one

  • because you know that you can stop here.

  • So there's a way of ending the propagation.
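
That stopping rule can be illustrated with a short sketch (hypothetical code, with an assumed dict-of-dicts graph of co-occurrence weights): expand outward from a term, but only follow edges above a weight threshold, so the walk naturally dies out at the edge of a tightly connected group instead of touching the whole graph.

```python
def related_terms(graph, seed, min_weight=2, max_hops=2):
    """Expand outward from `seed`, following only edges whose
    co-occurrence weight is at least `min_weight`; weak edges end
    the propagation, so the walk stays inside one cluster of the
    small-world graph instead of visiting every term."""
    frontier, seen = {seed}, {seed}
    for _ in range(max_hops):
        nxt = set()
        for term in frontier:
            for neighbor, weight in graph.get(term, {}).items():
                if weight >= min_weight and neighbor not in seen:
                    seen.add(neighbor)
                    nxt.add(neighbor)
        frontier = nxt
    return seen - {seed}
```

Because weak cross-cluster edges are never followed, the work per query is proportional to one cluster, not the whole collection, which is what makes the while-you-wait computation feasible.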

  • What that means is you can now do things on the

  • fly while you wait.

  • So you don't have to pre-compute the collections

  • anymore, which is what we had to do before.

  • You can do a search, make a new collection, and then make

  • it semantic.

  • Then make it deep.

  • You can cluster it on the fly into little chunks, you can

  • find these inter-related graphs while you wait.

  • And that's while you wait with a $25,000 server.

  • If you had something the size of Google you could not only

  • do all the world's knowledge, which isn't that much

  • proportionally bigger, but you could also do deeper.

  • So that's why now I want to show you what a real

  • concepts-based system looks like so that you get some

  • feeling as to how different the interaction is.

  • Generally, there are two things in a space system.

  • One of them is called federation--

  • I've been talking about that before.

  • It's how do you go from one collection to another.

  • The other is called the integration.

  • It's if you have an entity what can you go out to.

  • We're talking about going across collections, and I

  • didn't mean to say this was replacing IP.

  • IP is under everything.

  • But what I meant is it's replacing words.

  • This was the first interspace system, the

  • one DARPA paid for.

  • There aren't words in this anymore.

  • When you point to this you get--

  • you get that whole phrase, simple analgesics, and all the

  • things that are equivalent to it phrase-wise after you do

  • all the linguistic parsing.

  • So it looks like a bunch of words, but it isn't.

  • It's a bunch of inter-related concepts and those are

  • uniformly indexed across all the sources.

  • So you can go from simple analgesics here to all the

  • concepts, all the phrases that it's nearby, to the ones that

  • are nearby there, to the documents, to which little

  • cluster it's in.

  • You can sort of go from concept to concept, the

  • concept across all the different sources.

  • The main reason I showed this was to just show that words

  • don't exist anymore.

  • You've got this deeper-level thing, which I tried to

  • convince you earlier was possible to do.

  • Also because this DARPA project broke up in 2000, just

  • before 9/11, when DARPA yanked the plug

  • and decided they didn't want to help analysts anymore.

  • Every person on the project went to work at Microsoft.

  • So it's entirely possible that Windows 2010 is going to have

  • all this stuff in it.

  • Yes, question?

  • AUDIENCE: That's the [INAUDIBLE].

  • BRUCE SCHATZ: Which one, this one?

  • AUDIENCE: The hexagon, [INAUDIBLE].

  • BRUCE SCHATZ: It is actually, and let me postpone that

  • because I'm going to answer it better later.

  • But basically, it's taking the document collection and

  • grouping it in to individual groups of documents which have

  • a similar set of phrases in them.

  • This is just a bad graphical representation of it.

  • But I'll give a good example of it later.

  • So yes, it's quite meaningful, and you'll see in the session

  • what its utility is.

  • So what I'm actually going to talk about now in the last

  • five minutes of my talk is this BeeSpace system.

  • It is about honey bees, so you're allowed to have cute

  • pictures and cute puns.

  • Yeah, see?

  • The college students don't ever laugh at this, but I

  • always thought it was funny, bee-havior.

  • So much for that.

  • It must be an age thing.

  • So the point of this system is you make many, many small

  • collections, and you know something, and you want to use

  • your terminology and your knowledge to

  • go somewhere else.

  • So you want to go from molecular biology into bees,

  • into flies, into neuroscience.

  • So I'm working with the person that actually is the national

  • lead on the honey bee genome.

  • What it does inside is basically uses this scalable

  • semantics technology to create and merge spaces--

  • you'll hear a lot about spaces in the next five minutes, so I

  • won't explain right now--

  • to try to find stuff.

  • So it's complete navigation, complete abstraction, but

  • finding things when you don't know what you started with.

  • Space is a paradigm, not a metaphor.

  • I hope I'm not offending any user interface people.

  • I'm not sure if Dan is still sitting in the back.

  • In other words, there really are spaces in there.

  • You take a collection, you make it into a space.

  • You can then merge two of them, you can pull out part of

  • it, you can break it into parts and make one part of

  • that the whole space.

  • So it's like you have all the world's knowledge and you're

  • breaking it into conceptual spaces which you can

  • manipulate.

  • You personally, plus you can share them with other people.

  • So it has quite a different character than you're trying

  • to do a search and you get a set of results back.

  • This particular one does do entities very universally, but

  • it only does concepts and genes because that's all this

  • subject area needed.

  • So please don't criticize that particular one.

  • It was chosen narrowly because we wanted to at least have one

  • that did those uniformly.

  • These are the main operations I'm now going to show you very

  • quickly through a session.

  • If you go to the BeeSpace site, which was on that bag.

  • It's beespace.uiuc.edu.

  • You can use the system and beat it to death, assuming you

  • can read Medline articles, which you may or

  • may not be able to.

  • So extract is going to take a space and figure out all the

  • special terms that distinguish that space and

  • have a way of searching.

  • Mapping is going to go back the other way.

  • It's going to take a space and break it into parts, and then

  • you can turn each one into a space itself.

  • This is space algebra, and this is the summarization.

  • If you find an entity, it does something with it.
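
The space-algebra operations described here can be sketched with a toy `Space` class (an illustrative stand-in; the names and the idea that a space is a set of documents are assumptions, and the real system also carries the semantic indexing along with each space):

```python
class Space:
    """A space as a named collection of documents that supports
    set algebra: merge two spaces, or intersect them."""
    def __init__(self, name, docs):
        self.name, self.docs = name, set(docs)

    def intersect(self, other):
        # Documents common to both spaces.
        return Space(f"{self.name} & {other.name}",
                     self.docs & other.docs)

    def merge(self, other):
        # All documents from either space.
        return Space(f"{self.name} | {other.name}",
                     self.docs | other.docs)
```

A session like the one shown later boils down to calls like `behavior_space.intersect(marine_space)`, with the result immediately usable as a fully fledged space.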

  • You probably don't care about this example, but it's looking at

  • behavioral maturation.

  • It's looking at a honey bee as it grows up, it takes on

  • different societal roles.

  • It takes care of the babies, it goes out and forages for

  • food, and looking at that across different species.

  • So it's a complicated question.

  • It's not one that there's a well-defined answer to.

  • So now we're into the BeeSpace system, which is

  • running right now.

  • So you type behavioral maturation, you choose a

  • particular space that was already made, it's

  • insects, it's about 100,000 articles, and you do browse.

  • So that gets about 7,000 articles, which are here,

  • which is too much to look at.

  • The problem was behavioral maturation

  • wasn't the right term.

  • The first thing the system's doing is it's extracting.

  • It tries to go in and analyze the terms, the phrases, and

  • get out a more detailed set.

  • So that's issuing extract.

  • It automatically pulls out the most discriminating terms in

  • that collection, and you usually have

  • to edit it a little.

  • That's what I did here.

  • Then you can take those back and browse again.

  • It's not working.

  • Oh, did it go?

  • Yeah, I'm sorry.

  • There it is.

  • You got more items. 22,000.

  • AUDIENCE: That's not necessarily good if you were

  • trying to narrow it down.

  • BRUCE SCHATZ: The problem was you narrowed it down too much.

  • You didn't actually get the articles about behavior

  • maturation because a lot of them didn't say it.

  • What you want to get is all of the things that might be

  • interesting and then narrow it down.

  • So that first one was trying to expand it a little bigger.

  • It was doing sort of a semantic

  • version of query expansion.
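
One simple way to sketch that extract step (an assumed scoring scheme, not the system's actual linguistics): score each term by how much more frequent it is in the retrieved subset than in the background collection, and offer the top scorers back for query expansion.

```python
from collections import Counter

def discriminating_terms(subset_docs, background_docs, k=5):
    """Rank terms by how strongly they distinguish the retrieved
    subset from the background collection; the top terms become
    candidate expansions for the user's query (add-one smoothing
    on the background count avoids division by zero)."""
    sub = Counter(t for d in subset_docs for t in d)
    bg = Counter(t for d in background_docs for t in d)
    score = {t: sub[t] / (1 + bg[t]) for t in sub}
    return sorted(score, key=score.get, reverse=True)[:k]
```

Terms common everywhere score low; terms concentrated in the subset score high, which is the behavior a user wants when "behavioral maturation" alone missed most of the relevant articles.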

  • Now the problem is this one is too many to actually look

  • through, and now I'm going to go back the other way and sort

  • of answer the question that was asked before.

  • So this is automatically taking that collection, and

  • while you wait, it's breaking it into a number

  • of different regions.

  • Here it's about 20-- some of them are off the page.

  • And the regions tend to be--

  • they're sort of these small worlds regions that tend to be

  • tightly on the same topic.

  • It's kind of hard to describe what the topics are

  • because they're automatic, they're not based on some

  • well-defined term, but they tend to cluster together well.

  • The thing to notice is this was done while you wait, even

  • with this small server.

  • So the collection was made on the fly, and this mapping was

  • done on the fly.

  • This is an interactive operation.

  • The pre-computation wasn't about this.

  • You didn't have to can this.
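
The on-the-fly mapping into regions can be illustrated with a greedy grouping of documents by phrase overlap. This is a minimal stand-in with an assumed Jaccard threshold; the actual clustering is more sophisticated, but the shape of the computation is the same.

```python
def cluster_regions(docs, threshold=0.3):
    """Greedily group documents (each a list of phrases) into
    regions: a document joins the first existing region whose
    representative shares enough phrases with it (Jaccard
    similarity), otherwise it starts a new region."""
    regions = []
    for doc in docs:
        phrases = set(doc)
        for region in regions:
            rep = region[0]  # first member acts as representative
            overlap = len(phrases & rep) / len(phrases | rep)
            if overlap >= threshold:
                region.append(phrases)
                break
        else:
            regions.append([phrases])
    return regions
```

A single pass like this is cheap enough to run interactively on a freshly made collection, which is the point: nothing about the regions was precomputed.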

  • So we take this particular region, and now

  • we're going to operate--

  • this is that, just that one cluster.

  • Now we're going to save it.

  • And see, now it's a fully fledged space just like all

  • the previous ones.

  • So we were sort of navigating through space, we found this

  • little collection is what we want, and now we're making it

  • into a space ourselves.

  • This one is now well-defined, it's about behavioral

  • maturation in a large collection about insects, and we wanted to look

  • at multiple organisms. So now we're going to start doing

  • space algebra.

  • We're going to start taking this and merging

  • it with other things.

  • So here I took the new space I just made and I'm intersecting

  • it with an old space.

  • Currently, that's just finding the documents in common, but

  • we're now working on fancier data mining to

  • try to find other patterns.

  • So here's the 21 that have that feature.

  • If you look at this article, this article is, in fact,

  • about some basic receptor in Drosophilidae, which is

  • the fruit fly, an insect, but it's about--

  • well, I'm sorry it's not fishes.

  • Marine crustaceans are like lobsters.

  • But it's something that lives in the sea.

  • Since you now found something at the intersection of those

  • two, what you really wanted to do was describe the genes.

  • Here you can point to this gene.

  • This was entity-recognized automatically in green, and

  • tried to summarize it.

  • So here it's summarized.

  • You can see the summary parts, however the problem is this

  • particular intersected space has hardly any

  • documents in it.

  • So there's not very much to summarize.

  • You did get the right gene, but you didn't summarize it

  • against the useful space.

  • What you want to do is go switch this term over into

  • this other space, into the Drosophilidae space, which has

  • like 50,000 articles, and then summarize it again.

  • So here's an article that has it in it.

  • This one you can see has more entities

  • automatically selected.

  • Then here's the gene summary against that space, again,

  • done on the fly.

  • So this is a general summary facility that if you have an

  • entity and you have a space, so you have a specific term

  • and a collection, you want to see what's known about it in

  • that collection.

  • This is a type of summary you can do while you wait.

  • It's a scalable one.

  • You can break it into a well-known category.

  • You can rank order the sentences in those particular

  • categories.
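
A rough sketch of that summary step (hypothetical keyword-based categories; the real categorizer is semantic): collect the sentences in the space that mention the entity, bucket them into well-known categories, and rank each bucket's sentences.

```python
def summarize_entity(entity, sentences, categories):
    """Group sentences mentioning `entity` into categories by
    keyword match, then rank each category's sentences by how
    many of its keywords they contain."""
    summary = {}
    for cat, keywords in categories.items():
        hits = []
        for s in sentences:
            if entity in s:
                score = sum(1 for k in keywords if k in s)
                if score:
                    hits.append((score, s))
        # Highest-scoring sentences first within each category.
        summary[cat] = [s for _, s in sorted(hits, reverse=True)]
    return summary
```

Because only sentences in the current space are consulted, summarizing the same gene against a bigger space (like the 50,000-article Drosophilidae one) gives a correspondingly richer summary.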

  • This is kind of like a news summary, but it's a semantic

  • type of news summary.

  • Then if you went into that, you would see that there are lots

  • of entities recognized here.

  • All the things in green were done automatically, and if you

  • pointed to these you would then summarize those in this

  • space, or you can go off to another one and summarize it.

  • So I want to just say each one of these main features was

  • done dynamically on new collections, while you wait.

  • And you can basically expand the searches, you can take a

  • search that's too big and break it into pieces, you can

  • make a new space and do algebra, do intersection on

  • it, or if you find a particular entity, you can

  • summarize it in different ways.

  • Those are examples of the kinds of things that you can

  • all do automatically.

  • So the message is these are all general, and if you have

  • to do biology you sort of work up towards the inter-space

  • where you're intersecting all the spaces using these sets of

  • ones, by doing birds and bees and pigs and cows and brains

  • and behavior.

  • These are actually all projects I'm working on.

  • I work in the genome center where this

  • project is going on.

  • So it is a birds and bees in pigs and cows

  • project in some respect.

  • Let me now conclude to allow some time for questions by

  • just saying this is actually quite a different world.

  • It's not pile all the world's knowledge in one big place.

  • It's have many small little ones, including ones that are

  • sort of dynamic communities that are made on the fly.

  • And because of that every person that's doing it is

  • actually doing just about everything there--

  • indexing it, using the system, they're making new

  • collections, they're authoring materials themselves.

  • And the system itself could be occurring

  • all in one big server.

  • And ours, of course, does.

  • But it could also occur in many small places.

  • It's a very small, localized kind of system.

  • My guess is if you had to do this on 10 trillion, which is

  • what's going to be true in a decade on the web, then you

  • wouldn't have four or five big servers that cover the world.

  • What you'd have is at the end of every block, or you'd have

  • a hierarchy like the telephone network used to where you'd

  • have servers that actually handled each set of spaces

  • that they were doing live manipulation against. It's

  • quite a different world.

  • It's much more like the virtual worlds that the kids

  • today wander around in.

  • Maybe you all are a little bit too old to spend all your time

  • on Neopets or even on Second Life.

  • So I promised I would end with what's a grand

  • project you could do.

  • So one grand project you could do is take some set of people,

  • like university is very convenient because you can

  • force undergraduates to do just about anything.

  • If you're at the University of Illinois,

  • there's 35,000 of them.

  • There's quite a few of them.

  • There's a few less because some of them came here.

  • And you capture all the text, the library and the courses

  • actually where--

  • our library has just gone into the Google Books program, and

  • all the context which tries to do the relationships,

  • partially by live with this kind of system,

  • and partially by--

  • well, I guess this is actually OK to say but, if you gave

  • everyone at the University of Illinois free

  • Gmail and a free Gphone--

  • I guess the Gphone isn't announced yet, but there's

  • lots of rumors on the web that there will be one.

  • Anyway, if you gave everybody a free email and phone and

  • said with the proviso that we're going to capture

  • everything you ever do and we're going to use it for good

  • purposes, not selling you ads, but trying to relate things

  • together to help you understand the context, then

  • the university would be delighted because they'd like

  • to educate, not just the people on campus, but people

  • all over the world and make money charging them tuition.

  • People at Google might be delighted because normally you

  • couldn't do this experiment because you would get sued out

  • of existence, even with your lawyers I would guess, if you

  • tried to surreptitiously capture all the VoIP that was

  • coming out of Gphone.

  • That's not proposed is it?

  • I've had people tell me the University of Illinois might

  • refuse to have it done.

  • But if the undergrad takes it, that's the deal, right?

  • They take it as long as we're going to record everything.

  • You might really be able to build a semantically-based

  • social network, so you're not sharing a YouTube video

  • because it's got the same little tag on top of it, but because

  • of real, deep, scalable semantics underneath.

  • So that's all I have to say, and I did promise I would put

  • some bees at the end.

  • So someday we will do hive mine, and it probably will be

  • in your guys' lifetime, but not in mine.

  • That's all I have to say.

  • Thank you.

  • [APPLAUSE]

  • Question, yes?

  • AUDIENCE: I was wondering--

  • [SIDE CONVERSATION]

  • AUDIENCE: I was wondering could you use the semantic

  • relationships that you've built up to debug

  • the language itself?

  • In other words, create some kind of metric that detects

  • whether the description or the expression of a particular

  • concept is coherent or incoherent, and essentially

  • flag places where the terminology is insufficiently

  • expressive.

  • BRUCE SCHATZ: Could you hear the question or

  • should I repeat it?

  • OK.

  • The question was can you regularize the language since

  • you're now detecting all these patterns?

  • That's actually been done quite a bit with tagging to

  • quite a large degree of success.

  • So the reason that our digital library project succeeded and

  • the one at Elsevier, which was a big publisher failed, is we

  • had a set of programs that went through and automatically

  • cleaned up the tagging, the structure tagging, that was

  • coming back from the publishers that the authors

  • had provided, and then sent corrective information to the

  • authors telling them what they should have done.

  • But the things that went into our system were

  • the cleaned up ones.

  • It's what data mining people call cleaning the data.

  • It is true that the more regular things are the better

  • they work, so that if you tried to do a chat session,

  • like IM text messaging, it would work much worse than it

  • did with biology literature, which is much more

  • regularized.

  • The general experience with these kinds of systems is that

  • people are much better at hitting the mark than

  • computers are at handling variability.

  • So it's kind of like those handwriting recognizers where

  • you learned how to write [UNINTELLIGIBLE].

  • So my guess is that yes, the users are trainable.

  • And if I tried to do this with undergrads, I would certainly

  • do things like fail people that got in too many-- you

  • know, it's like if your programs don't parse correctly

  • then you don't get a passing grade.

  • It's a problem though.

  • The more regular the world is, the better this brand of

  • semantics does.

  • Is there another question?

  • Yes.

  • AUDIENCE: I will start with a simple practical question.

  • When I go to PubMed and ask for references including

  • phytic acid, it knows that phytic acid is inositol

  • hexakisphosphate.

  • Is there any automation in that process, or is that just

  • a laborious transcription process on the

  • part of a human being?

  • BRUCE SCHATZ: OK, if you're asking what PubMed does, the

  • answer is they have a big translation table with all

  • those wired in.

  • It's because they're a large organization

  • with a lot of libraries.

  • They're actually able to provide a large set of common

  • synonyms to things.

  • If you have an automatic system it can't do that.

  • Well, actually ours is sort of a hybrid system.

  • Ours actually uses the synonyms like that that PubMed

  • has as a boost to finding equivalent ones.

  • If you're not able to do that, there's a whole set of

  • linguistic processing that tries to find things that are

  • synonyms to different degrees of success.

  • It looks for things that are in the

  • same slots and sentences.

  • It looks for equivalent sentences that had different

  • subjects that were used the same.

  • It uses the ways that acronym expansions are commonly done.

  • There's a set of heuristics that work some of the time,

  • maybe two-thirds of the time in regularized text like this.
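
One of those heuristics, initials-based acronym matching, is easy to sketch (a deliberately strict baseline; real systems allow looser matches, skipped stopwords, and partial initials):

```python
def acronym_matches(acronym, phrase):
    """Check whether `phrase` is a plausible expansion of
    `acronym` by spelling out the initial letters of its words,
    e.g. 'JH' against 'juvenile hormone'."""
    initials = "".join(w[0] for w in phrase.lower().split() if w)
    return initials == acronym.lower()
```

Heuristics of this kind are exactly the "works maybe two-thirds of the time" category: good enough to boost recall in regularized text, not good enough to replace human-curated synonym tables like PubMed's.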

  • But they're not perfect in the way-- the ones you're seeing

  • are all human generated, and that's why they're so good.

  • You will always use human generated ones if you could,

  • and in fact, it's very likely when I give a more popular

  • version of this kind of talk, what people point out is even

  • though the kids on the block that maintain the cat--

  • the one about cats.

  • You know, the small, specialized collections.

  • Even though they're not willing or probably not able

  • to do semantic mark-up, they are able to do

  • lots of other creation.

  • They are able to show typical sentences, they are able to do

  • synonyms. And there may be a lot of value added that comes

  • in at the bottom that improves each one of these community

  • collections.

  • I expect that that will be a big, big area when it becomes

  • a big commercial thing.

  • You'll need to have the users helping you to provide better

  • information, by better context.

  • Yes, Greg?

  • AUDIENCE: Remember-- don't go all the way back--

  • I remember the slide about functional phrases, and it

  • seemed that in the three examples that were on that

  • slide, there were of the form, something I might call a

  • template predicate.

  • In other words, A template relates to B. You seem to be

  • saying that the system automatically derives those

  • templates from analyzing the text.

  • Is that correct?

  • BRUCE SCHATZ: That is correct.

  • AUDIENCE: So my question then is this.

  • Can you compare and contrast that technique of producing

  • templates to two other things.

  • Number one, the system that the Cyc guys did to make--

  • BRUCE SCHATZ: [INAUDIBLE].

  • AUDIENCE: --to make predicates, but starting from

  • a different point and ending in a different point, although

  • they have predicates.

  • That's comparison number one.

  • Comparison number two is with respect to, let me just call

  • them template predicates for lack of a better word.

  • If you have those and you created them solely from

  • deriving them from text, then you

  • don't have world knowledge.

  • You basically have knowledge that just

  • came from the documents.

  • It seems to me that getting from the one to the other is

  • what Cyc was trying to do, but I understand that since they

  • were doing it by hand they abandoned that and they're now

  • trying to do automatic techniques.

  • So that thread of thought seems to be in the same

  • ballpark as what you're trying to do here, but with a

  • different approach.

  • I was wondering if you can compare and contrast, and

  • maybe there's a third area of endeavor trying to get to that

  • next step up that maybe you could educate us about.

  • BRUCE SCHATZ: Yeah.

  • That is a very, very good comment.

  • For those of you that don't know what Cyc is, C-Y-C. It

  • was a very ambitious attempt at MCC to try to encode enough

  • common sense knowledge about all of the world so that it

  • could automatically do this kind of thing.

  • As Greg said, it was largely a failure.

  • So let me sort of say what the spectrum of possible things is

  • as a longer answer.

  • Am I running over my time?

  • Is it OK?

  • MALE SPEAKER: It's lunchtime for a lot of these people.

  • Let's say another five minutes and then we'll formally break,

  • and then people who want to hang out it's OK, we got it.

  • [SIDE CONVERSATION]

  • BRUCE SCHATZ: I usually get lunch people by saying there's

  • free food, but that doesn't work here.

  • AUDIENCE: We all work for food.

  • BRUCE SCHATZ: You all work for food.

  • So Greg asked a very good question about where's the

  • line in automaticness.

  • Well, the old way of solving this problem used to be you

  • had a fixed set of templates.

  • What that essentially hit a wall with is each small

  • subject area needed a different set of templates,

  • and it was a lot of work to make the templates.

  • So then there were a set of people that said if you had

  • a small amount of basic world knowledge, you

  • wouldn't need the templates, you could automatically make

  • training examples.

  • The problem is that that could rarely, only in very isolated

  • cases, do even as good a tagging as what I am showing.

  • What most of the people do now and what most of the examples

  • that I was showing are is a human comes up with a set of

  • training examples of what are typical sentences with genes

  • in them in this particular subject domain.

  • Then the system infers, exactly as you said, the

  • system infers what the grammar is, what the slots

  • are going to be.

  • There's a few people experimenting with two

  • automatic things, and they don't work at present, but my

  • belief is in the next year or two you'll see research

  • systems with it.

  • If you had a concerted commercial effort after it you could

  • probably do it and get away with it, it just wouldn't work

  • all the time.

  • They're essentially either trying to automatically make

  • training sets, so you start out with the collection and

  • you try to pull out sentences that clearly have some slots

  • in them and then just infer things from that.

  • Or they try to automatically infer tags, infer grammar.

  • So you know some things, like you know body parts and you

  • know genes, and the question is can you infer behavior,

  • because you know in slots, you already have slots in the

  • particular subject domain.

  • My feeling is one of those two will work well enough so that

  • you can use it automatically and it will always do some

  • kind of tagging.

  • It won't be as accurate as these, which

  • are generally correct.

  • And it could either just be left as is, so it's like a

  • baseline of everything is tagged and 60% of them are

  • correct and 30% of them are ridiculous, but

  • 60% buys you a lot.

  • Or they could be the input to humans generating it.

  • So the curators I'm working with in biology, we already

  • have a couple pieces of software that do this.

  • They don't have to look at all the

  • sentences in all the documents.

  • We give them a fixed set of sentences that are sort of

  • these are typical sentences that might be ones you'd want

  • to look at. And then they extract things out.

  • So there's a human step

  • afterwards that does a selection.

  • Almost all the statistical--

  • I went kind of fast through it-- but almost all the

  • statistical programs don't produce correct answers.

  • They produce ranked answers where the top ones are sort of

  • in the next band.

  • My expectation is the tagging will be like that.

  • So the practical question is which things are good for

  • which kind of text.

  • So I guess we have time for another question

  • if anyone has one.

  • MALE SPEAKER: You actually had another one because you

  • started with your easy one.

  • BRUCE SCHATZ: Should we take should someone else?

  • MALE SPEAKER: Let me just suggest, because it is getting

  • close to lunchtime, let me suggest

  • one last basic question.

  • Given all the information about bees, have you been able

  • to figure out why they're disappearing?

  • BRUCE SCHATZ: It turns out actually we have a summer

  • workshop on exactly that topic.

  • And the answer is, like most things about

  • bees, nobody knows.

  • MALE SPEAKER: So much for that idea.

  • OK, well thank you very much, Bruce.

  • We appreciate the time.

  • Those of you who want to hang out, Bruce has time to stay

  • this afternoon.

  • We can all have lunch together.

  • BRUCE SCHATZ: I'm generally hanging out today and tomorrow

  • morning, and there's a lot of stuff about the system up on

  • the BeeSpace site, which you're welcome to look at.

  • And the slides are also going to be made available if you

  • want to flip through them.

  • Thank you everyone for staying through the whole thing.
