MALE SPEAKER: This is my attempt to increase the
sartorial quotient of Google, and it hasn't worked at all.
On the other hand--
well, I noticed you have a coat on, that's true.
Greg Chesson gets two points for showing up with a coat.
It's a real pleasure to introduce Bruce Schatz to you.
I've known Bruce for rather a long time.
My first introduction to him came as we both began getting
excited about digital libraries and the possibility
of accumulating enormous amounts of information in
digital form that could be worked on, manipulated by,
processed through software that we hope would augment our
brain power.
So Bruce has been in the information game for longer
than he's actually willing to admit I suspect.
He's currently at the University of Illinois,
Champaign-Urbana.
As you will remember, that's also the area where the
National Center for Supercomputer
Applications is located.
Bruce was around at the time when Marc Andreessen was doing work on the first browsers, the Mosaic versions of the browsers derived from Tim Berners-Lee's work.
Actually, the one thing that Bruce may not realize he gets
credit for is teaching me how to pronounce
Caenorhabditis elegans.
I looked at it before and I couldn't figure out, and maybe
I didn't even say it right this time.
But this is a tiny little worm that consists of 50 cells.
It was the first living organism that we actually
completely sequenced the genome for.
Then we got interested in understanding how does the
genome actually reflect itself as this little worm develops
from a single fertilized cell.
So Bruce introduced me to the idea of collecting everything
that was known about that particular organism, and to
turn it into a database that one could manipulate and use
in order to carry out research.
Well, let me just explain a little bit more about his
background and then turn this over to him, because you're
here not to listen to his bio, but to listen to
what he has to say.
He's currently director of something called CANIS--
C-A-N-I-S. I thought it had to do with dogs
until I re-read it.
It says Community Architectures for Network Information Systems.
BRUCE SCHATZ: That's why they let me in the building.
MALE SPEAKER: I'm sorry.
BRUCE SCHATZ: That's why they let me in the building.
MALE SPEAKER: Because along with the other
canines that are here.
It's at the University of Illinois, Champaign-Urbana,
and he's been working on federating all the world's knowledge, just like we are, by building pioneering research
systems in industrial and academic settings.
He's really done a lot of work over a period of 25 or 30
years in this domain.
The title of the talk uses the term telesophy, which he introduced as a project at Bellcore in the 1980s.
Later on, he worked at UIUC on something called DeLIver, D-E-L-I-V-E-R, and now more recently on semantics.
That's the reason that I asked him to come here.
He's working on something called BeeSpace, which is
spelled B-E-E, as in the little buzzing organism.
This is an attempt, as I understand it, but I'm going to learn more: an attempt to take a concept space and
organize it in such a way that we can assist people thinking
through and understanding more deeply what we know about that
particular organism.
So this is a deep dive into a semantic problem.
So I'm not going to bore you with any more biographical
material, except to say that Bruce has about nine million
slides to go through, so please set your modems at 50
gigabits per second because he's going to have to go that
fast to get through all of it.
I've asked him to leave some time at the end for questions.
I already have one queued up.
So Bruce, with that rather quick introduction, let me
thank you for coming out to join us at Google and turn
this over to you to teach us about semantics.
BRUCE SCHATZ: Thank you.
I have one here, so you can just turn yours off.
Thank you.
I was asked to give a talk about semantics, which I
supposedly know something about.
So this is going to be both a talk that's broad and deep at
the same time, and it's going to try to do something big and
grand, and also try to do something deep that you can
take away with it.
So that may mean that it fails completely and does none of
those, or maybe it does all of those.
I've actually been giving this talk for 25 years and--
now, of course, it doesn't work.
Am I not pointing it in the right place?
I'm pushing it but it's not going.
Oh, there it goes.
OK, sorry.
Can you flip it back there?
Sorry about that.
Small technical difficulty, but the man behind the curtain
is fixing it.
So I gave this talk first more than 20 years ago in the hot
Silicon Valley research lab that all the grad students
wanted to go to, which was called Xerox PARC.
I think a few people actually have heard of Xerox PARC.
It sort of still exists now.
We went down completely?
There we go.
Thank you very much.
I was pushing this idea that you could federate and search
through all the world's knowledge, and the uniform reaction was, boy, that would be great, but it's not possible.
And I said, no, you're wrong.
Here, I'll show you a system that searches across multiple
sources and goes across networks, and does pictures
and text and follows links, and I'll explain each piece
about how it works.
Then they said, that's great, but not in our lifetime.
Well, 10 years later was Mosaic and the web.
And 20 years later I'm delighted to be here, and all
of you have actually done it.
You've done all the world's knowledge to some degree.
What I want to talk about is how far are you and what you
need to do before you take over the rest of the world and
I die, which is another 20 years. So what's going to happen in the next 20 years?
The main thing I'm going to say is a lot's happened on tele, but not too much on sophy. So you're halfway to the hive mind, and since I'm working on honey bees, at the end you will see a picture of honey bees and hear something about hive minds, but it will be very short.
Basically, if you look at Google's mission, the mission
is doing a lot about access and organization of all the
world's knowledge.
Actually, to the degree that's possible, you do an excellent job of that.
However, you do almost nothing about the next stages, which
are usually called analysis and synthesis.
Solving actual problems, looking at things in different
places, combining stuff and sharing it.
And that's because if you look at the graph of research over
the years, we're sort of here, and you're doing commercially
what was done in the research area about 10 years ago, but
you're not doing this stuff yet.
So the telesophy system was about here.
Mosaic was about to here.
Those are the things-- searching across many sources, like what I showed-- that were really working pretty well in research labs with 1,000 people. They weren't working with 100 million.
But if Google's going to survive 10 more years, you're
going to have to do whatever research systems do here.
So pay attention.
This doesn't work with students.
With students I have to say I'm going to
fail you at the end.
But you have a real reason, a monetary reason, and a moral
reason to actually pay attention.
So back to the outline.
I'm going to talk about what are different ways to think
about doing all the world's knowledge, and how to go
through all the levels.
I'm going to do all the levels and sort of say you are here,
and then I'm going to concentrate on the next set of
things that you haven't quite got to.
The two particular things I'm going to talk about are scalable semantics and concept navigation, which probably
don't mean anything to you now, but if I do my job right,
45 minutes, actually now 10 of them are up, so 35 minutes
from now they will mean something.
At the end I'm going to talk about suppose you cared about
this enough to do something, what kind of big thing would
you actually do?
I sort of do these big, one-of-a-kind pioneering projects with stuff that doesn't quite work, just to show it's really possible.
So the overall goal, which you probably all grew up reading about in cyberspace novels, is sort of plugging in your head and being one with all the world's knowledge.
Trying to sort of get the concepts in your head to match
whatever is actually out there in a way that you can
get what you want.
The problem is over time what the
network can do has increased.
So in the--
I can't say the old days, man--
in the good days, people worked on packets and tried to
do data transmission.
The era that I sort of worked mostly in was an object era, where we tried to give the information to people to do, [UNINTELLIGIBLE] to do pictures. All the action in big research labs now is on concepts, on trying to do deeper things, but they still have to work like these did.
They work everywhere.
So you don't have a specialized AI program that
only works for income taxes.
That's not good enough.
No Google person would ever do something that only works in
one case, unless there was a huge amount of
money behind it.
I'll stop making money comments, but the food is
great here.
So this is one common layout, and there's four or five
others, which in the absence of time, I will omit.
But if you want to talk to me afterwards, there's lots of
points of view about how to get from here to there, where
there is always all the world's knowledge, and here is
whatever you can do now.
Depending on what point of view you take, it's possible
to go to the next step differently because you have a
different orientation.
So the one that I'm going to do in this talk is the
linguistic one, which usually goes syntax, structure,
semantics, pragmatics.
So syntax is what's actually there, like an actual set of
bits in a file, a set of words in a document.
Structure is the parts, not the wholes. So if you parse something into structure, you can tell that
this particular thing is a person's name, this is the
introduction to a paper, this is the methods part.
You can tell what the parts are and you can search those
differentially.
Semantics is when you go inside and you try to get something about the meaning. As you'll see, people have pretty much given up on doing real meaning, and rather than meaning, they try to do context.
What's around it in a way that helps you understand it.
Actually, when Google was a research project, the people that started it were on the Stanford Digital Library Project-- I was running the Illinois Digital Library Project at the same time-- and they said there's enough context in web links to be able to really do something.
There were a lot of people that said no, web links are
made for all sorts of things, and they don't have any
semantics, and they're not useful at all.
But obviously, they were wrong enough to make this building
and employ all of you.
The real goal is down here in doing actual reality, in dealing with so-called pragmatics. Pragmatics is sort of when you use something. So it's task dependent. The meaning of something is always the same. So if this is a gene that regulates cancer, it always does that.
But lots of times, the task you're working on varies: what you're interested in, what you know. I'm not going to say very much about pragmatics because people haven't gotten very far on it in terms of doing it at a big, grand scale.
But I actually know quite a bit about it.
If you really wanted to solve health care, for example,
you'd have to go down the pragmatic route and try to
measure people with as large a vector as you
can possibly get.
And again, if people are interested, that's a topic I'd
be happy to talk about, but it's off this particular talk.
This particular talk is about federation, as I said.
So what does it mean to federate each
one of those levels?
So syntax federation, which is what the telesophy system pioneered, and for the most part what Google does in the sense of federating all the web sources that are crawled, essentially tries to send the same query into every different place.
So true syntax federation, which is actually what
telesophy did, but not really what Google does, is you start
at your place and you go out to each one of the sources, and you have to remember where they are on the network.
They might go up and down, and so you might
have to retry them.
And you have to know what syntax the queries need.
And when the results come back, you have to know how to
handle that.
You have to do a lot about eliminating duplicates when
the results come back.
So a very common problem is you send out a query to try to
get a certain Beatles song, and you get back 5,000 of
them, but they're all slightly different, and they're in
different languages and they have different syntax.
Merging those all together is really complicated.
So that's what syntax federation is.
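To make that concrete, here is a minimal Python sketch of syntax federation in the spirit he describes: fan the same query out to several sources, retry the ones that go up and down, translate the query into each source's syntax, and merge the near-duplicate results that come back. The source interfaces and the title-based duplicate key are hypothetical simplifications; real merging, as he says, is far messier.

```python
# Minimal sketch of syntax federation. All source details are
# hypothetical simplifications.
from typing import Callable

def federated_search(query: str,
                     sources: dict[str, Callable[[str], list[dict]]],
                     translators: dict[str, Callable[[str], str]],
                     retries: int = 2) -> list[dict]:
    results: list[dict] = []
    for name, search in sources.items():
        native = translators[name](query)   # each source has its own syntax
        for _ in range(retries + 1):
            try:
                results.extend(search(native))
                break
            except ConnectionError:         # source may be down; retry
                continue
    # Crude duplicate elimination: collapse records whose normalized titles
    # match. Real merging (5,000 slightly different Beatles songs) needs
    # far fuzzier matching than this.
    seen: set[str] = set()
    merged: list[dict] = []
    for record in results:
        key = "".join(record.get("title", "").lower().split())
        if key not in seen:
            seen.add(key)
            merged.append(record)
    return merged
```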
Structure federation is what this did-- DeLIver was the DLI, the Digital Library Initiative project that I ran at the University of Illinois. It was about engineering literature; it went out to 10 major scientific publisher sites on the fly and allowed you to do a structured query.
So you could say find all the papers in physics journals
that are within the last 10 years that mention nanostructures in the figure caption or in the conclusion. So you're using the parts of the papers in your search.
And at least scientists make a great deal of
effort in doing that.
In order to do that, you have to figure out some way of
making the mark-up uniform.
So you have problems that you just started to see in the syntactic world, like who's an author?
If you have a physics paper that has 100 authors, which
one of them is the author?
It might not be any of them actually, it might be the
organization that did it.
Or if you have a movie, who's the author of a movie?
Is it the producer, the writer,
the star, the director?
So there's a lot of problems there in how you do the
mark-up uniformly and how you make
different values the same.
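A minimal sketch of structure federation under those assumptions: each publisher's mark-up gets mapped to a shared set of part names, and the query runs against a named part. The field maps, record shapes, and part names here are all hypothetical.

```python
# Sketch of structure federation: map each publisher's mark-up onto
# shared part names, then query by part.
FIELD_MAP = {
    "publisher_a": {"fig_caption": "caption", "concl": "conclusion"},
    "publisher_b": {"caption_text": "caption", "conclusions": "conclusion"},
}

def normalize(record: dict, source: str) -> dict:
    """Rename source-specific fields to the canonical part names."""
    return {FIELD_MAP[source].get(k, k): v for k, v in record.items()}

def structured_query(records: list[dict], source: str,
                     part: str, term: str, since_year: int):
    """E.g. papers since `since_year` mentioning `term` in a given part."""
    for rec in records:
        doc = normalize(rec, source)
        if (doc.get("year", 0) >= since_year
                and term.lower() in doc.get(part, "").lower()):
            yield doc

hits = structured_query(
    [{"fig_caption": "TEM image of nanostructures", "year": 2001}],
    "publisher_a", part="caption", term="nanostructures", since_year=1997)
print(list(hits))
```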
For the most part, structure has not made it into mass
systems yet, although there have been a lot of attempts to
try to make languages for structure, like the semantic web that Vint and I were talking about beforehand.
But the amount of correctly marked-up structured text is
very small right now.
So if you were going to use it to search the 10 billion items
that you can crawl on the web now, you
wouldn't get very far.
Semantics federation, which is what I'm going to talk about
today mostly, is about a completely different topic.
It's about going inside and actually looking at the
phrases and figuring out the meaning, as much of the
meaning as you can.
And then when you have many small pieces, trying to match
something that's the same here to something the same here.
And doing that uniformly is the job of semantics
federation.
So let me now go into the first of the
two technical topics.
So the first topic I'm going to do is how do you actually
represent the things, and that's going to be a little
slow going.
Then I'm going to give some examples of if you're able to
get this deeper level representation, this deeper
level structuring, what kind of system you can build.
It's in a somewhat specialized domain.
It's in biology and medicine, because well, if you're a
professor and you work at a university, that's where you
can get money to work on things.
You can't get money to work on the kind of things that are
arbitrarily on the web.
So scalable, so we're now into scalable semantics.
I've been using this for 10 years, and every once in a while
someone will stand up and say that's an oxymoron, it doesn't
make sense because semantics means really deep, and
scalable means really broad, and those pull in opposite
directions.
And I said yes, you understood what the problem is.
So in the old days, what it used to mean is--
what semantics used to mean is you do deep meaning.
So you had a deep structure parser that would go in and
figure out yes, this document was on operating systems that
only work on this class of computers, and only solved
this class of physics problem.
So it's on a very narrow, detailed topic.
There were many, many AI systems made that did that.
What happened when the government started putting
large amounts of money into it-- so most of this got
developed in the--
the base technology got developed in the DARPA TREC program, trying to read newspaper articles looking for what would now be called terrorists.
What they found basically is the deep
programs were very narrow.
If you trained something to recognize income taxes, or you
trained something to recognize high-powered rifles, it
wouldn't help at all in the next one.
And there were just too many individual topics to try to
pick out the individual types of sentences
and individual slots.
So what happened is the broad ones beat out the deep ones
when the machines got really fast. When it became clear, and I'll show you some machine curves, that you could actually parse noun phrases out arbitrarily, then people began using noun phrases.
When it became clear you could do what are called entities,
in other words, you could say this phrase
is actually a person.
This phrase is actually someone that lives in
California.
Then people started using it.
Basically what happened is semantics changed from being
we know everything about this particular topic and this phrase means one thing, it's meaning type 869, to we have 20 kinds of entities, and this is a gene, and it occurs with this other gene.
So we'll say if you search for this gene and it doesn't work,
you should search for this other one.
I'll show you lots of cases where that sort of guilt by
association really helps.
I'm not defending it necessarily as being real
semantics, I'm defending it as something that you can do
everywhere.
So the upshot is this is an engineering problem.
It's a question of if you could do deep parsing and say
yes, this person wasn't--
it's true they said they were interested in ice cream cones,
but they really meant pine cones when they said cone,
then you would do that.
But it's generally not possible to do that, except in
very isolated circumstances.
So you end up thinking globally, thinking about all
possible knowledge, but acting locally.
I guess this is a green building so I'm allowed to
make this kind of joke.
So you look at a small, narrow collection, and analyze the
context, what occurs with each other very precisely, and do
something there.
And that creates one good situation.
In other words, it means now you're able to go much deeper,
and I'll show you lots of examples of going much deeper.
But it creates one bad situation, which is
traditionally information retrieval works like Dialog did in my era, or like Google does now.
You take everything you can get, and pile it into one big
huge server farm, and then you search it.
You index it once in one big index and you search it.
Well, the problem is if you want to go deeper in semantics
that doesn't work, because you mixed
together too many things.
You have to unmix them, and then you have to worry about how to get from here to there.
So you change a central problem into a distributed
problem with all of the hard features that go with
distribution.
Here's what this is doing, if you want a physical analogy.
For many years I taught at a library school.
The way indexes work in the real world is for really big
topics like if you have electrical engineering,
there's a society that is big enough and well-defined enough
to employ people to tag every topic.
So they say here's an article about Windows, this one is
about operating systems. Here's an article about
Windows, this one is about heat conservation.
A person looks at that, and out of their selection of all the topics, they say which topics the things are on.
That worked fine as long as most of the information in the world was in this large but fairly small number of well-defined databases.
That's not the world we're living in now.
We're mostly living in this world.
So there still are a very large number of big formal
databases that are done by hand, but nearly all the
databases, nearly all the collections, are these
informal ones with communities or groups or individuals.
The advance of crawling technology that's been able to
take all these and collect them all together into one big
place has actually made the problem worse because now
there's not only apples and oranges and pears altogether,
but there's lots of things that aren't fruit at all and
aren't really anything, but they're in there.
So there's many different things that you don't know how
to deal with, and you have to do something
automatically with them.
It's not the case that you can get--
my daughter who keeps track of all the cats on the block and
has a website with their pictures, it's not the case
that you can get her to employ a professional curator from
the library school who will tag those correctly so that
someone who's a cat fancier in the next town can see them.
That's not true.
You need some kind of automatic support.
So I'm going to talk about the automatic support.
I'm doing OK for time.
There's two things.
I'm going to talk about entities and I'm going to talk
about concepts.
So here are entities.
What entities are is trying to figure out what type of thing
something is.
So one way is you have hand-tagged XML, like the
mark-up, like the semantic web.
So they take a particular domain and they say there are
20 types here, and we'll mark up each document correctly.
So if we're in humanities that might work pretty well.
This is a person, this is a place, this is a type of vase,
this is a time period in Roman history.
If you're out on the web, in that situation where 90% of the stuff is informal, then even if there was a systematic set of types, the people aren't going to do it.
So if you have well marked-up hand ones you're going to use
them, but if you don't then you have to
do something automatic.
The thing that tends to work automatically is to try to tag things by machine with training sets, and I'm going to say a little bit about what that means.
First you go into the document and you pull out the phrases.
So you don't do whole words.
And in fact, over time the experimental systems I've built have gotten better the more you can get away from words and change them into whole phrases, the equivalent phrases that work in that particular domain.
Right now search engines don't do that.
That's a big part of the problem.
Then you have to recognize the part of speech.
Is it a noun or a verb or an object?
Again, 10 years ago, you needed a specialized grammar
and it only worked in a particular subject.
Now there are machine learning algorithms trained on enough things that you can get very high accuracy, up in the high 90s, with parts of speech.
And in fact, there are actually systems-- and you can tell this was secretly funded by the CIA under some other name-- that recognize persons, places, and things pretty accurately.
So if you want to recognize newspaper articles and
automatically tag these correctly, it actually does a
pretty good job.
Again, commercial search engines tend not to use those.
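For illustration, here is a small sketch of the phrase-extraction step using an off-the-shelf part-of-speech tagger. NLTK is an assumption here, not what his systems used; it chunks adjective-plus-noun runs into candidate noun phrases.

```python
# Sketch of the phrase step with an off-the-shelf POS tagger. NLTK
# needs its 'punkt' tokenizer and perceptron-tagger data downloaded
# first (nltk.download).
import nltk

# Chunk adjective-plus-noun runs into candidate noun phrases.
GRAMMAR = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")

def noun_phrases(sentence: str) -> list[str]:
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = GRAMMAR.parse(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "NP"]

# Phrases, not whole words, become the units that get indexed.
print(noun_phrases(
    "The foraging gene encodes a cyclic GMP-dependent protein kinase."))
```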
So here's an example of entities in biology.
These won't mean very much, but they'll
give you the feeling.
Here's a kind of functional phrase.
A gene is a type of an entity and encodes a chemical.
So here's an example.
The foraging gene encodes a cyclic GMP-dependent protein kinase.
So this is one of the entities and this is the other entity.
In scientific language things are very regularized, so
there's lots of sentences that are actually that easy.
Or here's another one.
Chemical causes behaviors.
Here's one that's a little harder.
I tried to put one a little harder.
This one says gene regulates behavior, but that's not in
the sentence.
What's actually in the sentence is this gene, which is an ortholog of this other gene-- so it doesn't say directly, it says indirectly that it's a gene--
is involved in the regulation, which is not the same phrase
as regulates.
So you have to do a little bit of parsing to get a phrase
like gene regulates behaviors.
But the natural language technology is now good enough
to do that accurately.
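Here is a toy sketch of that kind of template-style extraction: surface patterns that map regularized scientific sentences onto relation triples, including the indirect "involved in the regulation of" phrasing. The patterns are illustrative stand-ins for trained recognizers, not his actual grammar.

```python
# Toy sketch of template-style relation extraction over regularized
# scientific sentences. Patterns are illustrative only.
import re

PATTERNS = [
    ("gene_encodes_chemical",
     re.compile(r"(?P<gene>\w+) gene encodes (?:an? )?(?P<chemical>[\w\- ]+)")),
    # Indirect phrasing ('involved in the regulation of') still maps to
    # the 'regulates' relation, as in the harder example above.
    ("gene_regulates_behavior",
     re.compile(r"(?P<gene>\w+).{0,60}involved in the regulation of "
                r"(?P<behavior>[\w ]+)")),
]

def extract_relations(sentence: str):
    for relation, pattern in PATTERNS:
        match = pattern.search(sentence)
        if match:
            yield relation, match.groupdict()

print(list(extract_relations(
    "The foraging gene encodes a cyclic GMP-dependent protein kinase.")))
```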
I did do a little bit of prep and looked at some of the commercial systems that were doing this.
If you want to ask a question about those
later I'll make a comment.
But they're all competitors, they're not Google, so I
didn't want to say up front.
The last comment I'm going to make about entities is they
come in different varieties.
That means that sometimes you'll do them and
sometimes you won't.
So there's some of them, and again,
these are biology examples.
There's some of them that are just straight lists, so the
names of organisms, like honey bee or fruit fly are almost
always exactly those same words.
So those are easy entities to tag very accurately.
Things like genes or parts of the body vary somewhat, but
there often are tag phrases that say this is the part of a
body and here it is.
It's a wing.
Or this is a gene and it's the foraging gene.
So there are often tags there.
If you get training sets you do pretty well.
Then there's really hard things like what kind of--
these are sort of functional phrases--
what kind of behavior is the honey bee doing?
What kind of function does the computer operate with?
Those ones are almost always different, so you need a
really big training set to do those accurately.
If you were going to try to do entities across all the
world's knowledge, you would have two problems. I think
that's the last thing I'm going to say on this, yes.
The first is you would have to try to make a run at the hard
ones, or at least say well, we're only going to do these
because that's all we can do uniformly.
The second thing is you have to realize that the entities
are different in each major subject area.
So the biology ones are not the same as the medicine ones,
which are more disease-like, and the medicine ones aren't
the same as the physics ones, and the physics ones aren't
the same as the grocery store ones.
My guess is there's a relatively limited number of popular ones, if you go back to the style of trying to classify all the web knowledge, like Yahoo!-- that used to be Yahoo!'s main strategy, for instance. There are a couple hundred really important ones and a couple thousand big ones.
So if you had enough money and enough expert teams, and set each one to making training sets, you could actually do entities all the way across.
A research project can't muster that except in one
small area.
That's all I'm going to say about entities.
Now, let me explain just a little bit about what you do
with entities, and then give a big example.
So what do you do with entities? You might think you're going to answer questions with them, and that's what the commercial systems are doing.
You can sort of answer questions, so you can say this gene seems to affect this behavior in this organism.
So you can say, what are all the things that affect foraging in insects, and get out lots of answers-- it's sort of like you have a relational table. You take a document and change it into a relational database.
You can answer that kind of question, but there's lots of
kinds of questions you can't answer.
What you can do, after you extract these entities, these units, is compute these context graphs.
You can see in this document how often do these two things
occur together.
That one you get a lot of mileage from, because if you
try to search for this one and you can't find it, you can
search for this other one.
Or if you're trying to search for this one and you can't
find it, you can go down the list of the ones it commonly
occurs with and it's sort of a suggestion facility.
People that watch search in libraries, what they typically
comment on is people don't know what words to try.
They'll try all the words they can think of and then they'll
start searching dictionaries or looking at other papers or
asking the people next to them.
So since you can automatically do suggestion by making this
graph of all entities that are related to all the other
entities in terms of how often they occur together in a
collection, then you can use it for suggestion.
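A minimal sketch of that suggestion facility: count how often entities co-occur within documents in a collection, then, when a search term fails, offer the terms it most often occurs with. The toy documents below are hypothetical.

```python
# Sketch of the suggestion facility: entity co-occurrence counts
# within documents, then rank a term's neighbors.
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(docs: list[set[str]]) -> dict[str, Counter]:
    graph: dict[str, Counter] = defaultdict(Counter)
    for entities in docs:                    # one entity set per document
        for a, b in combinations(sorted(entities), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

def suggest(graph: dict[str, Counter], term: str, k: int = 5) -> list[str]:
    """Terms that most often occur with `term`; try these when it fails."""
    return [t for t, _ in graph[term].most_common(k)]

docs = [{"foraging gene", "PKG", "honey bee"},
        {"foraging gene", "fruit fly"},
        {"PKG", "honey bee"}]
print(suggest(build_cooccurrence(docs), "foraging gene"))
```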
This is my computer engineering slide.
The other unusual feature about Google that people didn't predict is whether you could build a big enough supercomputer to handle 10 billion items. And Dialog would have said no, because IBM will not sell you that many platters and that big a machine.
Well, what they didn't realize was what the rise of PCs would
do if you hook things together and you could partition the
problem enough.
The research people hit that same curve a decade earlier.
So I was trying to do these relations--
this is my six or seven year history.
These are all how big a collection you can do and find
these entities and these relations basically on
workstations.
So this is like a Sun-2, and that's a Sun-3, and this is a
network of Sun-3's, about 10 of them.
This one is discovering the supercomputers at NCSA, where you could get 1,000 all at one time.
That made a big difference, and it meant-- in fact, this
was a big hero experiment.
It was the first supercomputer computation
in information retrieval.
For quite a while, it was the biggest computation that NCSA
had ever done.
They couldn't figure out why you'd want to integrate all
the world's knowledge.
Why would anybody want to do that?
I think in 1998, Google was probably about 10 employees.
So that question hadn't come up yet.
The number of articles in Medline was still much greater
than the number of articles on the web.
So here's what that computation was like.
It had about 280 million concepts, so that
number was big then.
It's now small.
However, if you fast forward to today, the machines are a lot faster, so the server I just bought for $25,000 has more memory than that supercomputer eight years ago.
These are big memory computations.
You can guess it's got a big matrix inside that has all the
phrases versus all the phrases and how often they occur.
So the more physical RAM you have the better.
What it turns out is, you're able to put a connection graph-- this is a graph of which terms are related to which other terms-- all in memory. And it's a nice graph, like a small-world graph, which looks kind of like this.
So there's a group here that's all sort of connected and
another group here.
So it comes in groups.
That tends to be true of just about any kind of text that
people have seen.
Then you can find all the inter-relations really fast
because you don't have to look at this one versus this one
because you know that you can stop here.
So there's a way of ending the propagation.
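One way to picture that early stopping, as a sketch: a bounded breadth-first walk over the in-memory term graph that only follows edges above a co-occurrence weight and gives up after a couple of hops, which stays cheap precisely because the graph comes in tight clusters. The thresholds are illustrative.

```python
# Sketch of bounded propagation over the in-memory term graph.
from collections import deque

def related_terms(graph: dict[str, dict[str, int]], start: str,
                  max_hops: int = 2, min_weight: int = 2) -> set[str]:
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        term, hops = frontier.popleft()
        if hops == max_hops:
            continue        # stop here: don't cross into distant clusters
        for neighbor, weight in graph.get(term, {}).items():
            if weight >= min_weight and neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return seen - {start}
```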
What that means is you can now do things on the
fly while you wait.
So you don't have to pre-compute the collections
anymore, which is what we had to do before.
You can do a search, make a new collection, and then make it semantic. Then make it deep.
You can cluster it on the fly into little chunks, you can
find these inter-related graphs while you wait.
And that's while you wait with a $25,000 server.
If you had something the size of Google, you could not only do all the world's knowledge, which isn't that much bigger proportionally, but you could also go deeper.
So that's why now I want to show you what a real
concepts-based system looks like so that you get some
feeling as to how different the interaction is.
Generally, there are two things in a space system.
One of them is called federation--
I've been talking about that before.
It's how do you go from one collection to another.
The other is called integration: if you have an entity, what can you go out to?
We're talking about going across collections, and I didn't mean to say this was replacing IP.
IP is under everything.
But what I meant is it's replacing words.
This was the first interspace system, the
one DARPA paid for.
There aren't words in this anymore.
When you point to this you get--
you get that whole phrase, simple analgesics, and all the
things that are equivalent to it phrase-wise after you do
all the linguistic parsing.
So it looks like a bunch of words, but it isn't.
It's a bunch of inter-related concepts and those are
uniformly indexed across all the sources.
So you can go from simple analgesics here to all the
concepts, all the phrases that it's nearby, to the ones that
are nearby there, to the documents, to which little
cluster it's in.
You can sort of go from concept to concept, the
concept across all the different sources.
The main reason I showed this was to just show that words
don't exist anymore.
You've got these deeper level things, which I tried to convince you earlier were possible to do.
Also because this DARPA project broke up in 2000, just before 9-11; DARPA yanked the plug and decided they didn't want to help analysts anymore.
Every person on the project went to work at Microsoft.
So it's entirely possible that Windows 2010 is going to have
all this stuff in it.
Yes, question?
AUDIENCE: That's the [INAUDIBLE].
BRUCE SCHATZ: Which one, this one?
AUDIENCE: The hexagon, [INAUDIBLE].
BRUCE SCHATZ: It is actually, and let me postpone that
because I'm going to answer it better later.
But basically, it's taking the document collection and
grouping it into individual groups of documents which have
a similar set of phrases in them.
This is just a bad graphical representation of it.
But I'll give a good example of it later.
So yes, it's quite meaningful, and you'll see in the session
what its utility is.
So what I'm actually going to talk about now in the last
five minutes of my talk is this BeeSpace system.
It is about honey bees, so you're allowed to have cute
pictures and cute puns.
Yeah, see?
The college students don't ever laugh at this, but I
always thought it was funny, bee-havior.
So much for that.
It must be an age thing.
So the point of this system is you make many, many small
collections, and you know something, and you want to use
your terminology and your knowledge to
go somewhere else.
So you want to go from molecular biology into bees,
into flies, into neuroscience.
So I'm working with the person that actually is the national
lead on the honey bee genome.
What it does inside is basically uses this scalable
semantics technology to create and merge spaces--
you'll hear a lot about spaces in the next five minutes, so I
won't explain right now--
to try to find stuff.
So it's concept navigation, concept abstraction-- finding things when you don't know what you started with.
Space is a paradigm, not a metaphor.
I hope I'm not offending any user interface people.
I'm not sure if Dan is still sitting in the back.
In other words, there really are spaces in there.
You take a collection, you make it into a space.
You can then merge two of them, you can pull out part of
it, you can break it into parts and make one part of
that the whole space.
So it's like you have all the world's knowledge and you're
breaking it into conceptual spaces which you can
manipulate.
You personally, plus you can share them with other people.
So it has quite a different character than you're trying
to do a search and you get a set of results back.
This particular one does do entities very universally, but
it only does concepts and genes because that's all this
subject area needed.
So please don't criticize that particular one.
It was chosen narrowly because we wanted to at least have one
that did those uniformly.
These are the main operations I'm now going to show you very
quickly through a session.
If you go to the BeeSpace site, which was on that bag-- it's beespace.uiuc.edu-- you can use the system and beat it to death, assuming you can read Medline articles, which you may or may not be able to.
So extract is going to take a space and figure out all the special terms that distinguish that space, and have a way of searching them.
Mapping is going to go back the other way.
It's going to take a space and break it into parts, and then
you can turn each one into a space itself.
This is space algebra, and this is the summarization.
If you find an entity, it does something with it.
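As a sketch of what the space algebra might look like underneath, treating a space as a named set of document IDs (real spaces also carry the extracted term statistics; this shows only the set operations):

```python
# Sketch of space algebra over named document sets. Hypothetical
# simplification of the BeeSpace operations.
from dataclasses import dataclass

@dataclass
class Space:
    name: str
    docs: set

    def intersect(self, other: "Space") -> "Space":
        """Documents common to both spaces, itself a full-fledged space."""
        return Space(f"{self.name} & {other.name}", self.docs & other.docs)

    def merge(self, other: "Space") -> "Space":
        return Space(f"{self.name} | {other.name}", self.docs | other.docs)

insects = Space("behavioral maturation in insects", {"d1", "d2", "d3"})
crustaceans = Space("marine crustaceans", {"d3", "d4"})
print(insects.intersect(crustaceans))   # analogous to the demo's intersection
```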
You probably don't care about this example, but it's looking at behavioral maturation.
It's looking at a honey bee as it grows up and takes on different societal roles. It takes care of the babies, it goes out and forages for food, and we're looking at that across different species.
So it's a complicated question.
It's not one that there's a well-defined answer to.
So now we're into the BeeSpace system, which is
running right now.
So you type behavioral maturation, you choose a particular space that was already made-- it's insects, about 100,000 articles-- and you do browse.
So that gets about 7,000 articles, which are here,
which is too much to look at.
The problem was behavioral maturation
wasn't the right term.
The first thing the system's doing is it's extracting.
It tries to go in and analyze the terms, the phrases, and
get out a more detailed set.
So that's issuing extract.
It automatically pulls out the most discriminating terms in
that collection, and you usually have
to edit it a little.
That's what I did here.
Then you can take those back and browse again.
It's not working.
Oh, did it go?
Yeah, I'm sorry.
There it is.
You got more items. 22,000.
AUDIENCE: That's not necessarily good if you were
trying to narrow it down.
BRUCE SCHATZ: The problem was you narrowed it down too much.
You didn't actually get the articles about behavior
maturation because a lot of them didn't say it.
What you want to get is all of the things that might be
interesting and then narrow it down.
So that first one was trying to expand it a little bigger.
It was doing sort of a semantic
version of query expansion.
Now the problem is this one is too many to actually look
through, and now I'm going to go back the other way and sort
of answer the question that was asked before.
So this is automatically taking that collection, and
while you wait, it's breaking it into a number
of different regions.
Here it's about 20-- some of them are off the page.
And the regions tend to be--
they're sort of these small worlds regions that tend to be
tightly on the same topic.
The topics are kind of hard to describe because they're automatic, they're not based on some well-defined term, but they tend to cluster together well.
The thing to notice is this was done while you wait, even
with this small server.
So the collection was made on the fly, and this mapping was
done on the fly.
This is an interactive operation.
The pre-computation wasn't about this.
You didn't have to can this.
So we take this particular region, and now
we're going to operate--
this is that, just that one cluster.
Now we're going to save it.
And see, now it's a fully fledged space just like all
the previous ones.
So we were sort of navigating through space, we found this
little collection is what we want, and now we're making it
into a space ourselves.
This one is now well-defined, it's about behavioral maturation in a large space about insects, and we wanted to look at multiple organisms. So now we're going to start doing
space algebra.
We're going to start taking this and merging
it with other things.
So here I took the new space I just made and I'm intersecting
it with an old space.
Currently, that's just finding the documents in common, but we're now working on fancier data mining to try to find other patterns.
So here's the 21 that have that feature.
If you look at this article, it is, in fact, about some basic receptor in Drosophilidae, which is the fruit fly, an insect, but it's also about-- well, I'm sorry, it's not fishes. Marine crustaceans are like lobsters. But it's something that lives in the sea.
Since you now found something at the intersection of those
two, what you really wanted to do was describe the genes.
Here you can point to this gene.
This was entity-recognized automatically, in green, and we tried to summarize it.
So here it's summarized.
You can see the summary parts, however the problem is this
particular intersected space has hardly any
documents in it.
So there's not very much to summarize.
You did get the right gene, but you didn't summarize it
against the useful space.
What you want to do is go switch this term over into
this other space, into the Drosophilidae space, which has
like 50,000 articles, and then summarize it again.
So here's an article that has it in it.
This one you can see has more entities
automatically selected.
Then here's the gene summary against that space, again,
done on the fly.
So this is a general summary facility that if you have an
entity and you have a space, so you have a specific term
and a collection, you want to see what's known about it in
that collection.
This is a type of summary you can do while you wait.
It's a scalable one.
You can break it into a well-known category.
You can rank order the sentences in those particular
categories.
This is kind of like a news summary, but it's a semantic type of news summary.
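A rough sketch of that kind of entity summary against a space: collect the sentences in the space that mention the entity, bucket them by a known category, and rank within each bucket. The categories, cue words, and scoring here are placeholder guesses, not BeeSpace's actual method.

```python
# Rough sketch of summarizing an entity against a space. Categories,
# cues, and scoring are placeholders.
from collections import defaultdict

CATEGORIES = {
    "expression": ("expressed", "expression"),
    "regulation": ("regulates", "regulation"),
    "behavior":   ("foraging", "behavior"),
}

def summarize(entity: str, sentences: list[str], per_category: int = 2):
    buckets: dict[str, list] = defaultdict(list)
    for sentence in sentences:
        if entity.lower() not in sentence.lower():
            continue                     # only sentences about the entity
        for category, cues in CATEGORIES.items():
            score = sum(sentence.lower().count(cue) for cue in cues)
            if score:
                buckets[category].append((score, sentence))
    return {category: [s for _, s in sorted(hits, reverse=True)[:per_category]]
            for category, hits in buckets.items()}
```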
Then if you went into that, you would see that there are lots of entities recognized here.
All the things in green were done automatically, and if you
pointed to these you would then summarize those in this
space, or you can go off to another one and summarize it.
So I want to just say each one of these main features was
done dynamically on new collections, while you wait.
And you can basically expand the searches, you can take a
search that's too big and break it into pieces, you can
make a new space and do algebra, do intersection on
it, or if you find a particular entity, you can
summarize it in different ways.
Those are examples of the kinds of things that you can
all do automatically.
So the message is these are all general, and if you have to do biology, you sort of work up towards the interspace, where you're intersecting all the spaces using these sets of them, by doing birds and bees and pigs and cows and brains and behavior.
These are actually all projects I'm working on.
I work in the genome center where this
project is going on.
So it is a birds and bees and pigs and cows project in some respect.
Let me now conclude to allow some time for questions by
just saying this is actually quite a different world.
It's not pile all the world's knowledge in one big place.
It's have many small little ones, including ones that are
sort of dynamic communities that are made on the fly.
And because of that every person that's doing it is
actually doing just about everything there--
indexing it, using the system, they're making new
collections, they're authoring materials themselves.
And the system itself could be occurring
all in one big server.
And ours, of course, does.
But it could also occur in many small places.
It's a very small, localized kind of system.
My guess is if you had to do this on 10 trillion, which is
what's going to be true in a decade on the web, then you
wouldn't have four or five big servers that cover the world.
What you'd have is servers at the end of every block, or a hierarchy like the telephone network used to have, where you'd have servers that actually handled each set of spaces that they were doing live manipulation against. It's quite a different world.
It's much more like the virtual worlds that the kids today wander around in.
Maybe you all are a little bit too old to spend all your time
on Neopets or even on Second Life.
So I promised I would end with what's a grand
project you could do.
So one grand project you could do is take some set of people,
like university is very convenient because you can
force undergraduates to do just about anything.
If you're at the University of Illinois,
there's 35,000 of them.
There's quite a few of them.
There's a few less because some of them came here.
And you capture all the text, the library and the courses
actually where--
our library has just gone into the Google Books program, and
all the context, which tries to do the relationships, partially live with this kind of system,
and partially by--
well, I guess this is actually OK to say but, if you gave
everyone at the University of Illinois free
Gmail and a free Gphone--
I guess the Gphone isn't announced yet, but there's
lots of rumors on the web that there will be one.
Anyway, if you gave everybody a free email and phone and
said with the proviso that we're going to capture
everything you ever do and we're going to use it for good
purposes, not selling you ads, but trying to relate things
together to help you understand the context, then
the university would be delighted because they'd like
to educate, not just the people on campus, but people
all over the world and make money charging them tuition.
People at Google might be delighted because normally you
couldn't do this experiment because you would get sued out
of existence, even with your lawyers I would guess, if you
tried to surreptitiously capture all the voice that was coming out of the Gphone.
That's not proposed is it?
I've had people tell me the University of Illinois might
refuse to have it done.
But if the undergrad takes it, that's the deal, right?
They take it as long as we're going to record everything.
You might really be able to build a semantically-based
social network, so you're not sharing a YouTube video because it's got the same little tag on top of it, but by some real, deep, scalable semantics underneath.
So that's all I have to say, and I did promise I would put
some bees at the end.
So someday we will do hive mind, and it probably will be
in your guys' lifetime, but not in mine.
That's all I have to say.
Thank you.
[APPLAUSE]
Question, yes?
AUDIENCE: I was wondering--
[SIDE CONVERSATION]
AUDIENCE: I was wondering could you use the semantic
relationships that you've built up to debug
the language itself?
In other words, create some kind of metric that detects
whether the description or the expression of a particular
concept is coherent or incoherent, and essentially
flag places where the terminology is insufficiently
expressive.
BRUCE SCHATZ: Could you hear the question or
should I repeat it?
OK.
The question was can you regularize the language since
you're now detecting all these patterns?
That's actually been done quite a bit with tagging to
quite a large degree of success.
So the reason that our digital library project succeeded and the one at Elsevier, which was a big publisher, failed, is we had a set of programs that went through and automatically cleaned up the tagging, the structure tagging, that was coming back from the publishers that the authors had provided, and then sent corrective information to the authors telling them what they should have done.
But the things that went into our system were
the cleaned up ones.
It's what data mining people call cleaning the data.
It is true that the more regular things are the better
they work, so that if you tried to do a chat session, like IM text messaging, it would work much worse than it
did with biology literature, which is much more
regularized.
The general experience with these kinds of systems is that
people are much better at hitting the mark than
computers are at handling variability.
So it's kind of like those handwriting recognizers where you learned how to write [UNINTELLIGIBLE].
So my guess is that yes, the users are trainable.
And if I tried to do this with undergrads, I would certainly do things like fail people that got in too many-- you know, it's like if your programs don't parse correctly then you don't get a passing grade.
It's a problem though.
The more regular the world is, the better this brand of
semantics does.
Is there another question?
Yes.
AUDIENCE: I will start with a simple practical question.
When I go to PubMed and ask for references including
phytic acid, it knows that phytic acid is inositol
hexakisphosphate.
Is there any automation in that process, or is that just
a laborious transcription process on the part of human beings?
BRUCE SCHATZ: OK, if you're asking what PubMed does, the
answer is they have a big translation table with all
those wired in.
It's because they're a large organization
with a lot of libraries.
They're actually able to provide a large set of common
synonyms to things.
If you have an automatic system it can't do that.
Well, actually ours is sort of a hybrid system.
Ours actually uses the synonyms like that that PubMed
has as a boost to finding equivalent ones.
If you're not able to do that, there's a whole set of linguistic processing that tries to find synonyms, with different degrees of success.
It looks for things that are in the same slots in sentences.
It looks for equivalent sentences that had different
subjects that were used the same.
It uses the ways that acronym expansions are commonly done.
There's a set of heuristics that work some of the time,
maybe two-thirds of the time in regularized text like this.
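As an illustration of the same-slot heuristic he mentions, here is a toy distributional test: two terms become synonym candidates when they appear surrounded by the same neighboring words. This is a sketch of the general idea, not PubMed's curated table or his system's pipeline.

```python
# Toy version of the same-slot synonym heuristic.
from collections import defaultdict

def slot_contexts(sentences: list[str], terms: set[str]) -> dict[str, set]:
    contexts: dict[str, set] = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            if word in terms:   # record a window of words around the slot
                contexts[word].add((tuple(words[max(0, i - 2):i]),
                                    tuple(words[i + 1:i + 3])))
    return contexts

def candidate_synonyms(contexts: dict[str, set],
                       a: str, b: str, overlap: int = 1) -> bool:
    """Synonym candidates if they share enough surrounding slots."""
    return len(contexts[a] & contexts[b]) >= overlap
```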
But they're not perfect in the way-- the ones you're seeing
are all human generated, and that's why they're so good.
You would always use human generated ones if you could. And in fact, when I give a more popular version of this kind of talk, what people point out is, even the kids on the block that maintain the cat site-- the one about cats.
You know, the small, specialized collections.
Even though they're not willing or probably not able
to do semantic mark-up, they are able to do
lots of other creation.
They are able to show typical sentences, they are able to do
synonyms. And there may be a lot of value added that comes
in at the bottom that improves each one of these community
collections.
I expect that that will be a big, big area when it becomes
a big commercial thing.
You'll need to have the users helping you to provide better information, better context.
Yes, Greg?
AUDIENCE: Remember-- don't go all the way back--
I remember the slide about functional phrases, and it
seemed that in the three examples that were on that
slide, there were of the form, something I might call a
template predicate.
In other words, A template relates to B. You seem to be
saying that the system automatically drive those
templates from analyzing the text.
Is that correct?
BRUCE SCHATZ: That is correct.
AUDIENCE: So my question then is this.
Can you compare and contrast that technique of producing
templates to two other things.
Number one, the system that the Cyc guys did to make--
BRUCE SCHATZ: [INAUDIBLE].
AUDIENCE: --to make predicates, but starting from
a different point and ending in a different point, although
they have predicates.
That's comparison number one.
Comparison number two is with respect to, let me just call
them template predicates for lack of a better word.
If you have those and you created them solely from
deriving them from text, then you
don't have world knowledge.
You basically have knowledge that just
came from the documents.
It seems to me that getting from the one to the other is
what Cyc was trying to do, but I understand that since they
were doing it by hand they abandoned that and they're now
trying to do automatic techniques.
So that thread of thought seems to be in the same
ballpark as what you're trying to do here, but with a
different approach.
I was wondering if you can compare and contrast, and
maybe there's a third area of endeavor trying to get to that
next step up that maybe you could educate us about.
BRUCE SCHATZ: Yeah.
That is a very, very good comment.
For those of you that don't know what Cyc is, C-Y-C. It
was a very ambitious attempt at MCC to try to encode enough
common sense knowledge about all of the world so that it
could automatically do this kind of thing.
As Greg said, it was largely a failure.
So let me sort of say what the spectrum of possible things is
as a longer answer.
Am I running over my time?
Is it OK?
MALE SPEAKER: It's lunchtime for a lot of these people.
Let's say another five minutes and then we'll formally break,
and then people who want to hang out it's OK, we got it.
[SIDE CONVERSATION]
BRUCE SCHATZ: I usually get lunch people by saying there's
free food, but that doesn't work here.
AUDIENCE: We all work for food.
BRUCE SCHATZ: You all work for food.
So Greg asked a very good question about where's the
line in automaticness.
Well, the old way of solving this problem used to be you
had a fixed set of templates.
What that essentially hit a wall with is each small
subject area needed a different set of templates,
and it was a lot of work to make the templates.
So then there were a set of people that said, if you had a small amount of basic world knowledge, you wouldn't need the templates; you could automatically make training examples.
The problem is that that could only rarely, in very isolated cases, do even as good a tagging as what I am showing.
What most of the people do now and what most of the examples
that I was showing are is a human comes up with a set of
training examples of what are typical sentences with genes
in them in this particular subject domain.
Then the system infers, exactly as you said, the
system infers what the grammar is, what the slots
are going to be.
There's a few people experimenting with two
automatic things, and they don't work at present, but my
belief is in the next year or two you'll see research
systems with it.
If you had a concerted commercial effort after it, you could probably do it and get away with it; it just wouldn't work all the time.
They're essentially either trying to automatically make
training sets, so you start out with the collection and
you try to pull out sentences that clearly have some slots
in it and then just infer things from that.
Or they try to automatically infer tags, infer grammar.
So you know some things-- you know body parts and you know genes-- and the question is can you infer behavior, because you already have slots in the particular subject domain.
My feeling is one of those two will work well enough so that
you can use it automatically and it will always do some
kind of tagging.
It won't be as accurate as these, which
are generally correct.
And it could either just be left as is, so it's like a
baseline of everything is tagged and 60% of them are
correct and 30% of them are ridiculous, but
60% buys you a lot.
Or they could be the input to humans generating it.
So the curators I'm working with in biology, we already
have a couple pieces of software that do this.
They don't have to look at all the
sentences in all the documents.
We give them a fixed set of sentences-- these are typical sentences that might be ones you'd want to look at-- and then they extract things out.
So there's a human step
afterwards that does a selection.
Almost all the statistical--
I went kind of fast through it-- but almost all the
statistical programs don't produce correct answers.
They produce ranked answers, where the top ones are sort of in the right band.
My expectation is the tagging will be like that.
So the practical question is which things are good for
which kind of text.
So I guess we have time for another question
if anyone has one.
MALE SPEAKER: You actually had another one because you
started with your easy one.
BRUCE SCHATZ: Should we take someone else?
MALE SPEAKER: Let me just suggest, because it is getting
close to lunchtime, let me suggest
one last basic question.
Given all the information about bees, have you been able
to figure out why they're disappearing?
BRUCE SCHATZ: It turns out actually we have a summer
workshop on exactly that topic.
And the answer is, like most things about
bees, nobody knows.
MALE SPEAKER: So much for that idea.
OK, well thank you very much, Bruce.
We appreciate the time.
Those of you who want to hang out, Bruce has time to stay
this afternoon.
We can all have lunch together.
BRUCE SCHATZ: I'm generally hanging out today and tomorrow
morning, and there's a lot of stuff about the system up on
the BeeSpace site, which you're welcome to look at.
And the slides are also going to be made available if you
want to flip through them.
Thank you everyone for staying through the whole thing.