MATT CUTTS: Hi, everybody.
We got a really interesting and very expansive question
from RobertvH in Munich.
RobertvH wants to know--
Hi Matt, could you please explain how Google's ranking
and website evaluation process works starting with the
crawling and analysis of a site, crawl timelines,
frequencies, priorities, indexing and filtering
processes within the databases, et cetera?
OK.
So that's basically just like, tell me
everything about Google.
Right?
That's a really expansive question.
It covers a lot of different ground.
And in fact, I have given orientation lectures to
engineers when they come in.
And I can talk for an hour about all those different
topics, and even talk for an hour about a very small subset
of those topics.
So let me talk for a while and see how much of a feel I can
give you for how the Google infrastructure works, how it
all fits together, how our crawling and indexing and
serving pipeline works.
Let's dive right in.
So there's three things that you really want to do well if
you want to be the world's best search engine.
You want to crawl the web comprehensively and deeply.
You want to index those pages.
And then you want to rank or serve those pages and return
the most relevant ones first.
Crawling is actually more difficult
than you might think.
When Google started, back when I joined in 2000, we
once went something like three or four months without
managing to crawl the web.
And we had to have a war room.
But a good mental model is that we basically take
PageRank as the primary determinant.
And the more PageRank you have-- that is, the more
people who link to you and the more reputable those people
are-- the more likely it is we're going to discover your
page relatively early in the crawl.
In fact, you could imagine crawling in strict PageRank
order, and you'd get the CNNs of the world and The New York
Times of the world and really very high-PageRank sites.
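To make that concrete, here is a minimal toy sketch of crawling in PageRank order, using a priority queue keyed on a PageRank-like score. The URLs and scores are invented for illustration, and this is nothing like Google's actual crawler.

```python
import heapq

# Toy frontier: (negative score, url) so the highest-scoring page pops first.
# The PageRank-like scores here are invented.
frontier = [
    (-0.85, "cnn.com"),
    (-0.80, "nytimes.com"),
    (-0.02, "tiny-blog.example"),
]
heapq.heapify(frontier)

seen = set(url for _, url in frontier)

def fetch_links(url):
    """Stand-in for an HTTP fetch; a real crawler would parse outlinks."""
    return []  # pretend the page had no new links

while frontier:
    neg_score, url = heapq.heappop(frontier)  # highest "PageRank" first
    print(f"crawling {url} (score {-neg_score:.2f})")
    for link, score in fetch_links(url):
        if link not in seen:
            seen.add(link)
            heapq.heappush(frontier, (-score, link))
```

The high-reputation sites come off the queue first, which is exactly the effect described above: CNN and The New York Times get crawled before the long tail.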
And if you think about how things used to be, we used to
crawl for 30 days.
So we'd crawl for several weeks.
And then we would index for about a week.
And then we would push that data out.
And that would take about a week.
And so that was what the Google dance was.
Sometimes you'd hit one data center that had old data.
And sometimes you'd hit a data center that had new data.
Now there's various interesting tricks
that you can do.
For example, after you've crawled for 30 days, you can
imagine recrawling the high-PageRank guys so you can see
if there's anything new or important that's hit the
CNN home page.
But for the most part, this is not fantastic.
Right?
Because if you're trying to crawl the web and it takes you
30 days, you're going to be out-of-date.
So eventually, in 2003, I believe, we switched, as part
of an update called Update Fritz, to crawling a fairly
significant chunk of the web every day.
And so if you imagine breaking the web into a certain number
of segments, you could imagine crawling that part of the web
and refreshing it every night.
And so at any given point, your main base index would
only be so out of date.
Because then you'd loop back around and you'd refresh that.
And that works very, very well.
Instead of waiting for everything to finish, you're
incrementally updating your index.
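As a toy sketch of that segmented refresh: split the index into N segments and refresh one each night, so no segment is ever more than N days stale. The segment count and refresh function here are invented for illustration.

```python
NUM_SEGMENTS = 10  # invented; chosen so one night's refresh is tractable

def refresh_segment(segment_id):
    """Stand-in for recrawling and reindexing one slice of the web."""
    print(f"nightly refresh: segment {segment_id}")

# Each night, refresh the next segment in round-robin order.
# After a full cycle, every segment is at most NUM_SEGMENTS days old.
for night in range(30):
    refresh_segment(night % NUM_SEGMENTS)
```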
And we've gotten even better over time.
So at this point, we can get very, very fresh.
Any time we see updates, we can usually
find them very quickly.
And in the old days, you would have not just a main or a base
index, but you could have what were called supplemental
results, or the supplemental index.
And that was something that we wouldn't crawl and refresh
quite as often.
But it was a lot more documents.
And so you could almost imagine having really fresh
content, a layer of our main index, and then more documents
that are not refreshed quite as often, but there's a lot
more of them.
So that's just a little bit about the crawl and how to
crawl comprehensively.
What you do then is you pass things around.
And you basically say, OK, I have crawled a large fraction
of the web.
And within that crawl you have, for example, one document.
And indexing is basically reversing things: a document
lists its words in page order, and an index lists, for
each word, the documents it appears in.
Well, let's just work through an example.
Suppose you search for Katy Perry.
In a document, the words Katy and Perry appear right
next to each other.
But what you want in an index is which documents does the
word Katy appear in, and which documents does the word
Perry appear in?
So you might say Katy appears in documents 1, and 2, and 89,
and 555, and 789.
And Perry might appear in documents number 2, and 8, and
73, and 555, and 1,000.
And so the whole process of building the index is a
reversal: instead of having the documents in word order,
you have the words, each with a list of documents in
document order.
So it's, OK, these are all the documents that a
word appears in.
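Here is a minimal sketch of that reversal, building an inverted index (word to sorted list of document IDs) from a toy corpus. Real indexes also store word positions and much more; the documents below are invented to echo the example.

```python
from collections import defaultdict

# Toy corpus keyed by document ID.
docs = {
    1: "katy sings tonight",
    2: "katy perry on tour",
    555: "interview with katy perry",
}

# Invert: instead of documents in word order, store each word with
# the sorted list of documents it appears in (its "posting list").
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

postings = {word: sorted(ids) for word, ids in index.items()}
print(postings["katy"])   # [1, 2, 555]
print(postings["perry"])  # [2, 555]
```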
Now when someone comes to Google and they type in Katy
Perry, you want to say, OK, what documents might match
Katy Perry?
Well, document one has Katy, but it doesn't have Perry.
So it's out.
Document number two has both Katy and Perry, so that's a
possibility.
Document eight has Perry but not Katy.
89 and 73 are out because they don't have the right
combination of words.
555 has both Katy and Perry.
And then these two are also out.
And so when someone comes to Google and they type in
Chicken Little, Britney Spears, Matt Cutts, Katy
Perry, whatever it is, we find the documents that we believe
have those words, either on the page or maybe in
backlinks, in anchor text pointing to that document.
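That document selection step boils down to intersecting posting lists. Here is a toy version using the exact numbers from the example; the merge walks both sorted lists and keeps the documents that appear in both.

```python
# Posting lists from the example: which documents each word appears in.
postings = {
    "katy":  [1, 2, 89, 555, 789],
    "perry": [2, 8, 73, 555, 1000],
}

def intersect(a, b):
    """Merge two sorted posting lists, keeping documents in both."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect(postings["katy"], postings["perry"]))  # [2, 555]
```

Documents 2 and 555 survive, matching the walkthrough above: they are the only ones with both words.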
Once you've done what's called document selection, you try to
figure out, how should you rank those?
And that's really tricky.
We use PageRank as well as over 200 other factors in our
rankings to try to say, OK, maybe this document is really
authoritative.
It has a lot of reputation because it has
a lot of PageRank.
But it only has the word Perry once.
And it just happens to have the word Katy somewhere else
on the page.
Whereas here is a document that has the words Katy and
Perry right next to each other, so there's proximity.
And it's got a lot of reputation.
It's got a lot of links pointing to it.
So we try to balance that off.
You want to find reputable documents that are also about
what the user typed in.
And that's kind of the secret sauce, trying to figure out a
way to combine those 200 different ranking signals in
order to find the most relevant document.
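Google's actual signals and weights aren't public, but as a crude illustration of blending signals, here is a toy scoring function that combines a reputation score with term proximity. Everything here, the weights, the proximity formula, the two-word query assumption, is invented.

```python
def score(doc, query_terms, reputation, w_rep=0.5, w_prox=0.5):
    """Toy ranking: blend link reputation with how close the two
    query terms appear to each other. Weights are invented, and
    the proximity logic is simplified for a two-word query."""
    words = doc.split()
    positions = {t: [i for i, w in enumerate(words) if w == t]
                 for t in query_terms}
    if any(not p for p in positions.values()):
        return 0.0  # a term is missing; the document is out
    # Proximity: smallest gap between any occurrences of the two terms.
    gaps = [abs(i - j)
            for i in positions[query_terms[0]]
            for j in positions[query_terms[1]]]
    return w_rep * reputation + w_prox * (1.0 / min(gaps))

# Terms adjacent, decent reputation vs. terms far apart, high reputation.
print(score("katy perry world tour", ["katy", "perry"], reputation=0.6))
print(score("perry talks politics and katy", ["katy", "perry"], reputation=0.9))
```

The first document wins (0.8 vs. 0.575) even with less reputation, which mirrors the balancing act described above: relevance and proximity can outweigh raw reputation.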
So at any given time, hundreds of millions of times a day,
someone comes to Google.
We try to find the closest data center to them.
They type in something like Katy Perry.
We send that query out to hundreds of different machines
all at once, which look through their little tiny
fraction of the web that we've indexed.
And we find, OK, these are the documents that
we think best match.
All those machines return their matches.
And we say, OK, what's the crème de la crème?
What's the needle in the haystack?
What's the best page that matches this query across our
entire index?
And then we take that page and we try to show it with a
useful snippet.
So you show the key words in the context of the document.
And you get it all back in under half a second.
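That serving flow is a scatter-gather pattern: fan the query out to many index shards in parallel, let each return its local best matches, then merge and take the global top result. Here is a purely illustrative sketch; the shard contents and scores are made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Each "shard" holds a tiny fraction of the index: (doc, score) pairs.
shards = [
    [("doc-a", 0.91), ("doc-b", 0.40)],
    [("doc-c", 0.75)],
    [("doc-d", 0.97), ("doc-e", 0.12)],
]

def search_shard(shard, query):
    """Each machine scans only its slice and returns its local best hits."""
    return sorted(shard, key=lambda pair: pair[1], reverse=True)[:2]

# Scatter: send the query to every shard at once.
with ThreadPoolExecutor() as pool:
    partials = pool.map(lambda s: search_shard(s, "katy perry"), shards)

# Gather: merge the local winners and pick the global best.
merged = sorted((hit for part in partials for hit in part),
                key=lambda pair: pair[1], reverse=True)
print(merged[0])  # ('doc-d', 0.97)
```

No single machine ever looks at the whole index, which is how the answer comes back in under half a second.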
So that's probably about as long as we can go on without
straining YouTube.
But that just gives you a little bit of a feel about how
the crawling system works, how we index documents, how things
get returned in under half a second through that massive
parallelization.
I hope that helps.
And if you want to know more, there's a whole bunch of
articles and academic papers about Google, and PageRank,
and how Google works.
But you can also apply to--
there's jobs@google.com, I think, or google.com/jobs, if
you're interested in learning a lot more about how search
engines work.
OK.
Thanks very much.