The following content is provided under a Creative
Commons license.
Your support will help MIT OpenCourseWare
continue to offer high-quality educational resources for free.
To make a donation or to view additional materials
from hundreds of MIT courses, visit MIT OpenCourseWare
at ocw.mit.edu.
JULIAN SHUN: Hi, good afternoon, everyone.
So today, we're going to be talking
about graph optimizations.
And as a reminder, on Thursday, we're
going to have a guest lecture by Professor Johnson of the MIT
Math Department.
And he'll be talking about performance
of high-level languages.
So please be sure to attend the guest lecture on Thursday.
So here's an outline of what I'm going
to be talking about today.
So we're first going to remind ourselves what a graph is.
And then we're going to talk about various ways
to represent a graph in memory.
And then we'll talk about how to implement
an efficient breadth-first search algorithm, both serially
and also in parallel.
And then I'll talk about how to use graph compression and graph
reordering to improve the locality of graph algorithms.
So first of all, what is a graph?
So a graph contains vertices and edges,
where vertices represent certain objects of interest,
and edges between objects model relationships between the two
objects.
For example, you can have a social network,
where the people are represented as vertices
and edges between people mean that they're
friends with each other.
The edges in this graph don't have to be bi-directional.
So you could have a one-way relationship.
For example, if you're looking at the Twitter network,
Alice could follow Bob, but Bob doesn't necessarily
have to follow Alice back.
The graph also doesn't have to be connected.
So here, this graph here is connected.
But, for example, there could be some people
who don't like to talk to other people.
And then they're just off in their own component.
You can also use graphs to model protein networks, where
the vertices are proteins, and edges between vertices
means that there's some sort of interaction
between the proteins.
So this is useful in computational biology.
As I said, edges can be directed,
so their relationship can go one way or both ways.
In this graph here, we have some directed edges and then also
some edges that are directed in both directions.
So here, John follows Alice.
Alice follows Peter.
And then Alice follows Bob, and Bob also follows Alice.
If you use a graph to represent the world wide web,
then the vertices would be websites,
and then the edges would denote that there is a hyperlink
from one website to another.
And again, the edges here don't have to be bi-directional
because website A could have a link to website B.
But website B doesn't necessarily
have to have a link back.
Edges can also be weighted.
So you can have a weight on the edge that
denotes the strength of the relationship
or some sort of distance measure corresponding
to that relationship.
So here, I have an example where I am using a graph
to represent cities.
And the edges between cities have
a weight that corresponds to the distance between the two
cities.
And if I want to find the quickest way to get from city A
to city B, then I would be interested in finding
the shortest path from A to B in this graph here.
Here's another example, where the edge weights now
are the costs of a direct flight from city A to city B.
And here the edges are directed.
So, for example, this says that there's
a flight from San Francisco to LA for $45.
And if I want to find the cheapest
way to get from one city to another city,
then, again, I would try to find the shortest path in this graph
from city A to city B.
Vertices and edges can also have metadata on them,
and they can also have types.
So, for example, here's the Google Knowledge Graph,
which represents all the knowledge on the internet
that Google knows about.
And here, the nodes have metadata on them.
So, for example, the node corresponding to da Vinci
is labeled with his date of birth and date of death.
And the vertices also have a color
corresponding to the type of knowledge that they refer to.
So you can see that some of these nodes are blue,
some of them are red, some of them are green,
and some of them have other things on them.
So in general, graphs can have types and metadata
on both the vertices as well as the edges.
Let's look at some more applications of graphs.
So graphs are very useful for implementing queries
on social networks.
So here are some examples of queries
that you might want to ask on a social network.
So, for example, you might be interested in finding
all of your friends who went to the same high school as you
on Facebook.
So that can be implemented using a graph algorithm.
You might also be interested in finding
all of the common friends you have with somebody else--
again, a graph algorithm.
And a social network service might run a graph algorithm
to recommend people that you might know and want
to become friends with.
And they might use a graph algorithm
to recommend certain products that you
might be interested in.
So these are all examples of social network queries.
And there are many other queries that you
might be interested in running on a social network.
And many of them can be implemented
using graph algorithms.
Another important application is clustering.
So here, the goal is to find groups of vertices
in a graph that are well-connected
internally and poorly-connected externally.
So in this image here, each blob of vertices of the same color
corresponds to a cluster.
And you can see that inside a cluster,
there are a lot of edges going among the vertices.
And between clusters, there are relatively fewer edges.
And some applications of clustering
include community detection and social networks.
So here, you might be interested in finding
groups of people with similar interests or hobbies.
You can also use clustering to detect fraudulent websites
on the internet.
You can use it for clustering documents.
So you would cluster documents that
have similar text together.
And clustering is often used for unsupervised learning
and machine learning applications.
Another application is connectomics.
So connectomics is the study of the structure, the network
structure of the brain.
And here, the vertices correspond to neurons.
And edges between two vertices means
that there's some sort of interaction between the two
neurons.
And recently, there's been a lot of work
on trying to do high-performance connectomics.
And some of this work has been going on here at MIT
by Professor Charles Leiserson and Professor Nir Shavit's
research group.
So recently, this has been a very hot area.
Graphs are also used in computer vision--
for example, in image segmentation.
So here, you want to segment your image
into the distinct objects that appear in the image.
And you can construct a graph by representing the pixels
as vertices.
And then you would place an edge between every pair
of neighboring pixels with a weight that corresponds
to their similarity.
And then you would run some sort of minimum cost cut algorithm
to partition your graph into the different objects that
appear in the image.
So there are many other applications.
And I'm not going to have time to go through all of them
today.
But here's just a flavor of some of the applications of graphs.
So any questions so far?
OK, so next, let's look at how we can
represent a graph in memory.
So for the rest of this lecture, I'm
going to assume that my vertices are labeled in the range from 0
to n minus 1.
So they have an integer in this range.
Sometimes, your graph might be given to you
where the vertices are already labeled in this range,
sometimes, not.
But you can always get these labels
by mapping each of the identifiers
to a unique integer in this range.
So for the rest of the lecture, I'm
just going to assume that we have these labels from 0
to n minus 1 for the vertices.
One way to represent a graph is to use an adjacency matrix.
So this is going to be an n by n matrix.
And there's a 1 bit in the i-th row and j-th column
if there's an edge that goes from vertex i to vertex j,
and 0 otherwise.
Another way to represent a graph is the edge list representation,
where we just store a list of the edges that
appear in the graph.
So we have one pair for each edge,
where the pair contains the two coordinates of that edge.
So what is the space requirement for each
of these two representations in terms of the number of edges m
and the number of vertices n in the graph?
So it should be pretty easy.
Yes.
AUDIENCE: n squared for the [INAUDIBLE]
and m for the [INAUDIBLE].
JULIAN SHUN: Yes, so the space for the adjacency matrix
is order n squared because you have n
squared cells in this matrix.
And you have 1 bit for each of the cells.
For the edge list, it's going to be order m
because you have m edges.
And for each edge, you're storing a constant amount
of data in the edge list.
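As a rough sketch of these two representations in C (the type and function names here are illustrative, not from the lecture):

#include <stdbool.h>
#include <stdlib.h>

// Adjacency matrix: n x n entries; entry (i, j) is true if there is an
// edge from vertex i to vertex j. Space is order n^2.
bool *make_adjacency_matrix(int n) {
  return calloc((size_t)n * n, sizeof(bool));
}

// Edge list: one (source, target) pair per edge. Space is order m.
typedef struct { int src, dst; } edge_t;

edge_t *make_edge_list(int m) {
  return malloc((size_t)m * sizeof(edge_t));
}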
So here's another way to represent a graph.
This is known as the adjacency list format.
And the idea here is that we're going
to have an array of pointers, 1 per vertex.
And each pointer points to a linked list storing
the edges for that vertex.
And the linked list is unordered in this example.
So what's the space requirement of this representation?
AUDIENCE: It's n plus m.
JULIAN SHUN: Yeah, so it's going to be order n plus m.
And this is because we have n pointers.
And the number of entries across all of the linked lists
is just equal to the number of edges in the graph, which is m.
What's one potential issue with this sort of representation
if you think in terms of cache performance?
Does anyone see a potential performance issue here?
Yeah.
AUDIENCE: So it could be [INAUDIBLE]..
JULIAN SHUN: Right.
So the issue here is that if you're
trying to loop over all of the neighbors of a vertex,
you're going to have to dereference the pointer
in every linked list node.
Because these are not contiguous in memory.
And every time you dereference a linked list node,
that's going to be a random access into memory.
So that can be bad for cache performance.
One way you can improve cache performance
is instead of using linked lists for each of these neighbor
lists, you can use an array.
So now you can store the neighbors just in this array,
and they'll be contiguous in memory.
One drawback of this approach is that it
becomes more expensive if you're trying to update the graph.
And we'll talk more about that later.
So any questions so far?
So what's another way to represent the graph that we've
seen in a previous lecture?
What's a more compressed or compact way
to represent a graph, especially a sparse graph?
So does anybody remember the compressed sparse row format?
So we looked at this in one of the early lectures.
And in that lecture, we used it to store a sparse matrix.
But you can also use it to store a sparse graph.
And as a reminder, we have two arrays in the compressed sparse
row, or CSR format.
We have the Offsets array and the Edges array.
The Offsets array stores an offset for each vertex
into the Edges array, telling us where
the edges for that particular vertex
begin in the Edges array.
So Offsets of i stores the offset
of where vertex i's edges start in the Edges array.
So in this example, vertex 0 has an offset of 0.
So its edges start at position 0 in the Edges array.
Vertex 1 has an offset of 4, so its edges
start at index 4 in the Edges array.
So with this representation, how can we
get the degree of a vertex?
So we're not storing the degree explicitly here.
Can we get the degree efficiently?
Yes.
AUDIENCE: [INAUDIBLE]
JULIAN SHUN: Yeah, so you can get the degree of a vertex
just by looking at the difference
between the next offset and its own offset.
So for vertex 0, you can see that its degree is 4
because vertex 1's offset is 4, and vertex 0's offset is 0.
And similarly you can do that for all of the other vertices.
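In C, a minimal sketch of the CSR representation and the degree computation might look like this (the struct and field names are my own, assuming the convention that Offsets has n + 1 entries with Offsets[n] equal to m):

#include <stdio.h>

typedef struct {
  int n, m;
  int *Offsets;  // length n + 1; Offsets[v] is where v's edges start
  int *Edges;    // length m; edge targets, contiguous per vertex
} csr_graph;

// The degree of v is the difference between consecutive offsets.
static inline int degree(const csr_graph *G, int v) {
  return G->Offsets[v + 1] - G->Offsets[v];
}

// Scanning the neighbors of v is a contiguous scan of Edges.
static inline void print_neighbors(const csr_graph *G, int v) {
  for (int i = G->Offsets[v]; i < G->Offsets[v + 1]; i++) {
    printf("%d ", G->Edges[i]);
  }
}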
So what's the space usage of this representation?
AUDIENCE: [INAUDIBLE]
JULIAN SHUN: Sorry, can you repeat?
AUDIENCE: [INAUDIBLE]
JULIAN SHUN: Yeah, so again, it's going to be order m plus n
because you need order n space for the Offsets array and order
m space for the Edges array.
You can also store values or weights on the edges.
One way to do this is to create an additional array of size m.
And then for edge i, you just store the weight
or the value in the i-th index of this additional array
that you created.
If you're always accessing the weight when you access an edge,
then it's actually better for a cache locality
to interleave the weights with the edge targets.
So instead of creating two arrays of size m,
you have one array of size 2m.
And every other entry is the weight.
And this improves cache locality because every time
you access an edge, its weight is going to be right next to it
in memory.
And it's going to likely be on the same cache line.
So that's one way to improve cache locality.
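As a sketch, assuming 4-byte integer weights, the two layouts might look like this (names are illustrative only):

// Separate arrays: Edges[i] is the target of edge i, Weights[i] is its
// weight. Accessing both touches two different cache lines.
//
// Interleaved: one array of size 2m, where each weight sits right next
// to its edge target, so both usually land on the same cache line.
static inline int edge_target(const int *EdgesAndWeights, int i) {
  return EdgesAndWeights[2 * i];
}
static inline int edge_weight(const int *EdgesAndWeights, int i) {
  return EdgesAndWeights[2 * i + 1];
}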
Any questions so far?
So let's look at some of the trade-offs
in these different graph representations
that we've looked at so far.
So here, I'm listing the storage costs
for each of these representations which
we already discussed.
This is also the cost for just scanning the whole graph in one
of these representations.
What's the cost of adding an edge in each
of these representations?
So for adjacency matrix, what's the cost of adding an edge?
AUDIENCE: Order 1.
JULIAN SHUN: So for adjacency matrix,
it's just order 1 to add an edge.
Because you have random access into this matrix,
so you just have to access the (i, j)-th entry
and flip the bit from 0 to 1.
What about for the edge list?
So assuming that the edge list is unordered,
so you don't have to keep the list in any sorted order.
Yeah.
AUDIENCE: I guess it's O of 1.
JULIAN SHUN: Yeah, so again, it's just O of 1
because you can just add it to the end of the edge list.
So that's constant time.
What about for the adjacency list?
So actually, this depends on whether we're
using linked lists or arrays for the neighbor
lists of the vertices.
If we're using a linked list, adding an edge just
takes constant time because we can just
put it at the beginning of the linked list.
If we're using an array, then we actually
need to create a new array to make space for this edge
that we add.
And that's going to cost us a degree of v work
to do that because we have to copy all the existing edges
over to this new array and then add this new edge
to the end of that array.
Of course, you could amortize this cost
across multiple updates.
So if you run out of memory, you can
double the size of your array so you
don't have to create these new arrays too often.
But the cost for any individual addition
is still relatively expensive compared to, say, an edge list
or adjacency matrix.
And then finally, for the compressed sparse row format,
if you add an edge, in the worst case,
it's going to cost us order m plus n work.
Because we're going to have to reconstruct the entire Offsets
array and the entire Edges array in the worst case.
Because we have to put something in and then
shift-- in the Edges array, you have
to put something in and shift all of the values
to the right of that over by one location.
And then for the Offsets array, we
have to modify the offset for the particular vertex we're
adding an edge to and then the offsets for all
of the vertices after that.
So the compressed sparse row representation
is not particularly friendly to edge updates.
What about for deleting an edge from some vertex v?
So for adjacency matrix, again, it's
going to be constant time because you just
randomly access the correct entry
and flip the bit from 1 to 0.
What about for an edge list?
AUDIENCE: [INAUDIBLE]
JULIAN SHUN: Yeah, so for an edge list, in the worst case,
it's going to cost us order m work because the edges are not
in any sorted order.
So we have to scan through the whole thing in the worst case
to find the edge that we're trying to delete.
For adjacency list, it's going to take order degree of v work
because the neighbors are not sorted.
So we have to scan through the whole thing
to find this edge that we're trying to delete.
And then finally, for a compressed sparse row,
it's going to be order m plus n because we're
going to have to reconstruct the whole thing in the worst case.
What about finding all of the neighbors
of a particular vertex v?
What's the cost of doing this in the adjacency matrix?
AUDIENCE: [INAUDIBLE]
JULIAN SHUN: Yes, so it's going to cost us
order n work to find all the neighbors
of a particular vertex because we just
scan the correct row in this matrix, the row
corresponding to vertex v. For the edge list,
we're going to have to scan the entire edge
list because it's not sorted.
So in the worst case, that's going to be order m.
For adjacency list, that's going to take order degree of v
because we can just find a pointer to the linked
list for that vertex in constant time.
And then we just traverse over the linked list.
And that takes order degree of v time.
And then finally, for compressed sparse row format,
it's also order degree of v because we have constant time
access into the appropriate location in the Edges array.
And then we can just read off the edges, which
are consecutive in memory.
So what about finding if a vertex w is a neighbor of v?
So I'll just give you the answer.
So for the adjacency matrix, it's
going to take constant time because again,
we just have to check the v-th row and the w-th column
and check if the bit is set there.
For edge list, we have to traverse the entire list
to see if the edge is there.
And then for adjacency list and compressed sparse row,
it's going to be order degree of v
because we just have to scan the neighbor list for that vertex.
So these are some graph representations.
But there are actually many other graph representations,
including variants of the ones that I've talked about here.
So, for example, for the adjacency list,
I said you can either use a linked list or an array
to store the neighbor list.
But you can actually use a hybrid approach, where
you store the linked list, but each linked list node actually
stores more than one vertex.
So you can store maybe 16 vertices
in each linked list node.
And that gives us better cache locality.
So for the rest of this lecture, I'm
going to talk about algorithms that
are best implemented using the compressed sparse row format.
And this is because we're going to be
dealing with sparse graphs.
We're going to be looking at static algorithms, where we
don't have to update the graph.
If we do have to update the graph,
then CSR isn't a good choice.
But we're just going to be looking at static algorithms
today.
And then for all the algorithms that we'll be looking at,
we're going to need to scan over all the neighbors of a vertex
that we visit.
And CSR is very good for that because all
of the neighbors for a particular vertex
are stored contiguously in memory.
So any questions so far?
OK, I do want to talk about some properties
of real-world graphs.
So first, we're seeing graphs that are quite large today.
But actually, they're not too large.
So here are the sizes of some of the real-world graphs
out there.
So there is a Twitter network.
That's actually a snapshot of the Twitter network
from a couple of years ago.
It has 41 million vertices and 1.5 billion edges.
And you can store this graph in about 6.3 gigabytes of memory.
So you can probably store it in the main memory of your laptop.
The largest publicly available graph out
there now is this Common Crawl web graph.
It has 3.5 billion vertices and 128 billion edges.
So storing this graph requires a little
over 1/2 terabyte of memory.
It is quite a bit of memory.
But it's actually not too big because there are machines out
there with main memory sizes in the order of terabytes
of memory nowadays.
So, for example, you can rent a 2-terabyte or 4-terabyte memory
instance on AWS, which you're using for your homework
assignments.
So if you have any leftover credits
at the end of the semester, and you
want to play around on this graph,
you can rent one of these terabyte machines.
Just remember to turn it off when
you're done because it's kind of expensive.
Another property of real-world graphs
is that they're quite sparse.
So m tends to be much less than n squared.
So most of the possible edges are not actually there.
And finally, the degree distributions of the vertices
can be highly skewed in many real-world graphs.
So here I'm plotting the degree on the x-axis
and the number of vertices with that particular degree
on the y-axis.
And we can see that it's highly skewed.
And, for example, in a social network, most of the people
would be on the left-hand side, so their degree is not
that high.
And then we have some very popular people
on the right-hand side, where their degree is very high,
but we don't have too many of those.
So this is what's known as a power law degree distribution.
And there have been various studies
that have shown that many real-world graphs have
approximately a power law degree distribution.
And mathematically, this means that the number
of vertices with degree d is proportional to d
to the negative p.
So negative p is the exponent.
And for many graphs, the value of p lies between 2 and 3.
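Written as a formula:

\[ |\{\, v : \deg(v) = d \,\}| \;\propto\; d^{-p}, \qquad \text{typically } 2 \le p \le 3. \]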
And this power law degree distribution
does have implications when we're
trying to implement parallel algorithms to process
these graphs.
Because with graphs that have a skewed degree distribution,
you could run into load imbalance issues.
If you just parallelize across the vertices,
the number of edges they have can vary significantly.
Any questions?
OK, so now let's talk about how we can implement a graph
algorithm.
And I'm going to talk about the breadth-first search algorithm.
So how many of you have seen breadth-first search before?
OK, so about half of you.
I did talk about breadth-first search in a previous lecture,
so I was hoping everybody would raise their hands.
OK, so as a reminder, in the BFS algorithm,
we're given a source vertex s, and we
want to visit the vertices in order of their distance
from the source s.
And there are many possible outputs
that we might care about.
One possible output is, we just want
to report the vertices in the order
that they were visited by the breadth-first search traversal.
So let's say we have this graph here.
And our source vertex is D. So what's
one possible order in which we can traverse these vertices?
Now, I should specify that we should
traverse this graph in a breadth-first search manner.
So what's the first vertex we're going to explore?
AUDIENCE: D.
JULIAN SHUN: D. So we're first going
to look at D because that's our source vertex.
The second vertex, we can actually
choose between B, C, and E because all we care about
is that we're visiting these vertices
in the order of their distance from the source.
But these three vertices are all of the same distance.
So let's just pick B, C, and then E.
And then finally, I'm going to visit
vertex A, which has a distance of 2 from the source.
So this is one possible solution.
There are other possible solutions
because we could have visited E before we visited B and so on.
Another possible output that we might care about
is we might want to report the distance from each vertex
to the source vertex s.
So in this example here are the distances.
So D has a distance of 0; B, C, and E all have a distance of 1;
and A has a distance of 2.
We might also want to generate a breadth-first search tree where
each vertex in the tree has a parent which
is a neighbor in the previous level
of the breadth-first search.
Or in other words, the parent should
have a distance of 1 less than that vertex itself.
So here's an example of a breadth-first search tree.
And we can see that each of the vertices
has a parent whose breadth-first search distance is 1 less
than itself.
So the algorithms that I'm going to be talking about today
will generate the distances as well as the BFS tree.
And BFS actually has many applications.
So it's used as a subroutine in betweenness
centrality, which is a very popular graph mining
algorithm used to rank the importance of nodes
in a network.
And the importance of nodes here corresponds
to how many shortest paths go through that node.
Other applications include eccentricity estimation,
maximum flows.
Some max flow algorithms use BFS as a subroutine.
You can use BFS to crawl the web,
do cycle detection, garbage collection, and so on.
So let's now look at a serial BFS algorithm.
And here, I'm just going to show the pseudocode.
So first, we're going to initialize the distances
to all INFINITY.
And we're going to initialize the parents to be NIL.
And then we're going to create a queue data structure.
We're going to set the distance of the root
to be 0 because the root has a distance of 0 to itself.
And then we're going to place the root onto this queue.
And then, while the queue is not empty,
we're going to dequeue the first thing in the queue.
We're going to look at all the neighbors of the current vertex
that we dequeued.
And for each neighbor, we're going
to check if its distance is INFINITY.
If the distance is INFINITY, that
means we haven't explored that neighbor yet.
So we're going to go ahead and explore it.
And we do so by setting its distance value
to be the current vertex's distance plus 1.
We're going to set the parent of that neighbor
to be the current vertex.
And then we'll place the neighbor onto the queue.
So it's a pretty simple algorithm.
And we're just going to keep iterating in this while loop
until there are no more vertices left in the queue.
So what's the work of this algorithm in terms of n and m?
So how much work are we doing per edge?
Yes.
AUDIENCE: [INAUDIBLE]
JULIAN SHUN: Yeah, so assuming that the enqueue and dequeue
operators are constant time, then
we're doing constant amount of work per edge.
So summed across all edges, that's going to be order m.
And then we're also doing a constant amount
of work per vertex because we have to basically place it
onto the queue and then take it off the queue,
and then also initialize its value.
So the overall work is going to be order m plus n.
OK, so let's now look at some actual code
to implement the serial BFS algorithm using
the compressed sparse row format.
So first, I'm going to initialize two arrays--
parent and queue.
And these are going to be integer arrays of size n.
I'm going to initialize all of the parent entries
to be negative 1.
I'm going to place a source vertex onto the queue.
So it's going to appear at queue of 0, that's
the beginning of the queue.
And then I'll set the parent of the source vertex
to be the source itself.
And then I also have two integers
that point to the front and the back of the queue.
So initially, the front of the queue is at position 0,
and the back is at position 1.
And then while the queue is not empty--
and I can check that by checking if q_front is not
equal to q_back--
then I'm going to dequeue the first vertex in my queue.
I'm going to set current to be that vertex.
And then I'll increment q_front.
And then I'll compute the degree of that vertex, which
I can do by looking at the difference
between consecutive offsets.
And I also assume that Offsets of n
is equal to m, just to deal with the last vertex.
And then I'm going to loop through all of the neighbors
for the current vertex.
And to access each neighbor, what I do
is I go into the Edges array.
And I know that my neighbors start at Offsets of current.
And therefore, to get the i-th neighbor,
I just do Offsets of current plus i.
That's my index into the Edges array.
Now I'm going to check if my neighbor has been explored yet.
And I can check that by checking if parent of neighbor
is equal to negative 1.
If it is, that means I haven't explored it yet.
And then I'll set a parent of neighbor to be current.
And then I'll place the neighbor onto the back of the queue
and increment q_back.
And I'm just going to keep repeating this while loop
until it becomes empty.
And here, I'm only generating the parent pointers.
But I could also generate the distances
if I wanted to with just a slight modification
of this code.
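Here's a compilable sketch of that serial BFS (my own transcription of what's being described, assuming the convention that Offsets[n] equals m):

#include <stdlib.h>

int *serial_bfs(int n, const int *Offsets, const int *Edges, int source) {
  int *parent = malloc(n * sizeof(int));
  int *queue  = malloc(n * sizeof(int));
  for (int i = 0; i < n; i++) parent[i] = -1;

  queue[0] = source;               // enqueue the source
  parent[source] = source;
  int q_front = 0, q_back = 1;

  while (q_front != q_back) {      // queue is not empty
    int current = queue[q_front++];            // dequeue
    int degree = Offsets[current + 1] - Offsets[current];
    for (int i = 0; i < degree; i++) {
      int ngh = Edges[Offsets[current] + i];   // i-th neighbor
      if (parent[ngh] == -1) {                 // not explored yet
        parent[ngh] = current;
        queue[q_back++] = ngh;                 // enqueue neighbor
      }
    }
  }
  free(queue);
  return parent;
}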
So any questions on how this code works?
OK, so here's a question.
What's the most expensive part of the code?
Can you point to one particular line here
that is the most expensive?
Yes.
AUDIENCE: I'm going to guess the [INAUDIBLE] that's gonna be all
over the place in terms of memory locations--
ngh equals Edges.
JULIAN SHUN: OK, so actually, it turns out
that that's not the most expensive part of this code.
But you're close.
So anyone have any other ideas?
Yes.
AUDIENCE: Is it looking up the parent array?
JULIAN SHUN: Yes, so it turns out that this line here,
where we're accessing parent of neighbor,
that turns out to be the most expensive.
Because whenever we access this parent array,
the neighbor can appear anywhere in memory.
So that's going to be a random access.
And if the parent array doesn't fit in our cache,
then that's going to cost us a cache miss almost every time.
This Edges array is actually mostly accessed sequentially.
Because for each vertex, all of its edges
are stored contiguously in memory,
we do have one random access into the Edges array per vertex
because we have to look up the starting location
for that vertex.
But it's not 1 per edge, unlike this check of the parent array.
That occurs for every edge.
So does that make sense?
So let's do a back-of-the-envelope
calculation to figure out how many cache misses we would
incur, assuming that we started with a cold cache.
And we also assume that n is much larger
than the size of the cache, so we can't fit
any of these arrays into cache.
We'll assume that a cache line has 64 bytes,
and integers are 4 bytes each.
So let's try to analyze this.
So the initialization will cost us n/16 cache misses.
And the reason here is that we're initializing this array
sequentially.
So we're accessing contiguous locations.
And this can take advantage of spatial locality.
On each cache line, we can fit 16 of the integers.
So overall, we're going to need n/16 cache misses just
to initialize this array.
We also need n/16 cache misses across the entire algorithm
to dequeue the vertex from the front of the queue.
Because again, this is going to be a sequential access
into this queue array.
And across all vertices, that's going to be n/16
cache misses because we can fit 16 integers on a cache line.
To compute the degree here, that's
going to take n cache misses overall.
Because each of these accesses to Offsets
array is going to be a random access.
Because we have no idea what the value of current here is.
It could be anything.
So across the entire algorithm, we're
going to need n cache misses to access this Offsets array.
And then to access this Edges array,
I claim that we're going to need at most 2n plus m/16 cache
misses.
So does anyone see where that bound comes from?
So where does the m/16 come from?
Yeah.
AUDIENCE: You have to access that at least once for an edge.
JULIAN SHUN: Right, so you have to pay m/16 because you're
accessing every edge once.
And you're accessing the Edges contiguously.
So therefore, across all Edges, that's
going to take m/16 cache misses.
But we also have to add 2n.
Because whenever we access the Edges for a particular vertex,
the first cache line might not only
contain that vertex's edges.
And similarly, the last cache line
that we access might also not just contain
that vertex's edges.
So therefore, we're going to waste the first cache
line and the last cache line in the worst case for each vertex.
And summed across all vertices, that's going to be 2n.
So this is the upper bound, 2n plus m/16.
Accessing this parent array, that's
going to be a random access every time.
So we're going to incur a cache miss
in the worst case every time.
So summed across all edge accesses,
that's going to be m cache misses.
And then finally, we're going to pay
n/16 cache misses to enqueue the neighbor onto the queue
because these are sequential accesses.
So in total, we're going to incur at most 51/16 n
plus 17/16 m cache misses.
And if m is greater than 3n, then the second term
here is going to dominate.
And m is usually greater than 3n in most real-world graphs.
And the second term here is dominated by this random access
into the parent array.
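Adding up the terms above gives that bound:

\[ \underbrace{\tfrac{n}{16}}_{\text{init}} + \underbrace{\tfrac{n}{16}}_{\text{dequeue}} + \underbrace{n}_{\text{Offsets}} + \underbrace{2n + \tfrac{m}{16}}_{\text{Edges}} + \underbrace{m}_{\text{parent}} + \underbrace{\tfrac{n}{16}}_{\text{enqueue}} \;=\; \tfrac{51}{16}\,n + \tfrac{17}{16}\,m. \]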
So let's see if we can optimize this code so that we
get better cache performance.
So let's say we could fit a bit vector of size n into cache.
But we couldn't fit the entire parent array into cache.
What can we do to reduce the number of cache misses?
So does anyone have any ideas?
Yeah.
AUDIENCE: Is the bit vector to keep track of which
vertices already have parents, then [INAUDIBLE]?
JULIAN SHUN: Yeah, so that's exactly correct.
So we're going to use a bit vector
to store whether the vertex has been explored yet or not.
So we only need 1 bit for that.
We're not storing the parent ID in this bit vector.
We're just storing a bit to say whether that vertex has
been explored yet or not.
And then, before we check this parent array,
we're going to first check the bit vector
to see if that vertex has been explored yet.
And if it has already been explored, we
don't even need to access this parent array.
If it hasn't been explored, then we
will go ahead and access the parent entry of the neighbor.
But we only have to do this one time
for each vertex in the graph because we can only
visit each vertex once.
And therefore, we can reduce the number of cache
misses from m down to n.
So overall, this might improve the number of cache misses.
In fact, it does if the number of edges
is large enough relative to the number of vertices.
However, you do have to do a little bit more computation
because you have to do bit vector manipulation to check
this bit vector and then also to set the bit vector when
you explore a neighbor.
So here's the code using the bit vector optimization.
So here, I'm initializing this bit vector called visited.
It's of size, approximately, n/32.
And then I'm setting all of the bits
to 0, except for the source vertex, where
I'm going to set its bit to 1.
And I'm doing this bit calculation here
to figure out the bit for the source vertex.
And then now, when I'm trying to visit a neighbor,
I'm first going to check if the neighbor is visited
by checking this bit array.
And I can do this using this computation here--
AND visited of neighbor over 32, by this mask--
1 left shifted by neighbor mod 32.
And if that's false, that means the neighbor
hasn't been visited yet.
So I'll go inside this IF clause.
And then I'll set the visited bit
to be true using this statement here.
And then I do the same operations as I did before.
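A minimal sketch of that bit-vector check in C (the helper names are mine, not the lecture's):

#include <stdint.h>
#include <stdlib.h>

// One bit per vertex; about n/32 words of 32 bits each.
uint32_t *make_visited(int n) {
  return calloc((n + 31) / 32, sizeof(uint32_t));
}
static inline int test_bit(const uint32_t *visited, int v) {
  return (visited[v / 32] & (1u << (v % 32))) != 0;
}
static inline void set_bit(uint32_t *visited, int v) {
  visited[v / 32] |= (1u << (v % 32));
}

// Inside the BFS edge loop, the check becomes:
//   if (!test_bit(visited, ngh)) {
//     set_bit(visited, ngh);
//     parent[ngh] = current;   // only now touch the parent array
//     queue[q_back++] = ngh;
//   }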
It turns out that this version is
faster for large enough values of m
relative to n because you reduce the number of cache
misses overall.
You still have to do this extra computation here,
this bit manipulation.
But if m is large enough, then the reduction
in number of cache misses outweighs
the additional computation that you have to do.
Any questions?
OK, so that was a serial implementation
of breadth-first search.
Now let's look at a parallel implementation.
So I'm first going to do an animation
of how a parallel breadth-first search algorithm would work.
The parallel breadth-first search algorithm
is going to operate on frontiers,
where the initial frontier contains just a source vertex.
And on every iteration, I'm going
to explore all of the vertices on the frontier
and then place any unexplored neighbors
onto the next frontier.
And then I move on to the next frontier.
So in the first iteration, I'm going
to mark the source vertex as explored,
set its distance to be 0, and then place
the neighbors of that source vertex onto the next frontier.
In the next iteration, I'm going to do the same thing, set
these distances to 1.
I also am going to generate a parent pointer
for each of these vertices.
And this parent should come from the previous frontier,
and it should be a neighbor of the vertex.
And here, there's only one option,
which is the source vertex.
So I'll just pick that as the parent.
And then I'm going to place the neighbors
onto the next frontier again, mark those as explored,
set their distances, and generate a parent
pointer again.
And notice here, when I'm generating these parent
pointers, there's actually more than one choice
for some of these vertices.
And this is because there are multiple vertices
on the previous frontier.
And some of them explored the same neighbor
on the current frontier.
So a parallel implementation has to be
aware of this potential race.
Here, I'm just picking an arbitrary parent.
So as we see here, you can process each
of these frontiers in parallel.
So you can parallelize over all of the vertices on the frontier
as well as all of their outgoing edges.
However, you do need to process one frontier before you
move on to the next one in this BFS algorithm.
And a parallel implementation has
to be aware of potential races.
So as I said earlier, we could have multiple vertices
on the frontier trying to visit the same neighbors.
So somehow, that has to be resolved.
And also, the amount of work on each frontier
is changing throughout the course of the algorithm.
So you have to be careful with load balancing.
Because you have to make sure that the amount of work
each processor has to do is about the same.
If you use Cilk to implement this,
then load balancing doesn't really become a problem.
So any questions on the BFS algorithm
before I go over the code?
OK, so here's the actual code.
And here I'm going to initialize these four arrays, so
the parent array, which is the same as before.
I'm going to have an array called frontier, which
stores the current frontier.
And then I'm going to have an array
called frontierNext, which is a temporary array
that I use to store the next frontier of the BFS.
And then also I have an array called degrees.
I'm going to initialize all of the parent entries
to be negative 1.
I do that using a cilk_for loop.
I'm going to place the source vertex at the 0-th index
of the frontier.
I'll set the frontierSize to be 1.
And then I set the parent of the source to be the source itself.
While the frontierSize is greater than 0,
that means I still have more work to do.
I'm going to first iterate over all
of the vertices on my frontier in parallel using a cilk_for
loop.
And then I'll set the i-th entry of the degrees array
to be the degree of the i-th vertex on the frontier.
And I can do this just using the difference
between consecutive offsets.
And then I'm going to perform a prefix sum on this degrees
array.
And we'll see in a minute why I'm doing this prefix sum.
But first of all, does anybody recall what prefix sum is?
So who knows what prefix sum is?
Do you want to tell us what it is?
AUDIENCE: That's the sum array where index i is the sum of
[INAUDIBLE].
JULIAN SHUN: Yeah, so prefix sum--
so here I'm going to demonstrate this with an example.
So let's say this is our input array.
The output of this array would store for each location
the sum of everything before that location in the input
array.
So here we see that the first position has a value of 0
because a sum of everything before it is 0.
There's nothing before it in the input.
The second position has a value of 2
because the sum of everything before it is just
the first location.
The third location has a value of 6
because the sum of everything before it is 2
plus 4, which is 6, and so on.
So I believe this was on one of your homework assignments.
So hopefully, everyone knows what prefix sum is.
And later on, we'll see how we use
this to do the parallel breadth-first search.
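In code, an exclusive prefix sum is simply this (a serial version shown for clarity; the parallel, work-efficient version was covered on the homework):

// output[i] gets the sum of input[0..i-1]; output[0] is 0.
void prefix_sum(const int *input, int *output, int n) {
  int sum = 0;
  for (int i = 0; i < n; i++) {
    output[i] = sum;
    sum += input[i];
  }
}

// Example: input  = {2, 4, 3, 1, 5}
//          output = {0, 2, 6, 9, 10}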
OK, so I'm going to do a prefix sum on this degrees array.
And then I'm going to loop over my frontier again in parallel.
I'm going to let v be the i-th vertex on the frontier.
Index is going to be equal to degrees of i.
And then my degree is going to be Offsets of v
plus 1 minus Offsets of v.
Now I'm going to loop through all v's neighbors.
And here I just have a serial for loop.
But you could actually parallelize this for loop.
It turns out that if the number of iterations in the for loop
is small enough, there's additional overhead
to making this parallel, so I just made it serial for now.
But you could make it parallel.
To get the neighbor, I just index into this Edges array.
I look at Offsets of v plus j.
Then now I'm going to check if the neighbor has
been explored yet.
And I can check if parent of neighbor
is equal to negative 1.
So that means it hasn't been explored yet, so I'm
going to try to explore it.
And I do so using a compare-and-swap.
I'm going to try to swap in the value of v
with the original value of negative 1
in parent of neighbor.
And the compare-and-swap is going
to return true if it was successful and false otherwise.
And if it returns true, that means
this vertex becomes the parent of this neighbor.
And then I'll place the neighbor on
to frontierNext at this particular index--
index plus j.
And otherwise, I'll set a negative 1 at that location.
OK, so let's see why I'm using index plus j here.
So here's how frontierNext is organized.
So each vertex on the frontier owns
a subset of these locations in the frontierNext array.
And these are all contiguous memory locations.
And it turns out that the starting location
for each of these vertices in this frontierNext array
is exactly the value in this prefix sum array up here.
So vertex 1 has its first location at index 0.
Vertex 2 has its first location at index 2.
Vertex 3 has its first location at index 6, and so on.
So by using a prefix sum, I can guarantee
that all of these vertices have a disjoint subarray
in this frontierNext array.
And then they can all write to this frontierNext array
in parallel without any races.
And index plus j just gives us the right location
to write to in this array.
So index is the starting location,
and then j is for the j-th neighbor.
So here is one potential output after we write
to this frontierNext array.
So we have some non-negative values.
And these are vertices that we explored in this iteration.
We also have some negative 1 values.
And the negative 1 here means that either the vertex has
already been explored in a previous iteration,
or we tried to explore it in the current iteration,
but somebody else got there before us.
Because somebody else is doing the compare-and-swap
at the same time, and they could have finished before we did,
so we failed on the compare-and-swap.
So we don't actually want these negative 1 values, so we're
going to filter them out.
And we can filter them out using a prefix sum again.
And this is going to give us a new frontier.
And we'll set the frontierSize equal to the size
of this new frontier.
And then we repeat this while loop
until there are no more vertices on the frontier.
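Putting that together, here's a sketch of one iteration in Cilk (the compare-and-swap is the GCC/Clang __sync builtin; prefix_sum_exclusive stands in for the scan sketched earlier, and the filter step is sketched below):

#include <cilk/cilk.h>

void prefix_sum_exclusive(int *a, int len);  // in-place exclusive scan

void bfs_iteration(const int *Offsets, const int *Edges, int *parent,
                   const int *frontier, int frontierSize,
                   int *degrees, int *frontierNext) {
  // Compute the degree of each frontier vertex.
  cilk_for (int i = 0; i < frontierSize; i++) {
    int v = frontier[i];
    degrees[i] = Offsets[v + 1] - Offsets[v];
  }
  // Prefix sum gives each vertex a disjoint range of slots in frontierNext.
  prefix_sum_exclusive(degrees, frontierSize);

  cilk_for (int i = 0; i < frontierSize; i++) {
    int v = frontier[i];
    int index = degrees[i];                       // start of v's slots
    int degree = Offsets[v + 1] - Offsets[v];
    for (int j = 0; j < degree; j++) {
      int ngh = Edges[Offsets[v] + j];
      if (parent[ngh] == -1 &&
          __sync_bool_compare_and_swap(&parent[ngh], -1, v)) {
        frontierNext[index + j] = ngh;   // we won the race: ngh is ours
      } else {
        frontierNext[index + j] = -1;    // already explored or lost race
      }
    }
  }
}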
So any questions on this parallel BFS algorithm?
Yeah.
AUDIENCE: Can you go over like the last [INAUDIBLE]??
JULIAN SHUN: Do you mean the filter out?
AUDIENCE: Yeah.
JULIAN SHUN: Yeah, so what you can do
is, you can create another array, which stores a 1
in location i if that location is not a negative 1 and 0
if it is a negative 1.
Then you do a prefix sum on that array,
which gives us unique offsets into an output array.
So then everybody just looks at the prefix sum array there.
And then it writes to the output array.
So it might be easier if I tried to draw this on the board.
OK, so let's say we have an array of size 5 here.
So what I'm going to do is I'm going
to generate another array which stores
a 1 if the value in the corresponding location
is not a negative 1 and 0 otherwise.
And then I do a prefix sum on this array here.
And this gives me 0, 1, 1, 2, and 2.
And now each of these values that are not negative 1,
they can just look up the corresponding index
in this output array.
And this gives us a unique index into an output array.
So this element will write to position 0,
this element would write to position 1,
and this element would write to position 2 in my final output.
So this would be my final frontier.
Does that make sense?
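As a sketch in code (a serial version for clarity; both loops and the scan can be parallelized as above):

// Keep only the non-negative entries of frontierNext, compacted into
// frontier. flags is scratch space of the same length. Returns the new
// frontier size.
int filter_frontier(const int *frontierNext, int len,
                    int *flags, int *frontier) {
  for (int i = 0; i < len; i++) {
    flags[i] = (frontierNext[i] != -1) ? 1 : 0;
  }
  int newSize = 0;
  for (int i = 0; i < len; i++) {    // exclusive prefix sum of flags
    int f = flags[i];
    flags[i] = newSize;
    newSize += f;
  }
  for (int i = 0; i < len; i++) {
    if (frontierNext[i] != -1) {
      frontier[flags[i]] = frontierNext[i];   // unique output slot
    }
  }
  return newSize;
}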
OK, so let's now analyze the work and span
of this parallel BFS algorithm.
So the number of iterations required by the BFS algorithm
is upper-bounded by the diameter D of the graph.
And the diameter of a graph is just the maximum shortest
path between any pair of vertices in the graph.
And that's an upper bound on the number of iterations
we need to do.
Each iteration is going to take a log m
span for the cilk_for loops, the prefix sum, and the filter.
And this is also assuming that the inner loop
is parallelized, the inner loop over the neighbors of a vertex.
So to get the span, we just multiply these two terms.
So we get theta of D times log m span.
What about the work?
So to compute the work, we have to figure out
how much work we're doing per vertex and per edge.
So first, notice that the sum of the frontier
sizes across entire algorithm is going
to be n because each vertex can be on the frontier at most
once.
Also, each edge is going to be traversed exactly once.
So that leads to m total edge visits.
On each iteration of the algorithm,
we're doing a prefix sum.
And the cost of this prefix sum is
going to be proportional to the frontier size.
So summed across all iterations, the cost of the prefix
sum is going to be theta of n.
We also have to do this filter.
But the work of the filter is proportional to the number
of edges traversed in that iteration.
And summed across all iterations, that's
going to give theta of m total.
So overall, the work is going to be
theta of n plus m for this parallel BFS algorithm.
So this is a work-efficient algorithm.
The work matches that of the serial algorithm.
Any questions on the analysis?
OK, so let's look at how this parallel BFS
algorithm runs in practice.
So here, I ran some experiments on a random graph
with 10 million vertices and 100 million edges.
And the edges were randomly generated.
And I made sure that each vertex had 10 edges.
I ran experiments on a 40-core machine
with 2-way hyperthreading.
Does anyone know what hyperthreading is?
Yeah, what is it?
AUDIENCE: It's when you have like one CPU core that
can execute two instruction streams at the same time
so it can [INAUDIBLE] hide memory latency.
JULIAN SHUN: Yeah, so that's a great answer.
So hyperthreading is an Intel technology
where for each physical core, the operating system actually
sees it as two logical cores.
They share many of the same resources,
but they have their own registers.
So if one of the logical cores stalls on a long latency
operation, the other logical core
can use the shared resources and hide some of the latency.
OK, so here I am plotting the speedup
over the single-threaded time of the parallel algorithm
versus the number of threads.
So we see that on 40 threads, we get
a speedup of about 22 or 23X.
And when we turn on hyperthreading
and use all 80 threads, the speedup is about 32 times
on 40 cores.
And this is actually pretty good for a parallel graph algorithm.
It's very hard to get very good speedups
on these irregular graph algorithms.
So 32X on 40 cores is pretty good.
I also compared this to the serial BFS algorithm
because that's what we ultimately
want to compare against.
So we see that on 80 threads, the speedup over the serial BFS
is about 21, 22X.
And the serial BFS is 54% faster than the parallel BFS
on one thread.
This is because it's doing less work than the parallel version.
The parallel version has to do extra work with the prefix
sum and the filter, whereas the serial version doesn't
have to do that.
But overall, the parallel implementation
is still pretty good.
OK, questions?
So a couple of lectures ago, we saw this slide here.
So Charles told us never to write
nondeterministic parallel programs because it's
very hard to debug these programs and hard to reason
about them.
So is there nondeterminism in this BFS code
that we looked at?
AUDIENCE: You have nondeterminism
in the compare-and-swap.
JULIAN SHUN: Yeah, so there's nondeterminism
in the compare-and-swap.
So let's go back to the code.
So this compare-and-swap here, there's
a race there because we get multiple vertices trying
to write to the parent entry of the neighbor at the same time.
And the one that wins is nondeterministic.
So the BFS tree that you get at the end is nondeterministic.
OK, so let's see how we can try to fix this nondeterminism.
OK so, as we said, this is a line
that causes the nondeterminism.
It turns out that we can actually make the output BFS
tree, be deterministic by going over
the outgoing edges in each iteration in two phases.
So how this works is that in the first phase,
the vertices on the frontier are not actually
going to write to the parent array.
Or they are going to write, but they're
going to be using this writeMin operator.
And the writeMin operator is an atomic operation
that guarantees that when we have concurrent writes
to the same location,
the smallest value gets written there.
So the value that gets written there
is going to be deterministic.
It's always going to be the smallest
one that tries to write there.
Then in the second phase, each vertex
is going to check for each neighbor
whether a parent of neighbor is equal to v. If it is,
that means it was the vertex that successfully wrote
to parent of neighbor in the first phase.
And therefore, it's going to be responsible for placing
this neighbor onto the next frontier.
And we're also going to set parent of neighbor
to be negative v. This is just a minor detail.
And this is because when we're doing this writeMin operator,
we could have a future iteration where a lower vertex tries
to visit the same vertex that we already explored.
But if we set this to a negative value,
we're only going to be writing non-negative values
to this location.
So the writeMin on a neighbor that has already been explored
would never succeed.
OK, so the final BFS tree that's generated by this code
is always going to be the same every time you run it.
I want to point out that this code is still
nondeterministic with respect to the order
in which individual memory locations get updated.
So you still have a race here in the writeMin operator,
but its outcome is deterministic.
But it's still better than a nondeterministic code
in that you always get the same BFS tree.
So how do you actually implement the writeMin operation?
So it turns out you can implement this using
a loop with a compare-and-swap.
So writeMin takes as input two arguments--
the memory address that we're trying to update
and the new value that we want to write to that address.
We're first going to set oldval equal to the value
at that memory address.
And we're going to check if newval is less than oldval.
If it is, then we're going to attempt
to do a compare-and-swap at that location,
writing newval into that address if its initial value
was oldval.
And if that succeeds, then we return.
Otherwise, we failed.
And that means that somebody else came in the meantime
and changed the value there.
And therefore, we have to reread the old value
at the memory address.
And then we repeat.
And there are two ways that this writeMin operator could finish.
One is if the compare-and-swap was successful.
The other one is if newval is greater than
or equal to oldval.
In that case, we no longer have to try to write anymore
because the value that's there is already smaller than what
we're trying to write.
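Here's a sketch of that writeMin in C, using the GCC/Clang compare-and-swap builtin:

// Atomically ensure *addr ends up no larger than newval.
void write_min(int *addr, int newval) {
  int oldval = *addr;
  while (newval < oldval) {
    if (__sync_bool_compare_and_swap(addr, oldval, newval)) {
      return;        // success: we wrote the smaller value
    }
    oldval = *addr;  // someone changed it in the meantime; reread, retry
  }
  // newval >= oldval: a value at least as small is already there.
}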
So I implemented an optimized version
of this deterministic parallel BFS code
and compared it to the nondeterministic version.
And it turns out on 32 cores, it's
only a little bit slower than the nondeterministic version.
So it's about 5% to 20% slower on a range of different input
graphs.
So this is a pretty small price to pay for determinism.
And you get many nice benefits, such as ease
of debugging and ease of reasoning about the performance
of your code.
Any questions?
OK, so let me talk about another optimization
for breadth-first search.
And this is called the direction optimization.
And the idea is motivated by how the sizes of the frontiers
change in a typical BFS algorithm over time.
So here I'm plotting the frontier size
on the y-axis in log scale.
And the x-axis is the iteration number.
And on the left, we have a random graph, on the right,
we have a power law graph.
And we see that the frontier size actually
grows pretty rapidly, especially for the power law graph.
And then it drops pretty rapidly.
So this is true for many of the real-world graphs
that we see because many of them look like power law graphs.
And in the BFS algorithm, most of the work
is done when the frontier is relatively large.
So most of the work is going to be
done in these middle iterations where
the frontier is very large.
And it turns out that there are two ways
to do breadth-first search.
One way is the traditional way, which
I'm going to refer to as the top-down method.
And this is just what we did before.
We look at the frontier vertices,
and explore all of their outgoing neighbors,
and mark any of the unexplored ones as explored,
and place them on to the next frontier.
But there's actually another way to do breadth-first search.
And this is known as the bottom-up method.
And in the bottom-up method, I'm going
to look at all of the vertices in the graph that
haven't been explored yet, and I'm
going to look at their incoming edges.
And if I find an incoming edge that's on the current frontier,
I can just say that that incoming neighbor is my parent.
And I don't even need to look at the rest
of my incoming neighbors.
So in this example here, vertices 9 through 12,
when they loop through their incoming edges,
they found an incoming neighbor on the frontier,
and they chose that neighbor as their parent.
And they get marked as explored.
And we can actually save some edge traversals here because,
for example, if you look at vertex 9,
and you imagine the edges being traversed
in a top-to-bottom manner, then vertex 9 is only
going to look at its first incoming edge
and find the incoming neighbors on the frontier.
So it doesn't even need to inspect
the rest of the incoming edges because all
we care about finding is just one parent in the BFS tree.
We don't need to find all of the possible parents.
In this example here, vertices 13 through 15 actually ended up
wasting work because they looked at all of their incoming edges.
And none of the incoming neighbors are on the frontier.
So they don't actually find a neighbor.
So the bottom-up approach turns out
to work pretty well when the frontier is large
and many vertices have been already explored.
Because in this case, you don't have to look at many vertices.
And for the ones that you do look at,
when you scan over their incoming edges,
it's very likely that early on, you'll
find a neighbor that is on the current frontier,
and you can skip a bunch of edge traversals.
And the top-down approach is better
when the frontier is relatively small.
And in a paper by Scott Beamer in 2012,
he actually studied the performance
of these two approaches in BFS.
And this plot here plots the running time
versus the iteration number for a power law graph
and compares the performance of the top-down and bottom-up
approach.
So we see that for the first two steps,
the top-down approach is faster than the bottom-up approach.
But then for the next couple of steps,
the bottom-up approach is faster than a top-down approach.
And then when we get to the end, the top-down approach
becomes faster again.
So the top-down approach, as I said,
is more efficient for small frontiers,
whereas a bottom-up approach is more
efficient for large frontiers.
Also, I want to point out that in the top-down approach, when
we update the parent array, that actually has to be atomic.
Because we can have multiple vertices trying
to update the same neighbor.
But in a bottom-up approach, the update to the parent array
doesn't have to be atomic.
Because we're scanning over the incoming neighbors
of any particular vertex v serially.
And therefore, there can only be one processor
that's writing to parent of v.
So we choose between these two approaches based
on the size of the frontier.
We found that a threshold of a frontier size of about n/20
works pretty well in practice.
So if the frontier has more than n/20 vertices,
we use a bottom-up approach.
And otherwise, we use a top-down approach.
You can also use more sophisticated thresholds,
such as also considering the sum of out-degrees,
since the actual work is dependent on the sum
of out-degrees of the vertices on the frontier.
You can also use different thresholds
for going from top-down to bottom-up and then
another threshold for going from bottom-up back to top-down.
And in fact, that's what the original paper did.
They had two different thresholds.
We also need to generate the inverse graph,
or the transposed graph, if we're using this method
and the graph is directed.
Because if the graph is directed,
in the bottom-up approach, we actually
need to look at the incoming neighbors, not
the outgoing neighbors.
So if the graph wasn't already symmetrized,
then we have to generate both the incoming neighbors
and outgoing neighbors for each vertex.
So we can do that as a pre-processing step.
Any questions?
OK, so how do we actually represent the frontier?
So one way to represent the frontier
is to just use a sparse integer array,
which is what we did before.
Another way to do this is to use a dense array.
So, for example, here I have an array of bytes.
The array is of size n, where n is the number of vertices.
And I have a 1 in position i if vertex i
is on the frontier and 0 otherwise.
I can also use a bit vector to further compress this
and then use additional bit level operations to access it.
So for the top-down approach, a sparse representation
is better because the top-down approach usually
deals with small frontiers.
And if we use a sparse array, we only
have to do work proportional to the number of vertices
on the frontier.
And then in the bottom-up approach,
it turns out that dense representation is better
because we're looking at most of the vertices anyways.
And then we need to switch between these two methods based
on the approach that we're using.
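As a rough illustration, switching between the two representations could be done with helpers like these (illustrative names, not from the lecture):

    #include <cstdint>
    #include <vector>

    // Convert between the sparse frontier (a list of vertex IDs) and
    // the dense frontier (one flag per vertex).
    std::vector<char> sparse_to_dense(const std::vector<int64_t>& sparse,
                                      int64_t n) {
      std::vector<char> dense(n, 0);
      for (int64_t v : sparse) dense[v] = 1;      // mark frontier vertices
      return dense;
    }

    std::vector<int64_t> dense_to_sparse(const std::vector<char>& dense) {
      std::vector<int64_t> sparse;
      for (int64_t v = 0; v < (int64_t)dense.size(); v++)
        if (dense[v]) sparse.push_back(v);        // collect marked vertices
      return sparse;
    }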
So here's some performance numbers comparing the three
different modes of traversal.
So we have bottom-up, top-down, and then
the direction optimizing approach
using a threshold of n/20.
First of all, we see that the bottom-up approach
is the slowest for both of these graphs.
And this is because it's doing a lot of wasted work
in the early iterations.
We also see that the direction optimizing approach is always
faster than both the top-down and the bottom-up approach.
This is because if we switch to the bottom-up approach
at an appropriate time, then we can
save a lot of edge traversals.
And, for example, you can see for the power law graph,
the direction optimizing approach
is almost three times faster than the top-down approach.
The benefits of this approach are highly
dependent on the input graph.
So it works very well for power law and random graphs.
But if you have graphs where the frontier size is always small,
such as a grid graph or a road network,
then you would never use a bottom-up approach.
So this wouldn't actually give you any performance gains.
Any questions?
So it turns out that this direction optimizing
idea is more general than just breadth-first search.
So a couple years ago, I developed
this framework called Ligra, where I generalized
the direction optimizing idea to other graph algorithms,
such as betweenness centrality, connected components, sparse
PageRank, shortest paths, and so on.
And in the Ligra framework, we have an EDGEMAP operator
that chooses between a sparse implementation
and a dense implementation based on the size of the frontier.
So the sparse here corresponds to the top-down approach.
And dense corresponds to the bottom-up approach.
And it turns out that using this direction optimizing
idea for these other applications
also gives you performance gains in practice.
OK, so let me now talk about another optimization, which
is graph compression.
And the goal here is to reduce the amount of memory usage
in the graph algorithm.
So recall, this was our CSR representation.
And in the Edges array, we just stored
the target vertex of each edge.
Instead of storing the actual targets,
we can actually do better by first sorting the edges so
that they appear in non-decreasing order
and then just storing the differences
between consecutive edges.
And then for the first edge for any particular vertex,
we'll store the difference between the target
and the source of that edge.
So, for example, here, for vertex 0,
the first edge is going to have a value of 2
because we're going to take the difference between the target
and the source.
So 2 minus 0 is 2.
Then for the next edge, we're going
to take the difference between the second edge
and the first edge, so 7 minus 2, which is 5.
And then similarly we do that for all of the remaining edges.
Notice that there are some negative values here.
And this is because the target is smaller than the source.
So in this example, 1 is smaller than 2.
So if you do 1 minus 2, you get negative 1.
And this can only happen for the first edge
for any particular vertex because for all
the other edges, we're encoding the difference
between that edge and the previous edge.
And we already sorted these edges
so that they appear in non-decreasing order.
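Here's a small C++ sketch of this delta encoding for a single vertex's sorted edge list (illustrative names):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Delta-encode one vertex's sorted edge list: the first target is
    // stored relative to the source vertex (and may be negative), and
    // each later target relative to the previous one (never negative,
    // since the targets are sorted).
    std::vector<int64_t> delta_encode(int64_t source,
                                      const std::vector<int64_t>& targets) {
      std::vector<int64_t> deltas(targets.size());
      for (std::size_t i = 0; i < targets.size(); i++)
        deltas[i] = (i == 0) ? targets[0] - source
                             : targets[i] - targets[i - 1];
      return deltas;
    }
    // For example, source 0 with targets {2, 7} gives deltas {2, 5},
    // matching the walkthrough above.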
OK, so this compressed edges array
will typically contain smaller values
than this original edges array.
So now we want to be able to use fewer bits
to represent these values.
We don't want to use 32 or 64 bits like we did before.
Otherwise, we wouldn't be saving any space.
So one way to reduce the space usage
is to store these values using what's called a variable-length
code, or a k-bit code.
And the idea is to encode each value in chunks of k bits,
where for each chunk, we use k minus 1 bits for the data and 1
bit as the continue bit.
So for example, let's encode the integer 401
using 8-bit or byte codes.
So first, we're going to write this value out in binary.
And then we're going to take the bottom 7 bits,
and we're going to place that into the data
field of the first chunk.
And then in the last bit of this chunk,
we're going to check if we still have any more
bits that we need to encode.
And if we do, then we're going to set a 1 in the continue bit
position.
And then we create another chunk.
We'll place the next 7 bits into the data
field of that chunk.
And then now we're actually done encoding this integer value.
So we can place a 0 in the continue bit.
So that's how the encoding works.
And decoding is just doing this process backwards.
So you read chunks until you find a chunk with a 0
continue bit.
And then you shift all of the data values
left accordingly and sum them together
to reconstruct the integer value that you encoded.
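Here's a small C++ sketch of the byte-code case (k = 8), assuming the common layout where the continue bit is the high bit of each byte:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Byte code: 7 data bits per chunk, with the high bit of each byte
    // used as the continue bit (one common layout; the lecture doesn't
    // pin down the exact bit position).
    void encode_varbyte(uint64_t x, std::vector<uint8_t>& out) {
      do {
        uint8_t chunk = x & 0x7F;            // low 7 bits of data
        x >>= 7;
        if (x != 0) chunk |= 0x80;           // more chunks follow
        out.push_back(chunk);
      } while (x != 0);
    }

    // Decode one integer starting at position i; advances i past it.
    uint64_t decode_varbyte(const std::vector<uint8_t>& in, std::size_t& i) {
      uint64_t x = 0;
      int shift = 0;
      while (true) {
        uint8_t chunk = in[i++];
        x |= (uint64_t)(chunk & 0x7F) << shift;   // accumulate data bits
        if (!(chunk & 0x80)) break;               // continue bit 0: done
        shift += 7;
      }
      return x;
    }
    // 401 = 0b110010001 encodes as two chunks: 0x91 (data 0010001 with
    // the continue bit set) followed by 0x03.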
One performance issue that might occur here
is that when you're decoding, you
have to check this continue bit for every chunk
and decide what to do based on that continue bit.
And this is actually an unpredictable branch.
So you can suffer from branch mispredictions
from checking this continue bit.
So one way you can optimize this is to get rid of these
continue bits.
And the idea here is to first figure out
how many bytes you need to encode
each integer in the sequence.
Then you group together consecutive integers
that require the same number of bytes to encode,
and you use a run-length encoding idea to encode
all of these integers together with a single header byte.
In the header byte, the lower 6 bits store the size of the group,
and the highest 2 bits store the number of bytes
needed to decode each integer in the group.
And now all of the integers in this group
will just be stored after the header byte,
and we know exactly how many bytes are needed to decode them.
So we don't need to store a continue bit in these chunks.
This does slightly increase the space usage.
But it makes decoding cheaper because we no longer have
to suffer from branch mispredictions
from checking this continue bit.
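Here's a rough C++ sketch of decoding one such group; the exact bit layout of the header byte is an assumption, not something the lecture specifies:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Decode one run-length-encoded group: the header byte's lower 6 bits
    // give the group size, and its upper 2 bits give the number of bytes
    // per integer (stored here as bytes-per-integer minus 1, so 1 to 4
    // bytes; that offset and the little-endian layout are assumptions).
    void decode_group(const std::vector<uint8_t>& in, std::size_t& i,
                      std::vector<uint64_t>& out) {
      uint8_t header = in[i++];
      int group_size = header & 0x3F;            // lower 6 bits
      int bytes_per_int = (header >> 6) + 1;     // upper 2 bits
      for (int k = 0; k < group_size; k++) {
        uint64_t x = 0;
        for (int b = 0; b < bytes_per_int; b++)  // fixed width, no continue bits
          x |= (uint64_t)in[i++] << (8 * b);
        out.push_back(x);
      }
    }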
OK, so now we have to decode these edge lists on the fly
as we're running our algorithm.
If we decoded everything at the beginning,
we wouldn't actually be saving any space.
We need to decode these edges as we access them
in our algorithm.
Since we encoded all of these edge
lists separately for each vertex,
we can decode all of them in parallel.
And each vertex just decodes its edge list sequentially.
But what about high-degree vertices?
If you have a high-degree vertex,
you still have to decode its edge list sequentially.
And if you're running this in parallel,
this could lead to load imbalance.
So one way to fix this is, instead of just encoding
the whole thing sequentially, you can chunk it up
into chunks of size T. And then for each chunk,
you encode it like you did before,
where you store the first value relative to the source vertex
and then all of the other values relative to the previous edge.
And now you can actually decode the first value
here for each of these chunks all in parallel
without having to wait for the previous edge to be decoded.
And then this gives us much more parallelism
because all of these chunks can be decoded in parallel.
And we found that a chunk size T
between 100 and 10,000 works pretty well in practice.
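Here's an illustrative C++ sketch of the chunked encoding, where each chunk's first delta is taken relative to the source so the chunks can be decoded independently:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Chunked delta encoding for a high-degree vertex: split the sorted
    // targets into chunks of size T, and make the first delta in each
    // chunk relative to the source vertex, so that every chunk can be
    // decoded independently (and hence in parallel).
    std::vector<std::vector<int64_t>> chunked_deltas(
        int64_t source, const std::vector<int64_t>& targets, std::size_t T) {
      std::vector<std::vector<int64_t>> chunks;
      for (std::size_t start = 0; start < targets.size(); start += T) {
        std::size_t end = std::min(start + T, targets.size());
        std::vector<int64_t> chunk;
        for (std::size_t i = start; i < end; i++)
          chunk.push_back(i == start ? targets[i] - source
                                     : targets[i] - targets[i - 1]);
        chunks.push_back(chunk);
      }
      return chunks;
    }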
OK, so I'm not going to have time
to go over the experiments.
But at a high level, the experiments
show that compression schemes do save space.
And serially, it's only slightly slower
than the uncompressed version.
But surprisingly, when you run it in parallel,
it actually becomes faster than the uncompressed version.
And this is because these graph algorithms are memory bound.
And since we're using less memory,
we can alleviate this memory subsystem bottleneck
and get better scalability.
And the decoding part of these compressed algorithms
actually gets very good parallel speedup
because they're just doing local operations.
OK, so let me summarize now.
So we saw some properties of real-world graphs.
We saw that they're quite large, but they can still
fit on a multi-core server.
And they're relatively sparse.
They also have a power law degree distribution.
Many graph algorithms are irregular in that they involve
many random memory accesses.
So those random accesses become the performance bottleneck
of these algorithms.
And you can improve performance with algorithmic optimization,
such as using this direction optimization
and also by creating and exploiting
locality, for example, by using this bit vector optimization.
And finally, optimizations for graphs
might work well for certain graphs,
but they might not work well for other graphs.
For example, the direction optimization idea
works well for power law graphs but not for road graphs.
So when you're trying to optimize your graph algorithm,
you should definitely test it on different types of graphs
and see where it works well and where it doesn't.
So that's all I have.
If you have any additional questions,
please feel free to ask me after class.
And as a reminder, we have a guest lecture on Thursday
by Professor Johnson of the MIT Math Department.
And he'll be talking about high-level languages,
so please be sure to attend.