
  • The following content is provided under a Creative

  • Commons license.

  • Your support will help MIT OpenCourseWare

  • continue to offer high-quality educational resources for free.

  • To make a donation or to view additional materials

  • from hundreds of MIT courses, visit MIT OpenCourseWare

  • at ocw.mit.edu.

  • JULIAN SHUN: Hi, good afternoon, everyone.

  • So today, we're going to be talking

  • about graph optimizations.

  • And as a reminder, on Thursday, we're

  • going to have a guest lecture by Professor Johnson of the MIT

  • Math Department.

  • And he'll be talking about performance

  • of high-level languages.

  • So please be sure to attend the guest lecture on Thursday.

  • So here's an outline of what I'm going

  • to be talking about today.

  • So we're first going to remind ourselves what a graph is.

  • And then we're going to talk about various ways

  • to represent a graph in memory.

  • And then we'll talk about how to implement

  • an efficient breadth-first search algorithm, both serially

  • and also in parallel.

  • And then I'll talk about how to use graph compression and graph

  • reordering to improve the locality of graph algorithms.

  • So first of all, what is a graph?

  • So a graph contains vertices and edges,

  • where vertices represent certain objects of interest,

  • and edges between objects model relationships between the two

  • objects.

  • For example, you can have a social network,

  • where the people are represented as vertices

  • and edges between people mean that they're

  • friends with each other.

  • The edges in this graph don't have to be bi-directional.

  • So you could have a one-way relationship.

  • For example, if you're looking at the Twitter network,

  • Alice could follow Bob, but Bob doesn't necessarily

  • have to follow Alice back.

  • The graph also doesn't have to be connected.

  • So here, this graph here is connected.

  • But, for example, there could be some people

  • who don't like to talk to other people.

  • And then they're just off in their own component.

  • You can also use graphs to model protein networks, where

  • the vertices are proteins, and edges between vertices

  • means that there's some sort of interaction

  • between the proteins.

  • So this is useful in computational biology.

  • As I said, edges can be directed,

  • so their relationship can go one way or both ways.

  • In this graph here, we have some directed edges and then also

  • some edges that are directed in both directions.

  • So here, John follows Alice.

  • Alice follows Peter.

  • And then Alice follows Bob, and Bob also follows Alice.

  • If you use a graph to represent the world wide web,

  • then the vertices would be websites,

  • and then the edges would denote that there is a hyperlink

  • from one website to another.

  • And again, the edges here don't have to be bi-directional

  • because website A could have a link to website B.

  • But website B doesn't necessarily

  • have to have a link back.

  • Edges can also be weighted.

  • So you can have a weight on the edge that

  • denotes the strength of the relationship

  • or some sort of distance measure corresponding

  • to that relationship.

  • So here, I have an example where I am using a graph

  • to represent cities.

  • And the edges between cities have

  • a weight that corresponds to the distance between the two

  • cities.

  • And if I want to find the quickest way to get from city A

  • to city B, then I would be interested in finding

  • the shortest path from A to B in this graph here.

  • Here's another example, where the edge weights now

  • are the costs of a direct flight from city A to city B.

  • And here the edges are directed.

  • So, for example, this says that there's

  • a flight from San Francisco to LA for $45.

  • And if I want to find the cheapest

  • way to get from one city to another city,

  • then, again, I would try to find the shortest path in this graph

  • from city A to city B.

  • Vertices and edges can also have metadata on them,

  • and they can also have types.

  • So, for example, here's the Google Knowledge Graph,

  • which represents all the knowledge on the internet

  • that Google knows about.

  • And here, the nodes have metadata on them.

  • So, for example, the node corresponding to da Vinci

  • is labeled with his date of birth and date of death.

  • And the vertices also have a color

  • corresponding to the type of knowledge that they refer to.

  • So you can see that some of these nodes are blue,

  • some of them are red, some of them are green,

  • and some of them have other things on them.

  • So in general, graphs can have types and metadata

  • on both the vertices as well as the edges.

  • Let's look at some more applications of graphs.

  • So graphs are very useful for implementing queries

  • on social networks.

  • So here are some examples of queries

  • that you might want to ask on a social network.

  • So, for example, you might be interested in finding

  • all of your friends who went to the same high school as you

  • on Facebook.

  • So that can be implemented using a graph algorithm.

  • You might also be interested in finding

  • all of the common friends you have with somebody else--

  • again, a graph algorithm.

  • And a social network service might run a graph algorithm

  • to recommend people that you might know and want

  • to become friends with.

  • And they might use a graph algorithm

  • to recommend certain products that you

  • might be interested in.

  • So these are all examples of social network queries.

  • And there are many other queries that you

  • might be interested in running on a social network.

  • And many of them can be implemented

  • using graph algorithms.

  • Another important application is clustering.

  • So here, the goal is to find groups of vertices

  • in a graph that are well-connected

  • internally and poorly-connected externally.

  • So in this image here, each blob of vertices of the same color

  • corresponds to a cluster.

  • And you can see that inside a cluster,

  • there are a lot of edges going among the vertices.

  • And between clusters, there are relatively fewer edges.

  • And some applications of clustering

  • include community detection in social networks.

  • So here, you might be interested in finding

  • groups of people with similar interests or hobbies.

  • You can also use clustering to detect fraudulent websites

  • on the internet.

  • You can use it for clustering documents.

  • So you would cluster documents that

  • have similar text together.

  • And clustering is often used for unsupervised learning

  • and machine learning applications.

  • Another application is connectomics.

  • So connectomics is the study of the structure, the network

  • structure of the brain.

  • And here, the vertices correspond to neurons.

  • And edges between two vertices means

  • that there's some sort of interaction between the two

  • neurons.

  • And recently, there's been a lot of work

  • on trying to do high-performance connectomics.

  • And some of this work has been going on here at MIT

  • by Professor Charles Leiserson and Professor Nir Shavit's

  • research group.

  • So recently, this has been a very hot area.

  • Graphs are also used in computer vision--

  • for example, in image segmentation.

  • So here, you want to segment your image

  • into the distinct objects that appear in the image.

  • And you can construct a graph by representing the pixels

  • as vertices.

  • And then you would place an edge between every pair

  • of neighboring pixels with a weight that corresponds

  • to their similarity.

  • And then you would run some sort of minimum cost cut algorithm

  • to partition your graph into the different objects that

  • appear in the image.

  • So there are many other applications.

  • And I'm not going to have time to go through all of them

  • today.

  • But here's just a flavor of some of the applications of graphs.

  • So any questions so far?

  • OK, so next, let's look at how we can

  • represent a graph in memory.

  • So for the rest of this lecture, I'm

  • going to assume that my vertices are labeled in the range from 0

  • to n minus 1.

  • So they have an integer in this range.

  • Sometimes, your graph might be given to you

  • where the vertices are already labeled in this range,

  • sometimes, not.

  • But you can always get these labels

  • by mapping each of the identifiers

  • to a unique integer in this range.

  • So for the rest of the lecture, I'm

  • just going to assume that we have these labels from 0

  • to n minus 1 for the vertices.

  • One way to represent a graph is to use an adjacency matrix.

  • So this is going to be n by n matrix.

  • And there's a 1 bit in the i-th row and j-th column

  • if there's an edge that goes from vertex i to vertex j,

  • and 0 otherwise.

  • Another way to represent a graph is the edge list representation,

  • where we just store a list of the edges that

  • appear in the graph.

  • So we have one pair for each edge,

  • where the pair contains the two endpoints of that edge.

  • So what is the space requirement for each

  • of these two representations in terms of the number of edges m

  • and the number of vertices n in the graph?

  • So it should be pretty easy.

  • Yes.

  • AUDIENCE: n squared for the [INAUDIBLE]

  • and m for the [INAUDIBLE].

  • JULIAN SHUN: Yes, so the space for the adjacency matrix

  • is order n squared because you have n

  • squared cells in this matrix.

  • And you have 1 bit for each of the cells.

  • For the edge list, it's going to be order m

  • because you have m edges.

  • And for each edge, you're storing a constant amount

  • of data in the edge list.

  • So here's another way to represent a graph.

  • This is known as the adjacency list format.

  • And idea here is that we're going

  • to have an array of pointers, 1 per vertex.

  • And each pointer points to a linked list storing

  • the edges for that vertex.

  • And the linked list is unordered in this example.

  • So what's the space requirement of this representation?

  • AUDIENCE: It's n plus m.

  • JULIAN SHUN: Yeah, so it's going to be order n plus m.

  • And this is because we have n pointers.

  • And the number of entries across all of the linked lists

  • is just equal to the number of edges in the graph, which is m.

  • What's one potential issue with this sort of representation

  • if you think in terms of cache performance?

  • Does anyone see a potential performance issue here?

  • Yeah.

  • AUDIENCE: So it could be [INAUDIBLE].

  • JULIAN SHUN: Right.

  • So the issue here is that if you're

  • trying to loop over all of the neighbors of a vertex,

  • you're going to have to dereference the pointer

  • in every linked list node.

  • Because these are not contiguous in memory.

  • And every time you dereference linked lists node,

  • that's going to be a random access into memory.

  • So that can be bad for cache performance.

  • One way you can improve cache performance

  • is instead of using linked lists for each of these neighbor

  • lists, you can use an array.

  • So now you can store the neighbors just in this array,

  • and they'll be contiguous in memory.

  • One drawback of this approach is that it

  • becomes more expensive if you're trying to update the graph.

  • And we'll talk more about that later.

  • So any questions so far?

  • So what's another way to represent the graph that we've

  • seen in a previous lecture?

  • What's a more compressed or compact way

  • to represent a graph, especially a sparse graph?

  • So does anybody remember the compressed sparse row format?

  • So we looked at this in one of the early lectures.

  • And in that lecture, we used it to store a sparse matrix.

  • But you can also use it to store a sparse graph.

  • And as a reminder, we have two arrays in the compressed sparse

  • row, or CSR format.

  • We have the Offsets array and the Edges array.

  • The Offsets array stores an offset for each vertex

  • into the Edges array, telling us where

  • the edges for that particular vertex

  • begins in the Edges array.

  • So Offsets of i stores the offset

  • of where vertex i's edges start in the Edges array.

  • So in this example, vertex 0 has an offset of 0.

  • So its edges start at position 0 in the Edges array.

  • Vertex 1 has an offset of 4, so its edges

  • start at index 4 in the Edges array.

  • So with this representation, how can we

  • get the degree of a vertex?

  • So we're not storing the degree explicitly here.

  • Can we get the degree efficiently?

  • Yes.

  • AUDIENCE: [INAUDIBLE]

  • JULIAN SHUN: Yeah, so you can get the degree of a vertex

  • just by looking at the difference

  • between the next offset and its own offset.

  • So for vertex 0, you can see that its degree is 4

  • because vertex 1's offset is 4, and vertex 0's offset is 0.

  • And similarly you can do that for all of the other vertices.

  • So what's the space usage of this representation?

  • AUDIENCE: [INAUDIBLE]

  • JULIAN SHUN: Sorry, can you repeat?

  • AUDIENCE: [INAUDIBLE]

  • JULIAN SHUN: Yeah, so again, it's going to be order m plus n

  • because you need order n space for the Offsets array and order

  • m space for the Edges array.
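
As a concrete sketch, here is what a CSR graph and the degree computation might look like in C. The struct and function names are illustrative, not from the lecture; the Offsets array is given n + 1 entries with Offsets[n] == m, which is the convention the lecture adopts later for the BFS code.

```c
#include <stdint.h>

// Compressed sparse row (CSR): Offsets has n+1 entries with
// Offsets[n] == m, and Edges has m entries.
typedef struct {
  int64_t n, m;
  int64_t *Offsets; // Offsets[v] = index in Edges where v's edges begin
  int64_t *Edges;   // neighbor IDs, stored contiguously per vertex
} Graph;

// Degree of v = difference between consecutive offsets.
static inline int64_t Degree(const Graph *G, int64_t v) {
  return G->Offsets[v + 1] - G->Offsets[v];
}
```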

  • You can also store values or weights on the edges.

  • One way to do this is to create an additional array of size m.

  • And then for edge i, you just store the weight

  • or the value in the i-th index of this additional array

  • that you created.

  • If you're always accessing the weight when you access an edge,

  • then it's actually better for cache locality

  • to interleave the weights with the edge targets.

  • So instead of creating two arrays of size m,

  • you have one array of size 2m.

  • And every other entry is the weight.

  • And this improves cache locality because every time

  • you access an edge, its weight is going to be right next to it

  • in memory.

  • And it's going to likely be on the same cache line.

  • So that's one way to improve cache locality.
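
A minimal sketch of that interleaved layout (the array and accessor names here are illustrative, not the lecture's code):

```c
// Interleaved edges and weights: entry 2*i is the i-th edge's target,
// entry 2*i + 1 is its weight, so the two usually share a cache line.
int64_t *EdgesAndWeights; // array of length 2*m

static inline int64_t Target(int64_t i) { return EdgesAndWeights[2 * i]; }
static inline int64_t Weight(int64_t i) { return EdgesAndWeights[2 * i + 1]; }
```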

  • Any questions so far?

  • So let's look at some of the trade-offs

  • in these different graph representations

  • that we've looked at so far.

  • So here, I'm listing the storage costs

  • for each of these representations which

  • we already discussed.

  • This is also the cost for just scanning the whole graph in one

  • of these representations.

  • What's the cost of adding an edge in each

  • of these representations?

  • So for adjacency matrix, what's the cost of adding an edge?

  • AUDIENCE: Order 1.

  • JULIAN SHUN: So for adjacency matrix,

  • it's just order 1 to add an edge.

  • Because you have random access into this matrix,

  • so you just have to access to i, j-th entry

  • and flip the bit from 0 to 1.

  • What about for the edge list?

  • So assuming that the edge list is unordered,

  • so you don't have to keep the list in any sorted order.

  • Yeah.

  • AUDIENCE: I guess it's O of 1.

  • JULIAN SHUN: Yeah, so again, it's just O of 1

  • because you can just add it to the end of the edge list.

  • So that's a constant time.

  • What about for the adjacency list?

  • So actually, this depends on whether we're

  • using linked lists or arrays for the neighbor

  • lists of the vertices.

  • If we're using a linked list, adding an edge just

  • takes constant time because we can just

  • put it at the beginning of the linked list.

  • If we're using an array, then we actually

  • need to create a new array to make space for this edge

  • that we add.

  • And that's going to cost us a degree of v work

  • to do that because we have to copy all the existing edges

  • over to this new array and then add this new edge

  • to the end of that array.

  • Of course, you could amortize this cost

  • across multiple updates.

  • So if you run out of memory, you can

  • double the size of your array so you

  • don't have to create these new arrays too often.

  • But the cost for any individual addition

  • is still relatively expensive compared to, say, an edge list

  • or adjacency matrix.

  • And then finally, for the compressed sparse row format,

  • if you add an edge, in the worst case,

  • it's going to cost us order m plus n work.

  • Because we're going to have to reconstruct the entire Offsets

  • array and the entire Edges array in the worst case.

  • Because we have to put something in and then

  • shift-- in the Edges array, you have

  • to put something in and shift all of the values

  • to the right of that over by one location.

  • And then for the Offsets array, we

  • have to modify the offset for the particular vertex we're

  • adding an edge to and then the offsets for all

  • of the vertices after that.

  • So the compressed sparse row representation

  • is not particularly friendly to edge updates.

  • What about for deleting an edge from some vertex v?

  • So for adjacency matrix, again, it's

  • going to be constant time because you just

  • randomly access the correct entry

  • and flip the bit from 1 to 0.

  • What about for an edge list?

  • AUDIENCE: [INAUDIBLE]

  • JULIAN SHUN: Yeah, so for an edge list, in the worst case,

  • it's going to cost us order m work because the edges are not

  • in any sorted order.

  • So we have to scan through the whole thing in the worst case

  • to find the edge that we're trying to delete.

  • For adjacency list, it's going to take order degree of v work

  • because the neighbors are not sorted.

  • So we have to scan through the whole thing

  • to find this edge that we're trying to delete.

  • And then finally, for a compressed sparse row,

  • it's going to be order m plus n because we're

  • going to have to reconstruct the whole thing in the worst case.

  • What about finding all of the neighbors

  • of a particular vertex v?

  • What's the cost of doing this in the adjacency matrix?

  • AUDIENCE: [INAUDIBLE]

  • JULIAN SHUN: Yes, so it's going to cost us

  • order n work to find all the neighbors

  • of a particular vertex because we just

  • scan the correct row in this matrix, the row

  • corresponding to vertex v. For the edge list,

  • we're going to have to scan the entire edge

  • list because it's not sorted.

  • So in the worst case, that's going to be order m.

  • For adjacency list, that's going to take order degree of v

  • because we can just find a pointer to the linked

  • list for that vertex in constant time.

  • And then we just traverse over the linked list.

  • And that takes order degree of v time.

  • And then finally, for compressed sparse row format,

  • it's also order degree of v because we have constant time

  • access into the appropriate location in the Edges array.

  • And then we can just read off the edges, which

  • are consecutive in memory.

  • So what about finding if a vertex w is a neighbor of v?

  • So I'll just give you the answer.

  • So for the adjacency matrix, it's

  • going to take constant time because again,

  • we just have to check the v-th row in the w-th column

  • and check if the bit is set there.

  • For edge list, we have to traverse the entire list

  • to see if the edge is there.

  • And then for adjacency list and compressed sparse row,

  • it's going to be order degree of v

  • because we just have to scan the neighbor list for that vertex.

  • So these are some graph representations.

  • But there are actually many other graph representations,

  • including variants of the ones that I've talked about here.

  • So, for example, for the adjacency,

  • I said you can either use a linked list or an array

  • to store the neighbor list.

  • But you can actually use a hybrid approach, where

  • you store the linked list, but each linked list node actually

  • stores more than one vertex.

  • So you can store maybe 16 vertices

  • in each linked list node.

  • And that gives us better cache locality.
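
A sketch of such a hybrid node (16 neighbors per node is just the example figure from above; the struct is illustrative):

```c
#define BLOCK 16
// Hybrid adjacency list node: up to BLOCK neighbors stored contiguously
// for spatial locality, plus a pointer to the next block, which keeps
// insertion constant-time like a plain linked list.
typedef struct Node {
  int64_t count;            // how many of the slots are in use
  int64_t neighbors[BLOCK]; // neighbor IDs
  struct Node *next;        // next block, or NULL
} Node;
```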

  • So for the rest of this lecture, I'm

  • going to talk about algorithms that

  • are best implemented using the compressed sparse row format.

  • And this is because we're going to be

  • dealing with sparse graphs.

  • We're going to be looking at static algorithms, where we

  • don't have to update the graph.

  • If we do have to update the graph,

  • then CSR isn't a good choice.

  • But we're just going to be looking at static algorithms

  • today.

  • And then for all the algorithms that we'll be looking at,

  • we're going to need to scan over all the neighbors of a vertex

  • that we visit.

  • And CSR is very good for that because all

  • of the neighbors for a particular vertex

  • are stored contiguously in memory.

  • So any questions so far?

  • OK, I do want to talk about some properties

  • of real-world graphs.

  • So first, we're seeing graphs that are quite large today.

  • But actually, they're not too large.

  • So here are the sizes of some of the real-world graphs

  • out there.

  • So there is a Twitter network.

  • That's actually a snapshot of the Twitter network

  • from a couple of years ago.

  • It has 41 million vertices and 1.5 billion edges.

  • And you can store this graph in about 6.3 gigabytes of memory.

  • So you can probably store it in the main memory of your laptop.

  • The largest publicly available graph out

  • there now is this Common Crawl web graph.

  • It has 3.5 billion vertices and 128 billion edges.

  • So storing this graph requires a little

  • over 1/2 terabyte of memory.

  • It is quite a bit of memory.

  • But it's actually not too big because there are machines out

  • there with main memory sizes in the order of terabytes

  • of memory nowadays.

  • So, for example, you can rent a 2-terabyte or 4-terabyte memory

  • instance on AWS, which you're using for your homework

  • assignments.

  • So if you have any leftover credits

  • at the end of the semester, and you

  • want to play around on this graph,

  • you can rent one of these terabyte machines.

  • Just remember to turn it off when

  • you're done because it's kind of expensive.

  • Another property of real-world graphs

  • is that they're quite sparse.

  • So m tends to be much less than n squared.

  • So most of the possible edges are not actually there.

  • And finally, the degree distributions of the vertices

  • can be highly skewed in many real-world graphs.

  • So here I'm plotting the degree on the x-axis

  • and the number of vertices with that particular degree

  • on the y-axis.

  • And we can see that it's highly skewed.

  • And, for example, in a social network, most of the people

  • would be on the left-hand side, so their degree is not

  • that high.

  • And then we have some very popular people

  • on the right-hand side, where their degree is very high,

  • but we don't have too many of those.

  • So this is what's known as a power law degree distribution.

  • And there have been various studies

  • that have shown that many real-world graphs have

  • approximately a power law degree distribution.

  • And mathematically, this means that the number

  • of vertices with degree d is proportional to d

  • to the negative p.

  • So negative p is the exponent.

  • And for many graphs, the value of p lies between 2 and 3.
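
Written as a formula, this is just the statement above in symbols:

$$\#\{v : \deg(v) = d\} \;\propto\; d^{-p}, \qquad \text{typically } 2 < p < 3.$$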

  • And this power law degree distribution

  • does have implications when we're

  • trying to implement parallel algorithms to process

  • these graphs.

  • Because with graphs that have a skewed degree distribution,

  • you could run into load imbalance issues.

  • If you just parallelize across the vertices,

  • the number of edges they have can vary significantly.

  • Any questions?

  • OK, so now let's talk about how we can implement a graph

  • algorithm.

  • And I'm going to talk about the breadth-first search algorithm.

  • So how many of you have seen breadth-first search before?

  • OK, so about half of you.

  • I did talk about breadth-first search in a previous lecture,

  • so I was hoping everybody would raise their hands.

  • OK, so as a reminder, in the BFS algorithm,

  • we're given a source vertex s, and we

  • want to visit the vertices in order of their distance

  • from the source s.

  • And there are many possible outputs

  • that we might care about.

  • One possible output is, we just want

  • to report the vertices in the order

  • that they were visited by the breadth-first search traversal.

  • So let's say we have this graph here.

  • And our source vertex is D. So what's

  • one possible order in which we can traverse these vertices?

  • Now, I should specify that we should

  • traverse this graph in a breadth-first search manner.

  • So what's the first vertex we're going to explore?

  • AUDIENCE: D.

  • JULIAN SHUN: D. So we're first going

  • to look at D because that's our source vertex.

  • The second vertex, we can actually

  • choose between B, C, and E because all we care about

  • is that we're visiting these vertices

  • in the order of their distance from the source.

  • But these three vertices are all of the same distance.

  • So let's just pick B, C, and then E.

  • And then finally, I'm going to visit

  • vertex A, which has a distance of 2 from the source.

  • So this is one possible solution.

  • There are other possible solutions

  • because we could have visited E before we visited B and so on.

  • Another possible output that we might care about

  • is we might want to report the distance from each vertex

  • to the source vertex s.

  • So in this example here are the distances.

  • So D has a distance of 0; B, C, and E all have a distance of 1;

  • and A has a distance of 2.

  • We might also want to generate a breadth-first search tree where

  • each vertex in the tree has a parent which

  • is a neighbor in the previous level

  • of the breadth-first search.

  • Or in other words, the parent should

  • have a distance of 1 less than that vertex itself.

  • So here's an example of a breadth-first search tree.

  • And we can see that each of the vertices

  • has a parent whose breadth-first search distance is 1 less

  • than itself.

  • So the algorithms that I'm going to be talking about today

  • will generate the distances as well as the BFS tree.

  • And BFS actually has many applications.

  • So it's used as a subroutine in betweenness

  • centrality, which is a very popular graph mining

  • algorithm used to rank the importance of nodes

  • in a network.

  • And the importance of nodes here corresponds

  • to how many shortest paths go through that node.

  • Other applications include eccentricity estimation,

  • maximum flows.

  • Some max flow algorithms use BFS as a subroutine.

  • You can use BFS to crawl the web,

  • do cycle detection, garbage collection, and so on.

  • So let's now look at a serial BFS algorithm.

  • And here, I'm just going to show the pseudocode.

  • So first, we're going to initialize the distances

  • to all INFINITY.

  • And we're going to initialize the parents to be NIL.

  • And then we're going to create queue data structure.

  • We're going to set the distance of the root

  • to be 0 because the root has a distance of 0 to itself.

  • And then we're going to place the root onto this queue.

  • And then, while the queue is not empty,

  • we're going to dequeue the first thing in the queue.

  • We're going to look at all the neighbors of the current vertex

  • that we dequeued.

  • And for each neighbor, we're going

  • to check if its distance is INFINITY.

  • If the distance is INFINITY, that

  • means we haven't explored that neighbor yet.

  • So we're going to go ahead and explore it.

  • And we do so by setting its distance value

  • to be the current vertex's distance plus 1.

  • We're going to set the parent of that neighbor

  • to be the current vertex.

  • And then we'll place the neighbor onto the queue.

  • So it's some pretty simple algorithm.

  • And we're just going to keep iterating in this while loop

  • until there are no more vertices left in the queue.

  • So what's the work of this algorithm in terms of n and m?

  • So how much work are we doing per edge?

  • Yes.

  • AUDIENCE: [INAUDIBLE]

  • JULIAN SHUN: Yeah, so assuming that the enqueue and dequeue

  • operations are constant time, then

  • we're doing constant amount of work per edge.

  • So summed across all edges, that's going to be order m.

  • And then we're also doing a constant amount

  • of work per vertex because we have to basically place it

  • onto the queue and then take it off the queue,

  • and then also initialize their value.

  • So the overall work is going to be order m plus n.

  • OK, so let's now look at some actual code

  • to implement the serial BFS algorithm using

  • the compressed sparse row format.

  • So first, I'm going to initialize two arrays--

  • parent and queue.

  • And these are going to be integer arrays of size n.

  • I'm going to initialize all of the parent entries

  • to be negative 1.

  • I'm going to place a source vertex onto the queue.

  • So it's going to appear at queue of 0, that's

  • the beginning of the queue.

  • And then I'll set the parent of the source vertex

  • to be the source itself.

  • And then I also have two integers

  • that point to the front and the back of the queue.

  • So initially, the front of the queue is at position 0,

  • and the back is at position 1.

  • And then while the queue is not empty--

  • and I can check that by checking if q_front is not

  • equal to q_back--

  • then I'm going to dequeue the first vertex in my queue.

  • I'm going to set current to be that vertex.

  • And then I'll increment q_front.

  • And then I'll compute the degree of that vertex, which

  • I can do by looking at the difference

  • between consecutive offsets.

  • And I also assume that Offsets of n

  • is equal to m, just to deal with the last vertex.

  • And then I'm going to loop through all of the neighbors

  • for the current vertex.

  • And to access each neighbor, what I do

  • is I go into the Edges array.

  • And I know that my neighbors start at Offsets of current.

  • And therefore, to get the i-th neighbor,

  • I just do Offsets of current plus i.

  • That's my index into the Edges array.

  • Now I'm going to check if my neighbor has been explored yet.

  • And I can check that by checking if parent of neighbor

  • is equal to negative 1.

  • If it is, that means I haven't explored it yet.

  • And then I'll set a parent of neighbor to be current.

  • And then I'll place the neighbor onto the back of the queue

  • and increment q_back.

  • And I'm just going to keep repeating this while loop

  • until it becomes empty.

  • And here, I'm only generating the parent pointers.

  • But I could also generate the distances

  • if I wanted to with just a slight modification

  • of this code.
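
Putting that description together, a minimal reconstruction of the serial BFS in C might look like the following. The function signature and the malloc are my assumptions; the variable names follow the lecture, and Offsets is assumed to have n + 1 entries with Offsets[n] == m.

```c
#include <stdint.h>
#include <stdlib.h>

// Serial BFS over a CSR graph, producing parent pointers.
void bfs_serial(int64_t n, const int64_t *Offsets, const int64_t *Edges,
                int64_t source, int64_t *parent) {
  int64_t *queue = malloc(n * sizeof(int64_t));
  for (int64_t i = 0; i < n; i++) parent[i] = -1;
  queue[0] = source;       // the source starts the queue
  parent[source] = source; // the source is its own parent
  int64_t q_front = 0, q_back = 1;
  while (q_front != q_back) {                    // queue not empty
    int64_t current = queue[q_front++];          // dequeue
    int64_t degree = Offsets[current + 1] - Offsets[current];
    for (int64_t i = 0; i < degree; i++) {
      int64_t ngh = Edges[Offsets[current] + i]; // i-th neighbor
      if (parent[ngh] == -1) {                   // unexplored?
        parent[ngh] = current;
        queue[q_back++] = ngh;                   // enqueue neighbor
      }
    }
  }
  free(queue);
}
```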

  • So any questions on how this code works?

  • OK, so here's a question.

  • What's the most expensive part of the code?

  • Can you point to one particular line here

  • that is the most expensive?

  • Yes.

  • AUDIENCE: I'm going to guess the [INAUDIBLE] that's gonna be all

  • over the place in terms of memory locations--

  • ngh equals Edges.

  • JULIAN SHUN: OK, so actually, it turns out

  • that that's not the most expensive part of this code.

  • But you're close.

  • So anyone have any other ideas?

  • Yes.

  • AUDIENCE: Is it looking up the parent array?

  • JULIAN SHUN: Yes, so it turns out that this line here,

  • where we're accessing parent of neighbor,

  • that turns out to be the most expensive.

  • Because whenever we access this parent array,

  • the neighbor can appear anywhere in memory.

  • So that's going to be a random access.

  • And if the parent array doesn't fit in our cache,

  • then that's going to cost us a cache miss almost every time.

  • This Edges array is actually mostly accessed sequentially.

  • Because for each vertex, all of its edges

  • are stored contiguously in memory,

  • we do have one random access into the Edges array per vertex

  • because we have to look up the starting location

  • for that vertex.

  • But it's not 1 per edge, unlike this check of the parent array.

  • That occurs for every edge.

  • So does that make sense?

  • So let's do a back-of-the-envelope

  • calculation to figure out how many cache misses we would

  • incur, assuming that we started with a cold cache.

  • And we also assume that n is much larger

  • than the size of the cache, so we can't fit

  • any of these arrays into cache.

  • We'll assume that a cache line has 64 bytes,

  • and integers are 4 bytes each.

  • So let's try to analyze this.

  • So the initialization will cost us n/16 cache misses.

  • And the reason here is that we're initializing this array

  • sequentially.

  • So we're accessing contiguous locations.

  • And this can take advantage of spatial locality.

  • On each cache line, we can fit 16 of the integers.

  • So overall, we're going to need n/16 cache misses just

  • to initialize this array.

  • We also need n/16 cache misses across the entire algorithm

  • to dequeue the vertex from the front of the queue.

  • Because again, this is going to be a sequential access

  • into this queue array.

  • And across all vertices, that's going to be n/16

  • cache misses because we can fit 16 integers on a cache line.

  • To compute the degree here, that's

  • going to take n cache misses overall.

  • Because each of these accesses to Offsets

  • array is going to be a random access.

  • Because we have no idea what the value of current here is.

  • It could be anything.

  • So across the entire algorithm, we're

  • going to need n cache misses to access this Offsets array.

  • And then to access this Edges array,

  • I claim that we're going to need at most 2n plus m/16 cache

  • misses.

  • So does anyone see where that bound comes from?

  • So where does the m/16 come from?

  • Yeah.

  • AUDIENCE: You have to access that at least once for an edge.

  • JULIAN SHUN: Right, so you have to pay m/16 because you're

  • accessing every edge once.

  • And you're accessing the Edges contiguously.

  • So therefore, across all Edges, that's

  • going to take m/16 cache misses.

  • But we also have to add 2n.

  • Because whenever we access the Edges for a particular vertex,

  • the first cache line might not only

  • contain that vertex's edges.

  • And similarly, the last cache line

  • that we access might also not just contain

  • that vertex's edges.

  • So therefore, we're going to waste the first cache

  • line and the last cache line in the worst case for each vertex.

  • And summed cross all vertices, that's going to be 2n.

  • So this is the upper bound, 2n plus m/16.

  • Accessing this parent array, that's

  • going to be a random access every time.

  • So we're going to incur a cache miss

  • in the worst case every time.

  • So summed across all edge accesses,

  • that's going to be m cache misses.

  • And then finally, we're going to pay

  • n/16 cache misses to enqueue the neighbor onto the queue

  • because these are sequential accesses.

  • So in total, we're going to incur at most 51/16 n

  • plus 17/16 m cache misses.

  • And if m is greater than 3n, then the second term

  • here is going to dominate.

  • And m is usually greater than 3n in most real-world graphs.

  • And the second term here is dominated by this random access

  • into the parent array.
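
Summing the per-array counts above (initialization, dequeue, Offsets, Edges, parent, enqueue, in that order) gives the total:

$$\frac{n}{16} + \frac{n}{16} + n + \Big(2n + \frac{m}{16}\Big) + m + \frac{n}{16} \;=\; \frac{51}{16}\,n + \frac{17}{16}\,m.$$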

  • So let's see if we can optimize this code so that we

  • get better cache performance.

  • So let's say we could fit a bit vector of size n into cache.

  • But we couldn't fit the entire parent array into cache.

  • What can we do to reduce the number of cache misses?

  • So does anyone have any ideas?

  • Yeah.

  • AUDIENCE: Use a bit vector to keep track of which

  • vertices have parents, then [INAUDIBLE]?

  • JULIAN SHUN: Yeah, so that's exactly correct.

  • So we're going to use a bit vector

  • to store whether the vertex has been explored yet or not.

  • So we only need 1 bit for that.

  • We're not storing the parent ID in this bit vector.

  • We're just storing a bit to say whether that vertex has

  • been explored yet or not.

  • And then, before we check this parent array,

  • we're going to first check the bit vector

  • to see if that vertex has been explored yet.

  • And if it has been explored yet, we

  • don't even need to access this parent array.

  • If it hasn't been explored, then we

  • won't go ahead and access the parent entry of the neighbor.

  • But we only have to do this one time

  • for each vertex in the graph because we can only

  • visit each vertex once.

  • And therefore, we can reduce the number of cache

  • misses from m down to n.

  • So overall, this might improve the number of cache misses.

  • In fact, it does if the number of edges

  • is large enough relative to the number of vertices.

  • However, you do have to do a little bit more computation

  • because you have to do bit vector manipulation to check

  • this bit vector and then also to set the bit vector when

  • you explore a neighbor.

  • So here's the code using the bit vector optimization.

  • So here, I'm initializing this bit vector called visited.

  • It's of size, approximately, n/32.

  • And then I'm setting all of the bits

  • to 0, except for the source vertex, where

  • I'm going to set its bit to 1.

  • And I'm doing this bit calculation here

  • to figure out the bit for the source vertex.

  • And then now, when I'm trying to visit a neighbor,

  • I'm first going to check if the neighbor is visited

  • by checking this bit array.

  • And I can do this using this computation here--

  • AND visited of neighbor over 32 with this mask--

  • 1 left shifted by neighbor mod 32.

  • And if that's false, that means the neighbor

  • hasn't been visited yet.

  • So I'll go inside this IF clause.

  • And then I'll set the visited bit

  • to be true using this statement here.

  • And then I do the same operations as I did before.

  • It turns out that this version is

  • faster for large enough values of m

  • relative to n because you reduce the number of cache

  • misses overall.

  • You still have to do this extra computation here,

  • this bit manipulation.

  • But if m is large enough, then the reduction

  • in number of cache misses outweighs

  • the additional computation that you have to do.
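
As a sketch, the bit vector check and update described above might look like this in C. The names are illustrative, and the word count rounds n up to a multiple of 32.

```c
#include <stdint.h>
#include <stdlib.h>

uint32_t *visited; // bit (v % 32) of word (v / 32) marks vertex v

void init_visited(int64_t n, int64_t source) {
  int64_t words = (n + 31) / 32;             // roughly n/32 words
  visited = calloc(words, sizeof(uint32_t)); // all bits start at 0
  visited[source / 32] |= (uint32_t)1 << (source % 32); // mark source
}

// Inside the edge loop, check the (cache-resident) bit vector first,
// and only touch the parent array for unvisited neighbors:
//
//   if (!(visited[ngh / 32] & ((uint32_t)1 << (ngh % 32)))) {
//     visited[ngh / 32] |= (uint32_t)1 << (ngh % 32);
//     parent[ngh] = current;
//     queue[q_back++] = ngh;
//   }
```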

  • Any questions?

  • OK, so that was a serial implementation

  • of breadth-first search.

  • Now let's look at a parallel implementation.

  • So I'm first going to do an animation

  • of how a parallel breadth-first search algorithm would work.

  • The parallel breadth-first search algorithm

  • is going to operate on frontiers,

  • where the initial frontier contains just a source vertex.

  • And on every iteration, I'm going

  • to explore all of the vertices on the frontier

  • and then place any unexplored neighbors

  • onto the next frontier.

  • And then I move on to the next frontier.

  • So in the first iteration, I'm going

  • to mark the source vertex as explored,

  • set its distance to be 0, and then place

  • the neighbors of that source vertex onto the next frontier.

  • In the next iteration, I'm going to do the same thing, set

  • these distances to 1.

  • I also am going to generate a parent pointer

  • for each of these vertices.

  • And this parent should come from the previous frontier,

  • and it should be a neighbor of the vertex.

  • And here, there's only one option,

  • which is the source vertex.

  • So I'll just pick that as the parent.

  • And then I'm going to place the neighbors

  • onto the next frontier again, mark those as explored,

  • set their distances, and generate a parent

  • pointer again.

  • And notice here, when I'm generating these parent

  • pointers, there's actually more than one choice

  • for some of these vertices.

  • And this is because there are multiple vertices

  • on the previous frontier.

  • And some of them explored the same neighbor

  • on the current frontier.

  • So a parallel implementation has to be

  • aware of this potential race.

  • Here, I'm just picking an arbitrary parent.

  • So as we see here, you can process each

  • of these frontiers in parallel.

  • So you can parallelize over all of the vertices on the frontier

  • as well as all of their outgoing edges.

  • However, you do need to process one frontier before you

  • move on to the next one in this BFS algorithm.

  • And a parallel implementation has

  • to be aware of potential races.

  • So as I said earlier, we could have multiple vertices

  • on the frontier trying to visit the same neighbors.

  • So somehow, that has to be resolved.

  • And also, the amount of work on each frontier

  • is changing throughout the course of the algorithm.

  • So you have to be careful with load balancing.

  • Because you have to make sure that the amount of work

  • each processor has to do is about the same.

  • If you use Cilk to implement this,

  • then load balancing doesn't really become a problem.

  • So any questions on the BFS algorithm

  • before I go over the code?

  • OK, so here's the actual code.

  • And here I'm going to initialize these four arrays, so

  • the parent array, which is the same as before.

  • I'm going to have an array called frontier, which

  • stores the current frontier.

  • And then I'm going to have an array

  • called frontierNext, which is a temporary array

  • that I use to store the next frontier of the BFS.

  • And then also I have an array called degrees.

  • I'm going to initialize all of the parent entries

  • to be negative 1.

  • I do that using a cilk_for loop.

  • I'm going to place the source vertex at the 0-th index

  • of the frontier.

  • I'll set the frontierSize to be 1.

  • And then I set the parent of the source to be the source itself.

  • While the frontierSize is greater than 0,

  • that means I still have more work to do.

  • I'm going to first iterate over all

  • of the vertices on my frontier in parallel using a cilk_for

  • loop.

  • And then I'll set the i-th entry of the degrees array

  • to be the degree of the i-th vertex on the frontier.

  • And I can do this just using the difference

  • between consecutive offsets.

  • And then I'm going to perform a prefix sum on this degrees

  • array.

  • And we'll see in a minute why I'm doing this prefix sum.

  • But first of all, does anybody recall what prefix sum is?

  • So who knows what prefix sum is?

  • Do you want to tell us what it is?

  • AUDIENCE: That's the sum array where index i is the sum of

  • [INAUDIBLE].

  • JULIAN SHUN: Yeah, so prefix sum--

  • so here I'm going to demonstrate this with an example.

  • So let's say this is our input array.

  • The output of this array would store for each location

  • the sum of everything before that location in the input

  • array.

  • So here we see that the first position has a value of 0

  • because a sum of everything before it is 0.

  • There's nothing before it in the input.

  • The second position has a value of 2

  • because the sum of everything before it is just

  • the first location.

  • The third location has a value of 6

  • because the sum of everything before it is 2

  • plus 4, which is 6, and so on.

  • So I believe this was on one of your homework assignments.

  • So hopefully, everyone knows what prefix sum is.

  • And later on, we'll see how we use

  • this to do the parallel breadth-first search.
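
For reference, here is the semantics as a serial sketch; the algorithm actually uses a parallel prefix sum with logarithmic span, like the one from the homework. Reading the input element before writing lets this version also run in place.

```c
#include <stdint.h>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1]; returns the
// total. Safe to call with out == in (in place).
int64_t prefix_sum(const int64_t *in, int64_t *out, int64_t len) {
  int64_t total = 0;
  for (int64_t i = 0; i < len; i++) {
    int64_t x = in[i]; // read before out[i] overwrites it when in == out
    out[i] = total;
    total += x;
  }
  return total;
}
```

On an input beginning [2, 4, ...], this produces [0, 2, 6, ...], matching the picture above.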

  • OK, so I'm going to do a prefix sum on this degrees array.

  • And then I'm going to loop over my frontier again in parallel.

  • I'm going to let v be the i-th vertex on the frontier.

  • Index is going to be equal to degrees of i.

  • And then my degree is going to be Offsets of v

  • plus 1 minus Offsets of v.

  • Now I'm going to loop through all v's neighbors.

  • And here I just have a serial for loop.

  • But you could actually parallelize this for loop.

  • It turns out that if the number of iterations in the for loop

  • is small enough, there's additional overhead

  • to making this parallel, so I just made it serial for now.

  • But you could make it parallel.

  • To get the neighbor, I just index into this Edges array.

  • I look at Offsets of v plus j.

  • Then now I'm going to check if the neighbor has

  • been explored yet.

  • And I can check if parent of neighbor

  • is equal to negative 1.

  • So that means it hasn't been explored yet, so I'm

  • going to try to explore it.

  • And I do so using a compare-and-swap.

  • I'm going to try to swap in the value of v

  • with the original value of negative 1

  • in parent of neighbor.

  • And the compare-and-swap is going

  • to return true if it was successful and false otherwise.

  • And if it returns true, that means

  • this vertex becomes the parent of this neighbor.

  • And then I'll place the neighbor on

  • to frontierNext at this particular index--

  • index plus j.

  • And otherwise, I'll set a negative 1 at that location.

  • OK, so let's see why I'm using index plus j here.

  • So here's how frontierNext is organized.

  • So each vertex on the frontier owns

  • a subset of these locations in the frontierNext array.

  • And these are all contiguous memory locations.

  • And it turns out that the starting location

  • for each of these vertices in this frontierNext array

  • is exactly the value in this prefix sum array up here.

  • So vertex 1 has its first location at index 0.

  • Vertex 2 has its first location at index 2.

  • Vertex 3 has its first location at index 6, and so on.

  • So by using a prefix sum, I can guarantee

  • that all of these vertices have a disjoint subarray

  • in this frontierNext array.

  • And then they can all write to this frontierNext array

  • in parallel without any races.

  • And index plus j just gives us the right location

  • to write to in this array.

  • So index is the starting location,

  • and then j is for the j-th neighbor.

  • So here is one potential output after we write

  • to this frontierNext array.

  • So we have some non-negative values.

  • And these are vertices that we explored in this iteration.

  • We also have some negative 1 values.

  • And the negative 1 here means that either the vertex has

  • already been explored in a previous iteration,

  • or we tried to explore it in the current iteration,

  • but somebody else got there before us.

  • Because somebody else is doing the compare-and-swap

  • at the same time, and they could have finished before we did,

  • so we failed on the compare-and-swap.

  • So we don't actually want these negative 1 values, so we're

  • going to filter them out.

  • And we can filter them out using a prefix sum again.

  • And this is going to give us a new frontier.

  • And we'll set the frontierSize equal to the size

  • of this new frontier.

  • And then we repeat this while loop

  • until there are no more vertices on the frontier.
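
Here is a condensed sketch of one round of this parallel BFS in Cilk-style C. It reuses the prefix_sum helper sketched earlier, the packing of frontierNext into the new frontier is shown separately below, and __sync_bool_compare_and_swap (a GCC builtin) stands in for the lecture's compare-and-swap; the function signature is my assumption.

```c
#include <stdint.h>
#include <cilk/cilk.h>

// One BFS round: explores all outgoing edges of the current frontier.
// degrees and frontierNext are preallocated scratch arrays.
int64_t bfs_round(const int64_t *Offsets, const int64_t *Edges,
                  int64_t *parent, const int64_t *frontier,
                  int64_t frontierSize, int64_t *frontierNext,
                  int64_t *degrees) {
  // Gather the degree of each frontier vertex.
  cilk_for (int64_t i = 0; i < frontierSize; i++) {
    int64_t v = frontier[i];
    degrees[i] = Offsets[v + 1] - Offsets[v];
  }
  // Prefix sum gives each vertex a disjoint slice of frontierNext.
  int64_t total = prefix_sum(degrees, degrees, frontierSize);

  cilk_for (int64_t i = 0; i < frontierSize; i++) {
    int64_t v = frontier[i];
    int64_t index = degrees[i]; // start of v's slice in frontierNext
    int64_t deg = Offsets[v + 1] - Offsets[v];
    for (int64_t j = 0; j < deg; j++) { // could also be parallelized
      int64_t ngh = Edges[Offsets[v] + j];
      // The compare-and-swap resolves the race between frontier
      // vertices trying to become the parent of the same neighbor.
      if (parent[ngh] == -1 &&
          __sync_bool_compare_and_swap(&parent[ngh], -1, v))
        frontierNext[index + j] = ngh; // we won: ngh joins next frontier
      else
        frontierNext[index + j] = -1;  // explored, or lost the race
    }
  }
  return total; // length of frontierNext; the -1s get filtered out next
}
```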

  • So any questions on this parallel BFS algorithm?

  • Yeah.

  • AUDIENCE: Can you go over like the last [INAUDIBLE]?

  • JULIAN SHUN: Do you mean the filter out?

  • AUDIENCE: Yeah.

  • JULIAN SHUN: Yeah, so what you can do

  • is, you can create another array, which stores a 1

  • in location i if that location is not a negative 1 and 0

  • if it is a negative 1.

  • Then you do a prefix sum on that array,

  • which gives us unique offsets into an output array.

  • So then everybody just looks at the prefix sum array there.

  • And then it writes to the output array.

  • So it might be easier if I tried to draw this on the board.

  • OK, so let's say we have an array of size 5 here.

  • So what I'm going to do is I'm going

  • to generate another array which stores

  • a 1 if the value in the corresponding location

  • is not a negative 1 and 0 otherwise.

  • And then I do a prefix sum on this array here.

  • And this gives me 0, 1, 1, 2, and 2.

  • And now each of these values that are not negative 1,

  • they can just look up the corresponding index

  • in this output array.

  • And this gives us a unique index into an output array.

  • So this element will write to position 0,

  • this element would write to position 1,

  • and this element would write to position 2 in my final output.

  • So this would be my final frontier.
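
In code, this filter (a pack) might look like the following sketch, reusing the prefix_sum helper from before; flags is illustrative scratch space.

```c
#include <stdint.h>
#include <cilk/cilk.h>

// Pack the non-(-1) entries of frontierNext[0..len) into frontier;
// returns the new frontier size. flags is scratch space of length len.
int64_t filter_frontier(const int64_t *frontierNext, int64_t len,
                        int64_t *frontier, int64_t *flags) {
  cilk_for (int64_t i = 0; i < len; i++)
    flags[i] = (frontierNext[i] != -1);  // 1 = keep, 0 = drop
  int64_t newSize = prefix_sum(flags, flags, len); // unique offsets
  cilk_for (int64_t i = 0; i < len; i++)
    if (frontierNext[i] != -1)              // each kept entry writes to
      frontier[flags[i]] = frontierNext[i]; // its unique slot, race-free
  return newSize;
}
```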

  • Does that make sense?

  • OK, so let's now analyze the working span

  • of this parallel BFS algorithm.

  • So a number of iterations required by the BFS algorithm

  • is upper-bounded by the diameter D of the graph.

  • And the diameter of a graph is just the maximum shortest

  • path between any pair of vertices in the graph.

  • And that's an upper bound on the number of iterations

  • we need to do.

  • Each iteration is going to take a log m

  • span for the cilk_for loops, the prefix sum, and the filter.

  • And this is also assuming that the inner loop

  • is parallelized, the inner loop over the neighbors of a vertex.

  • So to get the span, we just multiply these two terms.

  • So we get theta of D times log m span.

  • What about the work?

  • So to compute the work, we have to figure out

  • how much work we're doing per vertex and per edge.

  • So first, notice that the sum of the frontier

  • sizes across the entire algorithm is going

  • to be n because each vertex can be on the frontier at most

  • once.

  • Also, each edge is going to be traversed exactly once.

  • So that leads to m total edge visits.

  • On each iteration of the algorithm,

  • we're doing a prefix sum.

  • And the cost of this prefix sum is

  • going to be proportional to the frontier size.

  • So summed across all iterations, the cost of the prefix

  • sum is going to be theta of n.

  • We also have to do this filter.

  • But the work of the filter is proportional to the number

  • of edges traversed in that iteration.

  • And summed across all iterations, that's

  • going to give theta of m total.

  • So overall, the work is going to be

  • theta of n plus m for this parallel BFS algorithm.

  • So this is a work-efficient algorithm.

  • The work matches that of the serial algorithm.
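
As formulas, with D the diameter of the graph:

$$\text{work} = \Theta(n + m), \qquad \text{span} = \Theta(D \log m).$$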

  • Any questions on the analysis?

  • OK, so let's look at how this parallel BFS

  • algorithm runs in practice.

  • So here, I ran some experiments on a random graph

  • with 10 million vertices and 100 million edges.

  • And the edges were randomly generated.

  • And I made sure that each vertex had 10 edges.

  • I ran experiments on a 40-core machine

  • with 2-way hyperthreading.

  • Does anyone know what hyperthreading is?

  • Yeah, what is it?

  • AUDIENCE: It's when you have like one CPU core that

  • can execute two instruction streams at the same time

  • so it can [INAUDIBLE] hide memory latency.

  • JULIAN SHUN: Yeah, so that's a great answer.

  • So hyperthreading is an Intel technology

  • where for each physical core, the operating system actually

  • sees it as two logical cores.

  • They share many of the same resources,

  • but they have their own registers.

  • So if one of the logical cores stalls on a long latency

  • operation, the other logical core

  • can use the shared resources and hide some of the latency.

  • OK, so here I am plotting the speedup

  • over the single-threaded time of the parallel algorithm

  • versus the number of threads.

  • So we see that on 40 threads, we get

  • a speedup of about 22 or 23X.

  • And when we turn on hyperthreading

  • and use all 80 threads, the speedup is about 32 times

  • on 40 cores.

  • And this is actually pretty good for a parallel graph algorithm.

  • It's very hard to get very good speedups

  • on these irregular graph algorithms.

  • So 32X on 40 cores is pretty good.

  • I also compared this to the serial BFS algorithm

  • because that's what we ultimately

  • want to compare against.

  • So we see that on 80 threads, the speedup over the serial BFS

  • is about 21, 22X.

  • And the serial BFS is 54% faster than the parallel BFS

  • on one thread.

  • This is because it's doing less work than the parallel version.

  • The parallel version has to do extra work with the prefix

  • sum and the filter, whereas the serial version doesn't

  • have to do that.

  • But overall, the parallel implementation

  • is still pretty good.

  • OK, questions?

  • So a couple of lectures ago, we saw this slide here.

  • So Charles told us never to write

  • nondeterministic parallel programs because it's

  • very hard to debug these programs and hard to reason

  • about them.

  • So is there nondeterminism in this BFS code

  • that we looked at?

  • AUDIENCE: You have nondeterminism

  • in the compare-and-swap.

  • JULIAN SHUN: Yeah, so there's nondeterminism

  • in the compare-and-swap.

  • So let's go back to the code.

  • So this compare-and-swap here, there's

  • a race there because we get multiple vertices trying

  • to write to the parent entry of the neighbor at the same time.

  • And the one that wins is nondeterministic.

  • So the BFS tree that you get at the end is nondeterministic.

  • OK, so let's see how we can try to fix this nondeterminism.

  • OK so, as we said, this is a line

  • that causes the nondeterminism.

  • It turns out that we can actually make the output BFS

  • tree deterministic by going over

  • the outgoing edges in each iteration in two phases.

  • So how this works is that in the first phase,

  • the vertices on the frontier are not actually

  • going to write to the parent array.

  • Or they are going to write, but they're

  • going to be using this writeMin operator.

  • And the writeMin operator is an atomic operation

  • that guarantees that we have concurrent writes

  • to the same location.

  • The smallest value gets written there.

  • So the value that gets written there

  • is going to be deterministic.

  • It's always going to be the smallest

  • one that tries to write there.

  • Then in the second phase, each vertex

  • is going to check for each neighbor

  • whether a parent of neighbor is equal to v. If it is,

  • that means it was the vertex that successfully wrote

  • to parent of neighbor in the first phase.

  • And therefore, it's going to be responsible for placing

  • this neighbor onto the next frontier.

  • And we're also going to set parent of neighbor

  • to be negative v. This is just a minor detail.

  • And this is because when we're doing this writeMin operator,

  • we could have a future iteration where a lower vertex tries

  • to visit the same vertex that we already explored.

  • But if we set this to a negative value,

  • we're only going to be writing non-negative values

  • to this location.

  • So the writeMin on a neighbor that has already been explored

  • would never succeed.
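
A sketch of the two phases (writeMin is implemented further below). One assumption beyond the lecture's description: unexplored entries of the parent array hold a large sentinel such as INT64_MAX rather than -1, so that any vertex ID wins the first writeMin, while explored entries are negative and can never be beaten by a non-negative ID.

```c
#include <stdint.h>
#include <cilk/cilk.h>

void writeMin(int64_t *addr, int64_t newval); // defined further below

// One deterministic BFS round (sketch).
void deterministic_round(const int64_t *Offsets, const int64_t *Edges,
                         int64_t *parent, const int64_t *frontier,
                         int64_t frontierSize) {
  // Phase 1: the smallest frontier vertex contending for each
  // neighbor deterministically wins the writeMin.
  cilk_for (int64_t i = 0; i < frontierSize; i++) {
    int64_t v = frontier[i];
    for (int64_t j = Offsets[v]; j < Offsets[v + 1]; j++)
      writeMin(&parent[Edges[j]], v);
  }
  // Phase 2: each winner places the neighbor on the next frontier and
  // negates its parent entry so later writeMins can never succeed.
  cilk_for (int64_t i = 0; i < frontierSize; i++) {
    int64_t v = frontier[i];
    for (int64_t j = Offsets[v]; j < Offsets[v + 1]; j++) {
      int64_t ngh = Edges[j];
      if (parent[ngh] == v) {
        parent[ngh] = -v; // mark explored (negative, per the lecture)
        // ... place ngh at this edge's slot in frontierNext ...
      }
    }
  }
}
```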

  • OK, so the final BFS tree that's generated by this code

  • is always going to be the same every time you run it.

  • I want to point out that this code is still

  • nondeterministic with respect to the order

  • in which individual memory locations get updated.

  • So you still have a determinacy race here

  • in the writeMin operator.

  • But it's still better than a nondeterministic code

  • in that you always get the same BFS tree.

  • So how do you actually implement the writeMin operation?

  • So it turns out you can implement this using

  • a loop with a compare-and-swap.

  • So writeMin takes as input two arguments--

  • the memory address that we're trying to update

  • and the new value that we want to write to that address.

  • We're first going to set oldval equal to the value

  • at that memory address.

  • And we're going to check if newval is less than oldval.

  • If it is, then we're going to attempt

  • to do a compare-and-swap at that location,

  • writing newval into that address if its initial value

  • was oldval.

  • And if that succeeds, then we return.

  • Otherwise, we failed.

  • And that means that somebody else came in the meantime

  • and changed the value there.

  • And therefore, we have to reread the old value

  • at the memory address.

  • And then we repeat.

  • And there are two ways that this writeMin operator could finish.

  • One is if the compare-and-swap was successful.

  • The other one is if newval is greater than

  • or equal to oldval.

  • In that case, we no longer have to try to write anymore

  • because the value that's there is already at least as small

  • as what we're trying to write.
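
  • As a concrete reference, here's a minimal C sketch of writeMin along these lines; the use of GCC's __sync_bool_compare_and_swap builtin is an assumption, and any equivalent atomic CAS primitive would do.

    // Atomically write newval to *addr if it is smaller than the
    // value currently stored there, retrying on CAS failure.
    static inline void writeMin(long *addr, long newval) {
      long oldval = *addr;
      while (newval < oldval) {
        if (__sync_bool_compare_and_swap(addr, oldval, newval))
          return;            // our value was written; done
        oldval = *addr;      // someone else changed it; reread and retry
      }
      // newval >= oldval: the stored value is already small enough.
    }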

  • So I implemented an optimized version

  • of this deterministic parallel BFS code

  • and compared it to the nondeterministic version.

  • And it turns out on 32 cores, it's

  • only a little bit slower than the nondeterministic version.

  • So it's about 5% to 20% slower on a range of different input

  • graphs.

  • So this is a pretty small price to pay for determinism.

  • And you get many nice benefits, such as ease

  • of debugging and ease of reasoning about the performance

  • of your code.

  • Any questions?

  • OK, so let me talk about another optimization

  • for breadth-first search.

  • And this is called the direction optimization.

  • And the idea is motivated by how the sizes of the frontiers

  • change in a typical BFS algorithm over time.

  • So here I'm plotting the frontier size

  • on the y-axis in log scale.

  • And the x-axis is the iteration number.

  • And on the left, we have a random graph, on the right,

  • we have a power law graph.

  • And we see that the frontier size actually

  • grows pretty rapidly, especially for the power law graph.

  • And then it drops pretty rapidly.

  • So this is true for many of the real-world graphs

  • that we see because many of them look like power law graphs.

  • And in the BFS algorithm, most of the work

  • is done when the frontier is relatively large.

  • So most of the work is going to be

  • done in these middle iterations where

  • the frontier is very large.

  • And it turns out that there are two ways

  • to do breadth-first search.

  • One way is the traditional way, which

  • I'm going to refer to as the top-down method.

  • And this is just what we did before.

  • We look at the frontier vertices,

  • and explore all of their outgoing neighbors,

  • and mark any of the unexplored ones as explored,

  • and place them on to the next frontier.

  • But there's actually another way to do breadth-first search.

  • And this is known as the bottom-up method.

  • And in the bottom-up method, I'm going

  • to look at all of the vertices in the graph that

  • haven't been explored yet, and I'm

  • going to look at their incoming edges.

  • And if I find an incoming neighbor that's on the current frontier,

  • I can just say that that incoming neighbor is my parent.

  • And I don't even need to look at the rest

  • of my incoming neighbors.

  • So in this example here, vertices 9 through 12,

  • when they loop through their incoming edges,

  • they found an incoming neighbor on the frontier,

  • and they chose that neighbor as their parent.

  • And they get marked as explored.

  • And we can actually save some edge traversals here because,

  • for example, if you look at vertex 9,

  • and you imagine the edges being traversed

  • in a top-to-bottom manner, then vertex 9 is only

  • going to look at its first incoming edge

  • and find that incoming neighbor on the frontier.

  • So it doesn't even need to inspect

  • the rest of the incoming edges because all

  • we care about finding is just one parent in the BFS tree.

  • We don't need to find all of the possible parents.

  • In this example here, vertices 13 through 15 actually ended up

  • wasting work because they looked at all of their incoming edges.

  • And none of the incoming neighbors are on the frontier.

  • So they don't actually find a neighbor.

  • So the bottom-up approach turns out

  • to work pretty well when the frontier is large

  • and many vertices have already been explored.

  • Because in this case, you don't have to look at many vertices.

  • And for the ones that you do look at,

  • when you scan over their incoming edges,

  • it's very likely that early on, you'll

  • find a neighbor that is on the current frontier,

  • and you can skip a bunch of edge traversals.

  • And the top-down approach is better

  • when the frontier is relatively small.

  • And in a 2012 paper, Scott Beamer

  • actually studied the performance

  • of these two approaches in BFS.

  • And this plot here plots the running time

  • versus the iteration number for a power law graph

  • and compares the performance of the top-down and bottom-up

  • approach.

  • So we see that for the first two steps,

  • the top-down approach is faster than the bottom-up approach.

  • But then for the next couple of steps,

  • the bottom-up approach is faster than a top-down approach.

  • And then when we get to the end, the top-down approach

  • becomes faster again.

  • So the top-down approach, as I said,

  • is more efficient for small frontiers,

  • whereas a bottom-up approach is more

  • efficient for large frontiers.

  • Also, I want to point out that in the top-down approach, when

  • we update the parent array, that actually has to be atomic.

  • Because we can have multiple vertices trying

  • to update the same neighbor.

  • But in a bottom-up approach, the update to the parent array

  • doesn't have to be atomic.

  • Because we're scanning over the incoming neighbors

  • of any particular vertex v serially.

  • And therefore, there can only be one processor

  • that's writing to parent of v.
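
  • To make the contrast concrete, here's a C sketch of the two updates; the array and helper names are illustrative assumptions, including the dense onFrontier array.

    // Top-down: frontier vertices may race on parent[ngh], so the
    // update must be an atomic compare-and-swap from the -1 sentinel.
    if (parent[ngh] == -1 &&
        __sync_bool_compare_and_swap(&parent[ngh], -1, v))
      addToNextFrontier(ngh);          // hypothetical helper

    // Bottom-up: each unexplored v scans its incoming neighbors
    // serially, so only one processor ever writes parent[v].
    cilk_for (long v = 0; v < n; v++) {
      if (parent[v] == -1) {
        for (long j = InOffsets[v]; j < InOffsets[v + 1]; j++) {
          if (onFrontier[InEdges[j]]) {  // dense frontier array
            parent[v] = InEdges[j];      // plain write; no atomics needed
            addToNextFrontier(v);
            break;                       // one parent suffices; stop early
          }
        }
      }
    }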

  • So we choose between these two approaches based

  • on the size of the frontier.

  • We found that a threshold of a frontier size of about n/20

  • works pretty well in practice.

  • So if the frontier has more than n/20 vertices,

  • we use a bottom-up approach.

  • And otherwise, we use a top-down approach.
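
  • In code, this heuristic is just a per-iteration branch; here's a sketch with hypothetical step functions.

    // Pick the traversal direction each iteration based on the
    // frontier size, using the n/20 threshold from above.
    if (frontierSize > n / 20)
      bottomUpStep();    // large frontier: scan unexplored vertices
    else
      topDownStep();     // small frontier: expand frontier's out-edges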

  • You can also use more sophisticated thresholds,

  • such as also considering the sum of out-degrees,

  • since the actual work is dependent on the sum

  • of out-degrees of the vertices on the frontier.

  • You can also use different thresholds

  • for going from top-down to bottom-up and then

  • another threshold for going from bottom-up back to top-down.

  • And in fact, that's what the original paper did.

  • They had two different thresholds.

  • We also need to generate the inverse graph,

  • or the transposed graph, if we're using this method

  • on a directed graph.

  • Because if the graph is directed,

  • in the bottom-up approach, we actually

  • need to look at the incoming neighbors, not

  • the outgoing neighbors.

  • So if the graph wasn't already symmetrized,

  • then we have to generate both the incoming neighbors

  • and outgoing neighbors for each vertex.

  • So we can do that as a pre-processing step.

  • Any questions?

  • OK, so how do we actually represent the frontier?

  • So one way to represent the frontier

  • is just use a sparse integer array,

  • which is what we did before.

  • Another way to do this is to use a dense array.

  • So, for example, here I have an array of bytes.

  • The array is of size n, where n is the number of vertices.

  • And I have a 1 in position i if vertex i

  • is on the frontier and 0 otherwise.

  • I can also use a bit vector to further compress this

  • and then use additional bit level operations to access it.

  • So for the top-down approach, a sparse representation

  • is better because the top-down approach usually

  • deals with small frontiers.

  • And if we use a sparse array, we only

  • have to do work proportional to the number of vertices

  • on the frontier.

  • And then in the bottom-up approach,

  • it turns out that a dense representation is better

  • because we're looking at most of the vertices anyways.

  • And then we need to switch between these two methods based

  • on the approach that we're using.
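
  • The conversion between the two representations is a simple pass; here's a sketch with illustrative names (a real parallel dense-to-sparse conversion would use a prefix sum to compute output positions, which is omitted here).

    // Sparse -> dense: set one byte per frontier vertex, in parallel.
    cilk_for (long i = 0; i < frontierSize; i++)
      dense[frontier[i]] = 1;

    // Dense -> sparse: collect the set positions (shown serially).
    long k = 0;
    for (long v = 0; v < n; v++)
      if (dense[v]) sparse[k++] = v;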

  • So here's some performance numbers comparing the three

  • different modes of traversal.

  • So we have bottom-up, top-down, and then

  • the direction optimizing approach

  • using a threshold of n/20.

  • First of all, we see that the bottom-up approach

  • is the slowest for both of these graphs.

  • And this is because it's doing a lot of wasted work

  • in the early iterations.

  • We also see that the direction optimizing approach is always

  • faster than both the top-down and the bottom-up approach.

  • This is because if we switch to the bottom-up approach

  • at an appropriate time, then we can

  • save a lot of edge traversals.

  • And, for example, you can see for the power law graph,

  • the direction optimizing approach

  • is almost three times faster than the top-down approach.

  • The benefits of this approach are highly

  • dependent on the input graph.

  • So it works very well for power law and random graphs.

  • But if you have graphs where the frontier size is always small,

  • such as a grid graph or a road network,

  • then you would never use a bottom-up approach.

  • So this wouldn't actually give you any performance gains.

  • Any questions?

  • So it turns out that this direction optimizing

  • idea is more general than just breadth-first search.

  • So a couple years ago, I developed

  • this framework called Ligra, where I generalized

  • the direction optimizing idea to other graph algorithms,

  • such as betweenness centrality, connected components, sparse

  • PageRank, shortest paths, and so on.

  • And in the Ligra framework, we have an EDGEMAP operator

  • that chooses between a sparse implementation

  • and a dense implementation based on the size of the frontier.

  • So the sparse here corresponds to the top-down approach.

  • And dense corresponds to the bottom-up approach.

  • And it turns out that using this direction optimizing

  • idea for these other applications

  • also gives you performance gains in practice.

  • OK, so let me now talk about another optimization, which

  • is graph compression.

  • And the goal here is to reduce the amount of memory usage

  • in the graph algorithm.

  • So recall, this was our CSR representation.

  • And in the Edges array, we just stored

  • the values of the edge targets.

  • Instead of storing the actual targets,

  • we can actually do better by first sorting the edges so

  • that they appear in non-decreasing order

  • and then just storing the differences

  • between consecutive edges.

  • And then for the first edge for any particular vertex,

  • we'll store the difference between the target

  • and the source of that edge.

  • So, for example, here, for vertex 0,

  • the first edge is going to have a value of 2

  • because we're going to take the difference between the target

  • and the source.

  • So 2 minus 0 is 2.

  • Then for the next edge, we're going

  • to take the difference between the second edge

  • and the first edge, so 7 minus 2, which is 5.

  • And then similarly we do that for all of the remaining edges.

  • Notice that there are some negative values here.

  • And this is because the target is smaller than the source.

  • So in this example, 1 is smaller than 2.

  • So if you do 1 minus 2, you get negative 1.

  • And this can only happen for the first edge

  • for any particular vertex because for all

  • the other edges, we're encoding the difference

  • between that edge and the previous edge.

  • And we already sorted these edges

  • so that they appear in non-decreasing order.
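
  • Here's a minimal sketch of computing these differences for one vertex's sorted edge list; Compressed is a hypothetical intermediate array holding the deltas before they get variable-length coded.

    // Difference-encode the sorted edge list of vertex v.
    long start = Offsets[v], end = Offsets[v + 1];
    if (start < end)
      Compressed[start] = Edges[start] - v;     // first edge is relative to
                                                // the source; may be negative
    for (long j = start + 1; j < end; j++)
      Compressed[j] = Edges[j] - Edges[j - 1];  // non-negative, since sorted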

  • OK, so this compressed edges array

  • will typically contain smaller values

  • than this original edges array.

  • So now we want to be able to use fewer bits

  • to represent these values.

  • We don't want to use 32 or 64 bits like we did before.

  • Otherwise, we wouldn't be saving any space.

  • So one way to reduce the space usage

  • is to store these values using what's called a variable length

  • code or a k-bit code.

  • And the idea is to encode each value in chunks of k bits,

  • where for each chunk, we use k minus 1 bits for the data and 1

  • bit as the continue bit.

  • So for example, let's encode the integer 401

  • using 8-bit or byte codes.

  • So first, we're going to write this value out in binary.

  • And then we're going to take the bottom 7 bits,

  • and we're going to place that into the data

  • field of the first chunk.

  • And then in the last bit of this chunk,

  • we're going to check if we still have any more

  • bits that we need to encode.

  • And if we do, then we're going to set a 1 in the continue bit

  • position.

  • And then we create another chunk.

  • We'll place the next 7 bits into the data

  • field of that chunk.

  • And then now we're actually done encoding this integer value.

  • So we can place a 0 in the continue bit.

  • So that's how the encoding works.

  • And decoding is just doing this process backwards.

  • So you read chunks until you find a chunk with a 0

  • continue bit.

  • And then you shift all of the data values

  • left accordingly and sum them together

  • to reconstruct the integer value that you encoded.
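
  • Here's a minimal C sketch of the byte-code case (k = 8); putting the continue bit in the high bit of each byte is an assumption, and handling the possibly negative first value would need an extra sign bit, omitted here. For 401 (binary 110010001), the first byte holds the low 7 bits (0010001 = 17) with the continue bit set, the second holds the remaining bits (11 = 3) with it clear, and decoding reconstructs 17 + (3 << 7) = 401.

    #include <stdint.h>

    // Encode value into out; returns the number of bytes written.
    int byteEncode(uint64_t value, uint8_t *out) {
      int k = 0;
      do {
        uint8_t chunk = value & 0x7F;   // bottom 7 data bits
        value >>= 7;
        if (value) chunk |= 0x80;       // more chunks follow: set continue bit
        out[k++] = chunk;
      } while (value);
      return k;
    }

    // Decode one value from in, advancing *pos past the chunks read.
    uint64_t byteDecode(const uint8_t *in, long *pos) {
      uint64_t value = 0;
      int shift = 0;
      uint8_t chunk;
      do {
        chunk = in[(*pos)++];
        value |= (uint64_t)(chunk & 0x7F) << shift;  // shift data into place
        shift += 7;
      } while (chunk & 0x80);           // stop at a clear continue bit
      return value;
    }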

  • One performance issue that might occur here

  • is that when you're decoding, you

  • have to check this continue bit for every chunk

  • and decide what to do based on that continue bit.

  • And this is actually an unpredictable branch.

  • So you can suffer from branch mispredictions

  • from checking this continue bit.

  • So one way you can optimize this is to get rid of these

  • continue bits.

  • And the idea here is to first figure out

  • how many bytes you need to encode

  • each integer in the sequence.

  • And then you group together integers

  • that require the same number of bytes to encode.

  • Then you use a run-length encoding idea to encode all of these

  • integers together by using a header byte, where in the header byte,

  • you use the lower 6 bits to store the size of the group

  • and the highest 2 bits to store the number of bytes

  • needed to decode each of these integers.

  • And now all of the integers in this group

  • will just be stored after this header byte.

  • And we'd know exactly how many bytes are needed to decode them.

  • So we don't need to store a continue bit in these chunks.

  • This does slightly increase the space usage.

  • But it makes decoding cheaper because we no longer have

  • to suffer from branch mispredictions

  • from checking this continue bit.
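
  • Here's a sketch of decoding one such group; the exact header layout (low 6 bits for the group size, high 2 bits for bytes-per-integer minus 1) is an assumption consistent with the description above.

    // Decode one group: a header byte followed by groupSize integers,
    // each stored in a fixed number of bytes with no continue bits.
    void decodeGroup(const uint8_t *in, long *pos, uint64_t *out) {
      uint8_t header = in[(*pos)++];
      int groupSize = header & 0x3F;       // low 6 bits: size of the group
      int bytesPer  = (header >> 6) + 1;   // high 2 bits: 1 to 4 bytes each
      for (int i = 0; i < groupSize; i++) {
        uint64_t value = 0;
        for (int b = 0; b < bytesPer; b++) // fixed-size little-endian read
          value |= (uint64_t)in[(*pos)++] << (8 * b);
        out[i] = value;                    // no branch on a continue bit
      }
    }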

  • OK, so now we have to decode these edge lists on the fly

  • as we're running our algorithm.

  • If we decoded everything at the beginning,

  • we wouldn't actually be saving any space.

  • We need to decode these edges as we access them

  • in our algorithm.

  • Since we encoded all of these edge

  • lists separately for each vertex,

  • we can decode all of them in parallel.

  • And each vertex just decodes its edge list sequentially.

  • But what about high-degree vertices?

  • If you have a high-degree vertex,

  • you'd still have to decode its edge list sequentially.

  • And if you're running this in parallel,

  • this could lead to load imbalance.

  • So one way to fix this is, instead of just encoding

  • the whole thing sequentially, you can chunk it up

  • into chunks of size T. And then for each chunk,

  • you encode it like you did before,

  • where you store the first value relative to the source vertex

  • and then all of the other values relative to the previous edge.

  • And now you can actually decode the first value

  • here for each of these chunks all in parallel

  • without having to wait for the previous edge to be decoded.

  • And then this gives us much more parallelism

  • because all of these chunks can be decoded in parallel.

  • And we found that a value of T-- where T is the chunk size--

  • between 100 and 10,000 works pretty well in practice.
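
  • Here's a sketch of the chunked decoding, reusing the byteDecode routine from before; chunkStart, degree, and process are hypothetical names, and the sign handling for each chunk's first value is again omitted.

    // Decode vertex v's edge list in parallel across chunks of size T.
    // chunkStart[c] holds the starting byte offset of chunk c (recorded
    // at encoding time); each chunk's first value is relative to v,
    // so the chunks can be decoded independently.
    const long T = 1000;                    // 100 to 10,000 works well
    long deg = degree(v);
    long numChunks = (deg + T - 1) / T;
    cilk_for (long c = 0; c < numChunks; c++) {
      long pos = chunkStart[c];
      long prev = v;                        // first value is relative to v
      long len = (deg - c * T < T) ? deg - c * T : T;
      for (long i = 0; i < len; i++) {
        prev += byteDecode(encoded, &pos);  // running sum of the deltas
        process(v, prev);                   // hypothetical per-edge callback
      }
    }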

  • OK, so I'm not going to have time

  • to go over the experiments.

  • But at a high level, the experiments

  • show that compression schemes do save space.

  • And serially, it's only slightly slower

  • than the uncompressed version.

  • But surprisingly, when you run it in parallel,

  • it actually becomes faster than the uncompressed version.

  • And this is because these graph algorithms are memory bound.

  • And since we're using less memory,

  • we can alleviate this memory subsystem bottleneck

  • and get better scalability.

  • And the decoding part of these compressed algorithms

  • actually gets very good parallel speedup

  • because they're just doing local operations.

  • OK, so let me summarize now.

  • So we saw some properties of real-world graphs.

  • We saw that they're quite large, but they can still

  • fit on a multi-core server.

  • And they're relatively sparse.

  • They also have a power law degree distribution.

  • Many graph algorithms are irregular in that they involve

  • many random memory accesses.

  • So that becomes a bottleneck of the performance

  • of these algorithms.

  • And you can improve performance with algorithmic optimization,

  • such as using this direction optimization

  • and also by creating and exploiting

  • locality, for example, by using this bit vector optimization.

  • And finally, optimizations for graphs

  • might work well for certain graphs,

  • but they might not work well for other graphs.

  • For example, the direction optimization idea

  • works well for power law graphs but not for road graphs.

  • So when you're trying to optimize your graph algorithm,

  • you should definitely test it on different types of graphs

  • and see where it works well and where it doesn't work.

  • So that's all I have.

  • If you have any additional questions,

  • please feel free to ask me after class.

  • And as a reminder, we have a guest lecture on Thursday

  • by Professor Johnson of the MIT Math Department.

  • And he'll be talking about high-level languages,

  • so please be sure to attend.
