We're now going to take up a particular problem that has a very non-trivial solution. We assume the stream elements are bits, 0's and 1's, and we want to know how many bits in the last N are 1. If we can store the most recent N bits, we can use a solution like the one discussed for averages. In fact the algorithm would be even simpler. However, we're going to address the situation where N bits don't fit in main memory. Perhaps N is a trillion, or N is a reasonable number but there are so many streams that we can't store complete windows for all of them. If we want exact answers, then we can show that it is impossible to do anything better than to store the entire window. However, what is interesting is that we can store on the order of log²N bits, where N is the window size, and still answer queries about the count of 1's with answers that are off by at most a small factor, as small as we like if we do enough work. The problem we're going to discuss is this. We're given a stream of 0's and 1's. At any time we have to be prepared to answer a query of the form: how many of the last k bits were 1? Here k can be any integer from 1 up to some large upper limit N. We can surely do this if we use a window of size N and store the last N bits. When a new bit arrives, we throw away the oldest bit in the window, since it can never again be useful to answer one of these queries. But one disadvantage of this approach is that answering one query requires that we examine k bits. Since k can be quite large, and both inputs and queries may be arriving very rapidly, that may be time we cannot afford. Another potential problem is that we may not be able to afford the space. As we just mentioned, we could be trying to handle a large number of streams, or N could be so large that even one window does not fit in main memory. Both these concerns suggest that we should consider a method that uses less than N space, and that also allows us to answer queries about the last k bits much faster than on the order of k. It turns out that we can't get an exact answer to queries without using all N bits of the window. But we can get close using much less space than N, and also much less time than k. We're going to introduce the right algorithm with a discussion of something that seems like it should work but doesn't quite. Our goal will be to be off by no more than a factor of 2 in estimating the number of 1's in the last k bits. We will summarize blocks of the stream, where blocks have exponentially increasing lengths, that is, 1, 2, 4, 8, 16, and so on. The summary of a block will be simply the count of the number of 1's in that block. When we want to know the count of 1's for the last k bits, we find blocks that lie wholly within the last k bits and add up their counts. It is only the last block, the one furthest back in time, that gives us pause. We don't know how many of its 1's are within the last k bits, so we have to guess. But if we created these exponentially growing blocks for all time units, then there would be as many blocks of length 1 as there are bits in the window, as well as blocks of size 2, 4, 8, and so on. So that saves us nothing. Instead, we drop a block if its left end, that is, the end that is earliest in time, coincides with the left end of a larger block. And we also drop a small block if there is a larger block completely to its right, that is, later in the stream.
As a result, we never have more than two blocks of any one size. So here is an example of the blocks we might retain at some time. The five rows of blocks are of lengths 1, 2, 4, 8 and 16. There are two blocks of length 1. The more recent has a count of 0 because it consists of a single 0, while the other has a count of 1 because it consists of a single 1. Here's a block of length 2 that has a count of 1 because it represents 0 1. Notice that we've previously deleted the block of length 1 that would go here, because it begins at the same point as the block of length 2 above it. All other blocks of length 1 are also deleted, because they have a block of length 2 completely to their right. We also show a second block of length 2. Its count is 2 because it represents 1 1. There are two blocks of length 4, and they have counts of 2 and 3. One represents the sequence 0 0 1 1, so it has two 1's; the other represents 1 0 1 1 and therefore gets a count of 3. We see one block of length 8. Its count is 4, because it represents these eight bits. Notice that the count for a second block of length 8 is not needed, because we can figure out that it has six 1's: that 6 is the difference between the number of 1's in the block of length 16 and the block of length 8, that is, 10 minus 4 equals 6. So if this block existed, it would surely have six 1's. Now, suppose we get a query for how many 1's there are in the most recent 28 bits. We can add up the counts of certain blocks: some little blocks at the right end, and then some bigger blocks going to the left. We want to pick blocks so that each of the most recent 28 bits is covered by exactly one of the blocks we choose. So we pick the block of length 1, the block of length 2, the block of length 4, and the block of length 16; we don't want the block of length 8, because the block of length 16 is still all within the last 28 bits. So far we have covered 23 bits, and we know that among them the number of 1's is 0 plus 1 plus 2 plus 10, which is 13. But what do we do about the oldest five bits? If we could see those bits, we would know that they are 0 0 1 0 1 and therefore have two 1's, but we don't see them. All we see is that they are part of a block of 16, and we know that block has a count of 6, but we can't tell how many of those six 1's are in the most recent five positions of the block. If we could see them, of course, we would know there were two, and that the right answer is 15. But we need to estimate, without seeing them, how many 1's there are in this region. If we guess that half the count of the block, that is 6 divided by 2, or 3, is in the region we don't see, then we would guess 16, and that's only off by about 7%, so that's not even bad. We could even try a proportional guess: 6 divided by 16 is the probability that any given bit is 1 in the range represented by the block of 16, and we have to account for five of those bits, so that's 30 divided by 16, which is roughly 2. If we guess 2 and add it in, we get 15, and that happens to be right on the mark, even though we didn't get to see the five bits we wanted to count.
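To make that arithmetic concrete, here is a minimal sketch in Python of this query estimate. Everything in it is illustrative, not from any library: the function estimate_ones and the representation of blocks as (length, count) pairs, most recent first and already chosen to cover the recent stream disjointly, are assumptions of this sketch.

```python
# Query estimate for the fixed-length-block scheme.
# Blocks are hypothetical (length, ones_count) pairs, most recent first,
# already chosen so each of the last k bits is covered exactly once.

def estimate_ones(blocks, k, proportional=False):
    """Estimate the number of 1's among the last k bits."""
    covered = 0   # bits accounted for so far
    total = 0.0   # 1's counted so far
    for length, ones in blocks:
        if covered + length <= k:
            # Block lies wholly within the last k bits: count it exactly.
            covered += length
            total += ones
        else:
            # Block straddles the boundary of the range: we must guess.
            remaining = k - covered
            if proportional:
                total += ones * remaining / length
            else:
                total += ones / 2
            break
    return total

# The blocks from the example: lengths 1, 2, 4, 16 cover 23 bits with
# 0 + 1 + 2 + 10 = 13 ones; a block of length 16 with 6 ones straddles.
blocks = [(1, 0), (2, 1), (4, 2), (16, 10), (16, 6)]
print(estimate_ones(blocks, 28))                     # 13 + 6/2 = 16.0
print(estimate_ones(blocks, 28, proportional=True))  # 13 + 5*6/16 = 14.875, roughly 15
```

The half guess gives 16 and the proportional guess gives roughly 15, just as worked out above.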
This strategy has a lot to recommend it. First, it stores only on the order of log²N bits. I might comment that we use the expression log²N to mean the square of log N. This is a common expression. You don't want to write it as log N², because that's really 2 log N, which is not what we mean. If you've never seen this notation before, putting the square on the log means that you're squaring the whole thing, log N. Now, storing on the order of log²N bits is not that bad; it's much less than N for large N. If N is a billion, then log N is about 30, so log²N is about 900. Now why do we need only on the order of log²N bits? Well, first of all, if the window size is N bits, we never need any blocks of length greater than N. A count up to N can be stored in log base 2 of N bits. Now how many counts do we need? There are only log base 2 of N block sizes: 1, 2, 4, 8, 16, and so on, up to the largest power of 2 that is no larger than N. And we never store more than two blocks of any size. As a result, we need to store at most 2 log N counts, of at most log N bits each, and that's 2 log²N bits. Another good thing is that after each bit we do a limited amount of work. We have to create a new block of length 1, and for each of the lengths 1, 2, 4, 8, and so on, we may have to drop a block of that length, or we may have to combine two blocks of one length into one block of the next larger length. But that means at most order log N work in total, since there are log N sizes. And the error is frequently small. It can't be bigger than the count of the biggest block, the one that is only partially in the region we're counting. There's a problem with the scheme, however. When the 1's are distributed evenly among all the regions of the stream, the number of 1's in the ambiguous region can't be more than half the total number of 1's in the region we want to count, so our error is limited to 50%. But look what happens if all the 1's in the region we want to count are at the left end, and in particular are counted only by a block that is partially within the desired region. Then the true count could be anything from 0 up to the full count of that block. Anything we guess could be wildly wrong, and we'll never know. We're therefore going to discuss a similar algorithm that preserves the good and avoids the problem with uneven distribution of 1's. We'll still divide the window into blocks, but instead of letting each block cover a fixed-length segment of the stream, we'll let each block cover a fixed number of 1's. The sizes of the blocks will still be limited to the powers of 2, that is, 1, 2, 4, 8, and so on, but the notion of the size of a block changes. Now the size of a block will be its number of 1's. So we'll have blocks of size 1 to represent segments of the stream that have a single 1. Blocks of twice that size will represent two 1's and any number of 0's. Then there will be blocks representing four 1's and any number of 0's, and so on. The advantage of this scheme is that if there are few 1's in the recent stream, the blocks covering that region will stay small: they will cover large parts of the stream, while their size, or number of 1's, remains limited. I decided to call the algorithm I'm going to describe the DGIM algorithm. The initials refer to the four people who invented it: Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani.
And in fact, this is a good time to stop and remember Rajeev Motwani, who died shortly after this algorithm was published. He, along with Gionis and Indyk, is also responsible for locality sensitive hashing, which forms a major part of this course. Like our earlier attempt at an algorithm, DGIM stores on the order of log²N bits to represent a window of N bits. There's an absolute guarantee of no more than 50% error in the answer to any query. And if 50% is too much, you can reduce the error to anything greater than 0. The algorithm becomes more complicated and the number of bits you need to store grows, although the number of bits remains proportional to log²N; it's just the constant factor that grows, in inverse proportion to the desired error bound. Okay, to begin the story we need to introduce the idea of a timestamp. Every bit that arrives in the stream gets a timestamp. You might think that we need an arbitrary number of bits to represent the timestamp, since there's no limit on how long the stream can be. But it's really only necessary to represent timestamps modulo N, the window size; that is, we can divide the timestamp by N and take the remainder. The net effect is that timestamps start out at 0, 1, and so on up to N minus 1, and then go to 0 again, 1, 2, and so on. Regardless of where the window is in the stream, its N bits will all have different timestamps. We're going to partition the window of length N into buckets. Each bucket is represented by a record, and a record can be stored in on the order of log N bits. As we shall see, we only need on the order of log N buckets to represent the window, so on the order of log²N bits suffices. The record contents are the following. First, the timestamp of the bucket's end, that is, its most recently arrived bit. As I mentioned, we record timestamps modulo N, so we need log N bits to represent the timestamp. Second, the number of 1's between the beginning and the end of the segment. We call this count of 1's the size of the bucket. However, the number of 1's in the segment represented by a bucket must be a power of 2. That explains why we only need log log N bits to represent the count of 1's: we can store the logarithm of the count instead of the count itself, since we know that the log base 2 of the count must be an integer. The count itself can't be higher than N, so its logarithm can't be higher than log base 2 of N. Since the logarithm is an integer i, and we only need log i bits to represent i in binary, log log N bits suffice. It really doesn't matter much, since we still need order log N bits in the record for the bucket just to store the timestamp of its end. The partition into buckets must obey the following rules. There must be one or two buckets of each allowed size, up to the maximum size we need; remember that the allowed sizes are the powers of 2. No bit of the window is part of two buckets. Some 0's in the stream may not belong to any bucket; that doesn't matter. Buckets can only increase in size as we go back in time; the most recent part of the window is represented by the smallest buckets. And when the end timestamp of a bucket is more than N time units in the past, it no longer represents part of the window, so we delete it from the set of buckets whose records are stored.
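As a concrete illustration, here is what one bucket record might look like in Python. This is a sketch under the assumptions just described, not the authors' published code; the names Bucket, end_time, and log_size are invented for this example.

```python
from dataclasses import dataclass

N = 1_000_000_000  # window size; all timestamps are kept modulo N

@dataclass
class Bucket:
    end_time: int   # timestamp of the bucket's most recent bit, mod N
    log_size: int   # log base 2 of the size (the count of 1's)

    @property
    def size(self) -> int:
        # Sizes are always powers of 2, so storing the exponent suffices.
        return 1 << self.log_size

# A bucket of size 4 whose most recent 1 arrived at time 1234:
b = Bucket(end_time=1234 % N, log_size=2)
print(b.size)  # 4
```

The end_time field needs log N bits and log_size needs only log log N bits, matching the accounting above.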
Here is a picture of what the partition of a stream into buckets might look like at some point. The most recent two 1's are in buckets of size 1 by themselves. Further back, the previous two 1's are grouped into a bucket of size 2. Note that there might be two buckets of size 2, but there could also be only one, as in this case. Then, going further back in time, we see the previous four 1's in a bucket of size 4, and the four 1's before that are also in a bucket of size 4. Then we see two buckets of size 8, and finally a bucket of size 16. The end timestamp of this last bucket is still within the window of length N, although its beginning is outside the window. We still need this bucket, but any previous buckets have an end timestamp that is prior to the beginning of the current window, so we have deleted their records. So let's see how we manage the buckets as bits arrive on the stream. The first thing we do is worry about whether we need to drop the oldest bucket. We keep, outside the bucket representation, the count of the number of bits that have ever arrived in the stream. However, we only need this count modulo N, so an extra log N bits is all we need. When a new bit comes in, we increment that count. Of course, if the count reaches N, then we set it back to 0; that's how modular arithmetic works. Now, look at the end timestamp of the oldest bucket. If that timestamp agrees with the current time, then it is really the current time minus N, since we're computing all timestamps modulo N. The entire oldest bucket is therefore out of the window, and we delete its record. But if the timestamp is anything else, then the oldest bucket still has its end within the window, so it remains. What we do next depends on whether the bit that just entered is 0 or 1. If it's 0, then we make no further changes to the set of buckets. That was easy. If the current input is 1, we have some work to do, but the work is at most logarithmic in the window size N. First we create a new bucket for the new bit. The size of the bucket is 1, and its ending timestamp is the current time. There might have been one or two buckets of size 1 previously. If there was only one, now there are two, and that's fine; we are allowed to have one or two of any size. But if there were previously two, now there are three. We can't have three buckets of size 1, so we combine the oldest two into one bucket of size 2. Combining consecutive buckets of the same size is easy: we add 1 to the logarithm of the size, and we take the end timestamp to be the end timestamp of the more recent of the two. For example, given two consecutive buckets of some size, say 2^x, we combine them into one bucket of size 2^(x+1): the combined bucket gets the ending timestamp of the more recent bucket, and adding 1 to the logarithm of the size says the size has doubled; the older bucket's record then goes away. But our work might not be over. If we had to create a bucket of size 2, we might now have three of that size, so we combine the earliest two into one bucket of size 4. And the problem could ripple through the sizes: if we just created a third bucket of size 4, we need to combine the earliest two into a bucket of size 8, and so on. But because we're doubling the bucket size each time we pass the problem to the next level, after log N fix-ups we've reached a bucket size as large as the entire window, and there's never need for a larger bucket. The rippling effect therefore stops after at most log N rounds, and each round takes a constant amount of work. So O(log N) is a guaranteed upper bound on the total time needed to process an incoming 1.
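Here is a sketch of that update procedure, reusing the hypothetical Bucket record from above. It keeps the buckets in a plain Python list, most recent first; a real implementation would track buckets by size rather than rescanning the list, but the ripple logic is the same.

```python
def update(buckets, current_time, bit, N):
    """Process one arriving bit with timestamp current_time (mod N)."""
    # 1) If the oldest bucket's end timestamp equals the current time
    #    modulo N, that bucket ended N steps ago: delete its record.
    if buckets and buckets[-1].end_time == current_time % N:
        buckets.pop()

    # 2) A 0 requires no further changes to the buckets.
    if bit == 0:
        return

    # 3) A 1 becomes a new bucket of size 1 (log_size = 0).
    buckets.insert(0, Bucket(end_time=current_time % N, log_size=0))

    # 4) Ripple: while three buckets share a size, combine the two
    #    oldest of them into one bucket of twice that size.
    log_size = 0
    while True:
        same = [i for i, b in enumerate(buckets)
                if b.log_size == log_size]
        if len(same) < 3:
            break
        # same[-2] is the more recent of the two oldest; it keeps its
        # end timestamp and its size doubles. The oldest record dies.
        buckets[same[-2]].log_size += 1
        del buckets[same[-1]]
        log_size += 1
```

Each round of the while loop does a constant amount of merging work and the size doubles every round, which is exactly the O(log N) bound just argued.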
In practice, the time required is usually much less, and on the average it is constant. On this slide, we see the changes that occur as bits enter the system. Here's the initial state of the window. A 1 enters, and we create a bucket of size 1 for it. But now we have three buckets of size 1, so we have to combine the two earliest of them. Here we've done the combination. What has happened in terms of the records is that the record for the older bucket is deleted, while the size in the surviving record has changed from 1 to 2; its timestamp has not changed. Next, I'm showing what happens after another 1 0 1 arrives. The first of these 1's created a new bucket of size 1; then the 0 came in, and nothing changed; and then the next 1 arrived, giving us a third bucket of size 1. That causes the two earliest buckets of size 1 to get combined into a bucket of size 2. But now we have three buckets of size 2, so we combine the earliest two of those into a bucket of size 4. That made three buckets of size 4, so the earliest two of those got combined into a bucket of size 8, but that was a third bucket of size 8. So the earliest two buckets of size 8 got combined into a bucket of size 16. Now there can't be more than two buckets of size 16; there is one other, but it extends beyond the end of the window. So we're done rippling changes to larger and larger buckets. Now I want to explain how to query the system. Suppose we want to know how many 1's there are in the last k bits, where k is any integer less than or equal to N, the window size. The first thing we do is ignore all buckets whose ending timestamp is more than k bits prior to the current time. Those buckets are entirely outside the range we want to count, so they make no contribution. We start by summing the sizes of all the remaining buckets except the oldest bucket that is still in the range we are interested in. Then we add half the size of that oldest bucket. The reason we only add half the oldest bucket's size is that we really don't know how many 1's from that bucket are still within the range of interest. By guessing half, we minimize the maximum error, as we'll discuss next. So here is why the estimate can't be off by more than 50% from the true answer. Suppose that the oldest bucket in the range we're interested in has size 2^i. We assumed that half of its 1's, that is 2^(i-1), are among the most recent k bits to arrive. The true number could be anything between 1 and 2^i, so our error is upper bounded by 2^(i-1). Now, what's the smallest the true answer could be? There is at least one bucket of each of the sizes less than 2^i that lies completely within the last k bits. These account for at least 1 plus 2 plus 4 and so on, up to 2^(i-1); that sum is 2^i minus 1. Now we add 1 for the 1 that is at the end of the oldest bucket: that bucket has an ending timestamp that's within range, and buckets always end in a 1. So there are at least 2^i 1's within the range, and since our error is no more than 2^(i-1), that error is at most 50%.
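Putting the query rule together with the bucket list maintained by the update() sketch above, the estimate might look like this. Again, the names and structure are illustrative, not the published code.

```python
def estimate_last_k(buckets, current_time, k, N):
    """Estimate the number of 1's among the last k bits (1 <= k <= N)."""
    in_range = []
    for b in buckets:  # most recent first
        # Age of the bucket's last bit, computed modulo N.
        if (current_time - b.end_time) % N >= k:
            break  # this bucket, and all older ones, ended out of range
        in_range.append(b)
    if not in_range:
        return 0
    # Full sizes of all but the oldest in-range bucket, plus half the
    # size of that oldest bucket, whose overlap with the range is unknown.
    total = sum(b.size for b in in_range[:-1])
    return total + in_range[-1].size / 2
```

For an oldest in-range bucket of size 2^i, the half guess contributes 2^(i-1), which is exactly the worst-case error term in the argument above.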
We're not going to discuss the extensions here, but it is possible to modify the algorithm described to limit the error to any fraction we like greater than 0, while still using only on the order of log²N bits to represent all the buckets we need. The textbook describes how to do this.