[MUSIC PLAYING] DAVID MALAN: All right, this is CS50. And this is our look today at data structures. You'll recall last week that we gave ourselves a few new ingredients and some new syntax in C. We introduced pointers, the ability to address chunks of memory by actual addresses-- 0, 1, 2, 3 and on up. And we introduced the star notation and a few other functions, malloc among them, free among them, so that you can actually manage your computer's memory. Just to hammer this home, let's take a look at just this small example that from the outset is buggy. This code is buggy. And we'll see in just a moment why. But let's just walk through it step by step. This first line highlighted in yellow in English is doing what as you understand it now? If you're a little rusty since last week, what's that first line of code doing in English? Anything? Yeah. AUDIENCE: It's creating a pointer to an int named x. DAVID MALAN: Perfect. It's creating a pointer to an integer and calling that pointer or that variable x. And let me propose that the next line is doing the same thing, giving us another pointer to an integer but this time calling it y. The third line of code, someone else, what's happening in English with this line here? Yeah. AUDIENCE: It creates memory the size of an int and assigns it to x, but it's more difficult to execute if it's not a pointer, I don't know. DAVID MALAN: This is not the bug. But the first part is correct. malloc is the function we introduced last week that allocates memory for you. It takes one argument, the number of bytes that you want to allocate. Even if you don't recall how many bytes you need for an integer, you can call this other operator, sizeof, that we saw briefly last week, that will return, in this case, 4 most likely depending on the computer that you're on. So this is saying, hey, computer give me 4 bytes of memory. 
And it returns that chunk of memory to you, conventionally by way of the first address, so 0x something, wherever those 4 bytes happen to start. And then it stores that address in x, which is in fact OK, because as you noted initially, x is in fact a pointer. That is an address. So all this is doing is it's declaring a variable called x and storing in it ultimately the address of a legitimate chunk of memory. You wouldn't typically allocate an int like this. You would allocate an int with just int and semicolon just like in week 1. But now that we have the ability to allocate addresses and allocate memory, you could achieve the same idea here. This line here now, the fourth line of code, says what in English? Star x equals 42 semicolon. What's going on there? Yeah. AUDIENCE: It goes to the address in x and then sets it to 42. DAVID MALAN: Good. It goes to the address in x and sets it to 42. So the star operator is the dereference operator, which is a fancy way of saying go to that address. And what do you want to do? Well, per week 1 when we discussed the assignment operator, it just says put the number 42 there. So wherever malloc found 4 bytes of available memory for me, this fourth line of code says, go there and put the number 42 by putting the appropriate zeros and ones. This last line here-- and here's the bug, if we finally reveal it here-- this line is buggy. Does anyone see why? Yeah, over here. AUDIENCE: You haven't allocated memory for that variable yet. DAVID MALAN: Exactly. I haven't allocated memory for that variable yet. And because up here I've just said int star y semicolon, you can only safely assume that it has some garbage value, some unknown value, maybe remnants from some other part of the program. That might not necessarily be true here at the beginning of the program. But for safety's sake assume that if you don't give a variable a value, who knows what it contains. 
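The on-screen code is not reproduced in this transcript, so what follows is only a minimal sketch reconstructed from the narration. The function name pointer_demo is hypothetical, the NULL check and the call to free are additions for safety, and the buggy line is left commented out so that the sketch runs. It also includes one possible repair, pointing y at the same memory as x, which the discussion turns to next.

```c
#include <stdlib.h>

// Hypothetical wrapper around the five lines being discussed.
// Returns the value finally stored in the allocated int, or -1 if malloc fails.
int pointer_demo(void)
{
    int *x;                   // a pointer to an int, named x
    int *y;                   // another pointer to an int; its value is garbage for now

    x = malloc(sizeof(int));  // ask for enough bytes for one int
    if (x == NULL)
    {
        return -1;            // out of memory
    }

    *x = 42;                  // go to the address in x and store 42 there

    // *y = 13;               // the bug: y does not yet hold a valid address

    y = x;                    // repair: y now holds the same address as x
    *y = 13;                  // go to that address and overwrite the 42 with 13

    int result = *x;          // x and y share one chunk of memory, so this is 13
    free(x);                  // give the memory back exactly once
    return result;
}
```

Calling pointer_demo() from main and printing the result with %i would show the 13, not the 42, because both pointers refer to the same four bytes.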
It's got some bogus address, such that if you say star y, go to that bogus address, something bad is going to happen. And maybe you experienced this already in P Set 4 or prior, some kind of memory problem with your code, or a segmentation fault or seg fault. Bad things happen when you go to addresses that don't exist or that you don't even know where they are. So this line of code is bad. But we can do a little better. What if instead I do something like this? I actually assign to y x. So that just says put in y the same address that's in x. And then with this last line of code, what if I now say star y equals 13? What is that-- you're nodding your head. What am I doing correctly now? AUDIENCE: Now there's memory allocated for y. DAVID MALAN: Good. Now, there's memory allocated for y. So you're saying go to that address and put 13 there. However, what did we just do to the 42, just to be clear? We clobbered it. We overwrote it with the 13. Because if x and y are the same address, this says go to that address and put 42 there, but then two lines later, we say, no, no, no, go there and put 13 there instead. But long story short, bad things happen when you don't necessarily anticipate what is in memory and you don't allocate it yourself. So thanks to one of our friends at Stanford, allow us to take a moment here to hit Play on a short film, a claymation if you will, that paints the same picture in a more memorable way perhaps. If we could dim the lights. [VIDEO PLAYBACK] NARRATOR: Hey, Binky, wake up. It's time for pointer fun. BINKY: What's that? Learn about pointers? Oh, goody. NARRATOR: Well, to get started, I guess we're going to need a couple pointers. BINKY: OK, this code allocates two pointers, which can point to integers. NARRATOR: OK, well, I see the two pointers. But they don't seem to be pointing to anything. BINKY: That's right, initially pointers don't point to anything. The things they point to are called pointees. 
And setting them up is a separate step. NARRATOR: Oh, right, right. I knew that. The pointees are separate. So how do you allocate a pointee? BINKY: OK, well, this code allocates a new integer pointee. And this part sets x to point to it. NARRATOR: Hey, that looks better. So make it do something. BINKY: OK, I'll dereference the pointer x to store the number 42 into its pointee. For this trick, I'll need my magic wand of dereferencing. NARRATOR: Your magic wand of dereferencing. That-- that's great. BINKY: This is what the code looks like. I'll just set up the number. And-- NARRATOR: Hey, look, there it goes. So doing a dereference on x follows the arrow to access its pointee, in this case, to store 42 in there. Hey, try using it to store the number 13 through the other pointer, y. BINKY: OK. I'll just go over here to y and get the number 13 set up and then take the wand of dereferencing and just-- [BUZZER SOUND] oh. NARRATOR: Oh, hey, that didn't work. Say, Binky, I don't think dereferencing y is a good idea, because, you know, setting up the pointee is a separate step. And I don't think we ever did it. BINKY: Oh, good point. NARRATOR: Yeah, we allocated the pointer y. But we never set it to point to a pointee. BINKY: Mm, very observant. NARRATOR: Hey, you're looking good there, Binky. Can you fix it so that y points to the same pointee as x? BINKY: Sure. I'll use my magic wand of pointer assignment. NARRATOR: Is that going to be a problem like before? BINKY: No, this doesn't touch the pointees. It just changes one pointer to point to the same thing as another. NARRATOR: Oh, I see. Now y points to the same place as x. So, wait, now y is fixed. It has a pointee. So you can try the wand of dereferencing again to send the 13 over. BINKY: Uh-- OK, here goes. NARRATOR: Hey, look at that. Now dereferencing works on y. And because the pointers are sharing that one pointee, they both see the 13. BINKY: Yeah, sharing, whatever. So are we going to switch places now? 
NARRATOR: Oh, look, we're out of time. BINKY: But-- [END PLAYBACK] DAVID MALAN: All right, so now that we do have this power of pointers and addresses where we have low level access to the computer's memory, we can actually solve problems a lot more powerfully and in a lot more interesting ways. But first, let's motivate some of these problems. So back in Week 2, we introduced arrays, which was the first of our data structures, if you will. Before then in Week 1, all we had was variables for things like ints and chars and floats and so forth. In Week 2, we introduced arrays, which meant you could store two ints altogether or three or 10 or 100. So you can kind of encapsulate lots of data together. So unfortunately, though, arrays aren't quite as powerful as might be ideal. So, for instance, if we have an array with size 3 and we actually want to go ahead and store three values in it-- one, two, three-- suppose that we actually want to now store a fourth value, but we didn't anticipate that from the get go. Recall after all that with arrays you have to declare their size upfront. So you've got to hard code the number 3 or a variable containing the number 3. But suppose that we want to store the number 4. You might think that, well, just give me another box of memory just to the right of the number 3, so that I can keep all of my numbers together. But unfortunately, per last week, that's not really a reliable assumption, because in the context of the rest of your computer's memory, that 1, 2, 3, might be here surrounded by other bytes. And per last week those bytes might be mostly filled with other data from some other parts of your program. And yet you would think that in seeing that 1, 2, 3 is kind of painted into this corner, so to speak, that there's just no room for the number 4, and therefore you can't add the fourth number to your array, is there a solution visibly to this problem nonetheless? Where else could we put it? Yeah. 
AUDIENCE: Move it to off of other memory. DAVID MALAN: Say that a little louder. AUDIENCE: We can move it off to other memory. DAVID MALAN: Yeah, so maybe we can move it off to other memory. So there's a lot of EMMAs in my memory per last week, but there is still, it would seem, based on this picture, some unused memory. So maybe we could resize our array, grow it, not by just moving all of the EMMAs because frankly that would seem to take a lot of time if we had to shift all of these characters, why don't we just relocate the 1, 2, 3 down here, and that gives us an extra space for at least a number 4. So indeed even if you're using arrays, you can achieve this outcome by actually moving memory around. But consider what's involved in that. So if you've got our old array at top left, and we've got our new array at bottom right, that is of size 4. So we have plenty of room. How do we go about resizing the array? Well, it's kind of an illusion. You can't just resize the array when we have all of these EMMAs surrounding us. Instead, we actually have to move the array or copy it. So the 1 gets moved to the new memory. The 2 gets moved to the new memory. The 3 gets moved to the new memory. And then at that point, we can just throw away or free the previously used memory and now go ahead and add our 4. Unfortunately, this isn't necessarily the best strategy, right, because if these three lockers represent our original memory and these four lockers represents our new memory and they're deliberately far apart, that is to say that if I want to go ahead and move like these same numbers, I really have to do something like this, which involves quite a few steps. Let me go ahead and put the 1 in there now. Now, let me go ahead and get the 2 here. And then I can go ahead and put this in here. So now I've got the 2. And then lastly, I can go grab the 3. And so even though I did this pretty quickly on the screen, the reality is there's a decent amount of work to do. 
And then I still, of course, have to go ahead and add the 4 to the mix, which is to say that I've taken figuratively and physically quite a few steps in order to resize an array from size 3 to size 4, which is to say if we now consider the efficiency or, if you will, inefficiency of that algorithm, what kind of running time is involved when inserting additional numbers into an array as I've done here? Here's our menu of options from a couple of weeks ago when we focused on algorithms. What's the running time of insertion into an array based even on that simple demonstration would you say? What's the running time? Yeah. AUDIENCE: O n squared. DAVID MALAN: Say it again. AUDIENCE: O n squared. DAVID MALAN: O n squared. So maybe O n squared in that there was a lot of back and forth and we've seen that before. We've seen bubble sort and selection sort add up. It's not quite as bad as that. It's not quite as bad as that. Yeah. AUDIENCE: O of n. DAVID MALAN: O of n. And why do you say O of n? AUDIENCE: Because for as like as many lockers there are in the first one, you have to increment the same amount of processes to insert them. DAVID MALAN: Exactly. Whatever number of lockers you have here-- so that's three specifically-- but n more generally, it's going to take me n steps to transfer those numbers over here. Or technically, it's going to take me 3-- maybe if I go back and forth, it's like 6 steps. But it's some multiple of n. So it's not n squared. That's when we kept iterating again and again and again. This time I just have to move 3 numbers to here and then add the fourth number. So it's indeed, Big O of n when you want to go ahead and insert or search equivalently an array that's actually implemented-- sorry, insert is going to take us linear time. But search recall-- and this was the powerful thing-- what's the running time of search so long as you keep your number sorted? Per two weeks ago, that was logarithmic. So we haven't necessarily sacrificed that. 
And that's the appeal of storing our data in an array that's sorted. You can use binary search. However, this is expensive and moving things around isn't necessarily the ideal approach. So let's just consider what this might look like in code. Let me go over to CS50 IDE here. And let me go ahead and create a new file called list.c. And let's see if we can't represent in code exactly this idea. So let me go ahead and include for myself stdio.h just so that we can print out some values ultimately. Let me go ahead then and declare main-- int main void. And then down here, let's just arbitrarily start where we did with an array of three integers, called list, of size 3. So I'm just mimicking exactly where we started pictorially by having an array that was fixed at size 3. And then if I went ahead and initialized that list, I could just hard code-- that is type into the program itself-- those three values into bracket 0, 1, and 2, the numbers 1, 2, 3 respectively. So I'm just manually initializing that array to three values. And then just so that this program has some purpose in life, let me go ahead and do for int i equals 0, i less than 3, i++. And then, let's just print out these elements just for good measure. Each of them is an int. So we'll use %i. And then I'm going to go ahead and print out list bracket i. So kind of a Week 2 style program, where all I'm doing is hard coding an array of size 3, initializing it with three values, 1, 2, 3, 0 indexed, and then printing them out. So if I go ahead and save this and make list and then go ahead and run this with ./list, I should see hopefully 1, 2, 3. But there's a problem with this implementation fundamentally. Because I have hardcoded-- that is typed explicitly-- the size of this array, how can I go about adding a fourth element? What would I have to do? Well, I could change the code up here to 4. And then I could add another line here. And then I could change this. But then I have to recompile it. 
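The list.c being typed here is not shown in the transcript. A rough sketch of it, reconstructed from the narration, might look like the following. It is wrapped in a hypothetical function, print_fixed_list, rather than main, and it returns the number of values printed; both of those choices are assumptions made only so that the sketch can be called and checked from other code.

```c
#include <stdio.h>

// Hypothetical wrapper around the Week 2 style program described above.
// Prints the three hard-coded values and returns how many were printed.
int print_fixed_list(void)
{
    int list[3];      // the size is fixed at 3 when the program is compiled

    list[0] = 1;      // hard-coded values, 0 indexed
    list[1] = 2;
    list[2] = 3;

    int printed = 0;
    for (int i = 0; i < 3; i++)
    {
        printf("%i\n", list[i]);
        printed++;
    }
    return printed;
}
```

Storing a fourth value would mean editing the 3 in the declaration, adding another assignment, and changing the loop bound, which is the recompile problem being described.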
And so it's certainly not dynamic. But we did see a function last week that lets you allocate more memory dynamically. And just to be sure what was that function? So malloc. Right? Now that we have malloc, you don't have to type into your program source code from the get go a fixed number. You can actually allocate some amount of memory dynamically. Now, here just for demonstration's sake, we'll do it to achieve the same goal, but in a way that's going to scale a little more effectively. Recall from last week that if you want to get a chunk of memory from malloc, it's going to return the address of that chunk of memory. So that suggests that I can declare a pointer to an integer called list. And then let me go ahead and allocate, how about, three integers initially times whatever the size is of an integer. So this is a little weird looking, but consider what this is doing. malloc is being asked for 3 times the size of an int. So give me enough memory to fit three integers. By definition that returns a pointer, per last week. So we have to assign it to a pointer on the left. So list is a variable now, just like x and y from our previous example, that's storing the address of that chunk of memory. But what's cool about C is that now that you know that list is a chunk of memory, we can actually borrow that same square bracket notation from Week 2. And this code here doesn't actually need to change. If you use square bracket notation next to the name of a pointer, what's going to happen for you automatically is the computer is going to go to the first byte in that chunk of memory. This index is going to go to the next chunk of memory. This is going to go to the next chunk of memory, all within the scope of what malloc returned for you. And just as an aside, how many bytes are in an integer? AUDIENCE: 4. DAVID MALAN: 4. And recall I briefly mentioned the expression last week pointer arithmetic. 
What you're also getting sort of magically with this square bracket notation is that bracket 0, it happens to be byte 0. Bracket 1 is not the second byte. It's actually 4 bytes over. And bracket 2 is not the third byte. It's actually 8 bytes over, because you allocated 4 plus 4 plus 4, 12 bytes. And so this square bracket notation just jumps you to the right place in that chunk of memory, so that you can fit int, int, int. Yeah. AUDIENCE: Why do you allocate a pointer to an int rather than a pointer to an int array? DAVID MALAN: Why do you allocate a pointer to an int and not a pointer to an int array? In this context, arrays and pointers are in some sense the same. A pointer is an address of memory. An array is just a chunk of memory. And so even though we used chunks of memory in Week 2 by just calling them arrays, they really are just more generally chunks of memory that support square bracket notation. But now that we can allocate as much memory as we want, we can kind of use these two concepts interchangeably. And there are some subtleties in C. But this now has the same effect as Week 2. And this is the only new line from this week. But now if you're using malloc, even though I'm not going to do it in a more complicated program here, you can imagine now the code running in a loop and maybe allocating more memory and more memory and more memory when you need it, because malloc allows you to do just that. And we do need to do a couple of safety checks here. It turns out, per last week, that malloc can sometimes run out of memory. If your Mac or PC or the cloud runs out of memory for your account, well, you might want to check the return value. And so good practice would be, well, wait a minute, if list equals equals null, let me just go ahead and return 1, something went wrong, because my computer is out of memory for some reason. So best practice would say anytime you allocate memory, always check if you've gotten back null. 
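To make the malloc version concrete, here is a small sketch of the allocation, the null check, and the square bracket access described above. The function name make_list is hypothetical, and returning NULL to the caller stands in for the return 1 that the narration uses inside main.

```c
#include <stdlib.h>

// Hypothetical helper: allocates room for three ints and fills in 1, 2, 3.
// Returns NULL if malloc could not find enough memory.
int *make_list(void)
{
    int *list = malloc(3 * sizeof(int));  // enough bytes for three ints
    if (list == NULL)
    {
        return NULL;                      // always check before using the memory
    }

    list[0] = 1;       // square brackets work on a pointer just as on an array
    list[1] = 2;       // list[1] means the same thing as *(list + 1)
    *(list + 2) = 3;   // pointer arithmetic moves in units of sizeof(int), not single bytes

    return list;       // the caller is responsible for calling free
}
```

The third assignment is written with explicit pointer arithmetic only to show that it is equivalent to list[2] = 3.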
Now, let me just do something for the sake of demonstration. Let me move my window down here. Let me highlight these lines of code and just make the claim that highlighted here between lines 5 and 13 are lines of code that simply allocate a list of size 3 and store three values in it. That's the story where we left off a moment ago. Suppose now I change my mind and decide after line 13 or maybe elsewhere in this program if it were larger, you know what, I actually want another integer. I want to resize that array. Well, how can I go about doing it? Well, let me go ahead and do this. Let me go ahead and allocate, for instance, another address and say store at that address a chunk of memory corresponding to four integers using the sizeof operator as before. So temporarily, let me go ahead and give myself a new chunk of memory that is big enough to fit four integers instead of just three. Let me practice best practices and say, you know what, just in case, if temp equals equals null because I'm out of memory, forget it, I'm done with the program. We're not going to proceed anyway. But that's just good practice now. But now what do I want to do? If I now have two chunks of memory, this one is of size 3, this one is of size 4, what did we do the last time we wanted to move something around in the computer's memory, what did I physically do with the lockers? I think you're nodding. What did I do? AUDIENCE: You went through each and moved 1 to the-- DAVID MALAN: Yeah, exactly, I went through each one and copied the value from left to right, from old to new. And so let me go ahead and do exactly that. I think I can do this with a for loop, for int i gets 0, i is less than 3, because that's the size of the old array, that's size 3, i++. And then in this iteration, I can just do something like this-- use this new chunk of memory just like an array, because I claimed I can use my square bracket notation, and store at location i whatever is in the original list at location i. 
So this code here now, if I were to comment it, copy integers from old array into new array. And that too is just using a for loop from old to new. But now that's not quite everything I want to do. I also want to store at the location 3, 0 indexed, which means it's the fourth location, another value, number 4. That's why I put the additional number 4 into those lockers. So now with these lines of code, I have implemented the physical notion of copying all of the values from the old array into the new array. So I'm almost done, except what did we learn last week that you should do whenever you're done with a chunk of memory? Do I still need the original chunk of memory? AUDIENCE: No. DAVID MALAN: No. And how do I give it back to the computer? AUDIENCE: Free. DAVID MALAN: Free. So quite simply, I literally just call free, passing in the address of the chunk of memory that I want to free. And even though I'm passing in one address, the computer is going to do the heavy lifting of remembering how many bytes I asked for originally. You don't have to worry about that. You just say, whatever this is pointing at, go ahead and free it all. So now, you know what, now that I've gotten rid of that list, I'm going to update list to equal temp, which is just cleaning up the naming. Temp is kind of a stupid name for a list. Let me just reuse the original pointer and let list equal temp. And now down here if I've done everything correctly, it should suffice to print out that whole list. So let me save this. Let me give myself a bigger terminal window. Do make list again. A lot of mistakes. Let's see, first one is up here. Implicitly declaring library function malloc dot dot dot. What generally is the solution when you see implicitly declaring something? AUDIENCE: Header file. DAVID MALAN: Header file, which one do I want? Do you recall? This is subtle and we might not have used it last time if I used the CS50 library, but it's stdlib.h. That is where malloc is. That is where free is. 
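The resize steps just described (allocate a bigger chunk, copy old to new, add the new value, free the old chunk, reuse the original pointer) can be sketched as one function. The name grow_by_copy and its parameters are hypothetical; the lecture does this inline in main with the sizes 3 and 4 hard coded, whereas this sketch takes the old size as an argument.

```c
#include <stdlib.h>

// Hypothetical helper: returns a new list one element longer than the old one,
// with value stored in the last slot. On failure it returns NULL and leaves
// the old list untouched, so the caller still owns it.
int *grow_by_copy(int *list, size_t old_size, int value)
{
    // Ask for a new chunk big enough for one more int
    int *tmp = malloc((old_size + 1) * sizeof(int));
    if (tmp == NULL)
    {
        return NULL;
    }

    // Copy integers from old array into new array: one step per element, so O(n)
    for (size_t i = 0; i < old_size; i++)
    {
        tmp[i] = list[i];
    }

    // Store the additional value at the new last location
    tmp[old_size] = value;

    // The old chunk is no longer needed; free remembers how many bytes it was
    free(list);

    return tmp;
}
```

A caller would write something like list = grow_by_copy(list, 3, 4), after first checking that the result is not NULL, and would then print four values rather than three.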
So stdlib.h is one new header file that contains last week's functions. So let me try this again. Let me make list. Nice, this time it did compile. ./list and-- hm, I seem to be missing that fourth number. But I think this is just a stupid mistake on my part. What did I do wrong if I look at the printing of this array? AUDIENCE: Size of list. DAVID MALAN: Yeah, down here the new list is size 4. So frankly, if you recall a few weeks ago, I encouraged you, don't just hard code numbers, magic numbers, in your programs. We should really be using constants or some other variable. This is exactly why, because you, just like me, might overlook a detail like that. So let me recompile it. And let me do list. And, voila, there is my 1, 2, 3, 4. Now, to be clear, this is kind of a stupid program, because I sort of decided halfway through writing this program, wait a minute, I want four integers, not three. And, of course, at that point, you should just delete the earlier code you wrote. So this is just for demonstration sake. If you imagine this being a bigger program, that just over time the human decides maybe because get int is called that they need more memory, this is how you would do it using malloc. But it turns out there's a function that can actually make our lives a little easier here. So let me go ahead and clean this up just a little bit. It turns out that I don't have to allocate more memory myself, copy all of these things myself, and then also free it. I can consolidate a bunch of these lines as follows. Instead of using malloc, I can actually say realloc, which as the name suggests, reallocate a chunk of memory. What do you want to reallocate? Well, I want to reallocate the list. And this time I want to do 4 times size of an int. I'm going to store this in temp temporarily. I'm going to make sure that nothing went wrong, as by checking for null, which just means, hey, you might be out of memory. And then I'm going to return if so. 
But down here, if all is well, I'm going to go ahead and do this. And watch this. I now have simplified my code by quite a few lines. realloc, by definition-- this is another function in stdlib.h-- handles the process of taking an existing chunk of memory that you already asked for and resizing it to be this new size, whether it's bigger or smaller. It handles the copying of the data from old to new for you. And you just need to check that nothing went wrong, as by checking for null here, and then remember the new value. So this code now, which is just six lines of code, previously was more than that. And it's just a handy function to use. All right, a question from earlier. AUDIENCE: Why can't we just create this to the temp in the beginning, because if we equate this to the temp, then we equate this to the pointer perhaps, so this to the point to the 4 bytes of memory? DAVID MALAN: Really good question. So let me roll this back by rewinding. And all of the finished versions are on the course's website if you want to play with them later. This was the previous version using just malloc. If you just do this, update a new chunk of memory, as I think you're asking, what's happening is you are effectively orphaning the old chunk of memory. Because if you change what's stored in list and have it store the new chunk of memory, where'd the old chunk of memory go? It's sort of floating there in your computer's memory. But you've lost all pointers to it. There's no arrow anymore pointing to it conceptually. So that's why we have to jump through these hoops of having a temporary variable just so that we don't lose track of things we've allocated. And we'll see this later today with another data structure as well. Yeah. AUDIENCE: Somebody asked this, but I don't understand that if you initialize temp as a pointer toward integer, then does it not create problems that you use it as an array. DAVID MALAN: Good question. 
If you initialize temp as a pointer to an integer, does it not create problems that you're using it as an array? Short answer, no, because again an array by definition from Week 2 is just a chunk of memory. And in C you can use the square bracket notation to jump to random parts of that memory using simple arithmetic, bracket 0, 1, 2, and so forth. Last week when we introduced malloc and free and now realloc, you now have a more low level way of allocating as much memory as you want. So it's a more powerful, general purpose mechanism. But at the end of the day, you're still getting back a chunk of memory, contiguous memory, bytes that are back to back to back. So you can certainly still use the square bracket notation because essentially an array is a chunk of memory. And malloc gives you a chunk of memory, ergo you can treat it as an array. They really are equivalent in that sense. You just don't get as many user friendly features as with arrays, like them being freed for you, as never until last week did we have to free chunks of memory. Arrays do that for you automatically thanks to the compiler. Yeah. AUDIENCE: Do you not have to recreate the list for temp after line 37? DAVID MALAN: Yes. Thank you. So there is a bug here. And if I ran Valgrind on this code, I would see exactly that, that I'm leaking some number of bytes. So indeed, at the end of this program, I want to free the-- let's make sure-- yep, I want to free now the new chunk of memory, which is of size 4, to avoid exactly the problem you identified. Good catch. All right, so just for now, even if that's a lot of code, let's consider now higher level takeaways from this, just so that we can then motivate an alternative approach that allows us to stitch together somewhat fancier data structures instead. 
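For comparison with the manual copy, here is a sketch of the realloc approach together with the free that the audience member pointed out was missing. The function name grow_with_realloc and its parameters are hypothetical; the temporary pointer plays the same role as temp in the lecture, so that the original chunk is not orphaned if realloc fails.

```c
#include <stdlib.h>

// Hypothetical helper: resizes list to hold one more int and stores value in
// the new last slot. On failure it returns NULL and the original list is
// still valid, so the caller has not lost track of it.
int *grow_with_realloc(int *list, size_t old_size, int value)
{
    // realloc copies the old contents to the new chunk if it has to move them
    int *tmp = realloc(list, (old_size + 1) * sizeof(int));
    if (tmp == NULL)
    {
        return NULL;  // list was not freed; the caller still owns it
    }

    tmp[old_size] = value;
    return tmp;       // the caller must eventually free this, or Valgrind reports a leak
}
```

Assigning the result of realloc straight back to list would lose the only pointer to the original memory whenever realloc returns NULL, which is why the result goes into a temporary first.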
So in general, a data structure is just a programming construct in C, C++, Java, Python-- we'll see them in different languages over the remainder of the term-- that allow you to store information differently in your computer's memory. In C, everything we're about to do today is thanks to these three features of C. So even though this may feel like a lot of syntax, everything we do today boils down to three pieces of syntax that you have seen before. Struct, this recall was a keyword in C that allows you to create your own structure. For instance, a couple of weeks ago, we created a person structure, who had a name and a number. And that gives us our own data type that is structured to contain two values, like name and number. You use structures as well for the problem set involving bitmaps, the bitmap header and so forth. Those were data structures as well. What do we use the dot notation for just to be clear? And you definitely use this when manipulating red and green and blue pixels recently. Yeah. AUDIENCE: To access a property of a structure. DAVID MALAN: Exactly. To access a property of a structure. So if you want to get at a person's name or get at a person's number, you use the variable's name and then a dot operator to get inside of that data structure. So we've seen that before. Then last week and again today, we see the star operator, which can kind of mean different things in different contexts. But it's always related here to memory as of last week. This is the dereference operator that allows you to go to a chunk of memory by way of this thing called a pointer. So even if today feels like a bit of that fire hose-- and again per my email, this is where things now begin to level off-- realize that it all boils down to first principles, or if you will sort of three scratch-like puzzle pieces that we're now going to assemble into more interesting solutions to problems. So allow me to introduce something called a linked list. 
A linked list, as we'll see is going to allow you to store a list of values. Now an array allows you to store a list of values. But what are some of the downsides with an array? Well, an array is a fixed chunk of memory. And if you want to resize that array to add more values to it, what do you have to do? Well, you minimally have to allocate more memory. You need to copy all of those values from old to new. And then you can go about your business. Now, realloc is a function that makes that a little simpler. But realloc is doing the exact same legwork that I was doing between the lockers of copying value, freeing memory, and so forth. So it needs to be done. And that's why insertion into an array is going to be big O of n, because it might take you that much time to copy the whole array into a new space. So that feels kind of suboptimal, right? Arrays can be slow in that sense. But what was the appeal of an array? Like what's good about arrays? Because we don't want to abandon them entirely. Yeah. AUDIENCE: You can index into them really easily. DAVID MALAN: You can index into them really easily, not only syntactically with the square brackets, but you have constant time access-- this is known as random access. And it's not random in the sense that you just end up who knows where. You can just jump to bracket 0 or 1 or 2 instantly. It took me, the human, more time because I had to physically walk. But a computer is going to be able to jump to 0, 1, 2, 3 instantly. And so arrays are super fast. And they lend themselves to things like binary search as we've seen now for some time. But what if we use the canvas that is our computer's memory like a little more cleverly? We don't have to just plop things next to each other, next to each other, next to each other, and then hope for the best hope that there's still more memory back to back to back. What if we instead are a bit more clever about it? And suppose we want to store the number 1. 
And that happens to be at address 0x123. It's arbitrary. But recall from last week that every byte of memory in your computer is stored somewhere. So let's propose that 1 is stored at 0x123. Suppose now that this represents an array of size 1 and you want to add a second value to this array. Or let's start calling things more generally a list. A list like in the real world is just a list of values. This list is of size 1. Now maybe there's a lot of EMMAs in this memory that are getting in the way. But suppose that there is some free space a little lower in your computer's memory over there. So it's not here. It's not here. It's not available here or here or here. There's other stuff there. But suppose the computer does have some available memory over here in which you can store the number 2, just because. And that address happens to be 0x456. Finally, you want to store a third value. And it turns out that the nearest possible location is down here, number 3. That happens to be at address 0x789. So this is not an array by definition, because the 1, the 2, the 3 are not contiguous back to back to back. You cannot use square bracket notation here because square bracket notation requires, per Week 2, that all of your values be next to each other, just like the lockers here. This picture, where 1 is over here, 2 is over here, 3 is over here is more like, oh, maybe this is 0x123. Maybe this is 0x456. Maybe this is 0x789. They're kind of all over the place. And that's just because that's what's available in your computer's memory. But what if I get a little extravagant and I start to use, not just one chunk of memory to store each value, like 1, 2, 3, what if I go ahead and give myself twice as much memory just to give myself some flexibility? So I now conceptually use this chunk of memory to represent 1, this chunk to represent 2, this chunk to represent 3. But you know what I'm going to use the latter half of each of those chunks for? Any thoughts?
AUDIENCE: Address to the next. DAVID MALAN: An address to the next chunk of memory. So, for instance, if my goal is to keep this list sorted, so I want conceptually to have a list that stores 1, 2, 3, why don't I use this as sort of a map or a breadcrumb, if you will, that points to the next chunk of memory? And why don't I use this chunk of memory to point at the next one? And then this chunk of memory, you know what, I just need a special value here. What would be a good arbitrary way to say, mm, mm, there's nothing more in the list? AUDIENCE: Null. DAVID MALAN: It's something called null. And this technically is different from backslash 0, which is a char. This is something called-- well, this is in hexadecimal 0. Now, starting today-- and we saw this super briefly last week-- this is n-u-l-l with two L's-- this was stupid left hand not really talking to right hand-- n-u-l is backslash 0, which is a char. n-u-l-l is a pointer. But they both equal 0 underneath the hood. So you just store a special value that says that's it for the list. Now, last week I proposed, who really cares where things are in memory? So indeed, let's do that again. Let's just use pointers drawn as arrows in this artist's rendition to say this list of numbers, 1, 2, 3, is now linked. A linked list is just a data structure containing multiple chunks of memory that are somehow linked together. And underneath the hood, so to speak, they're just linked together by way of pointers. And the price we pay is that rather than now in a linked list storing just the numbers 1, 2, 3, which we could have in an array, now you have to store twice as much information, 1, 2, 3, as well as three pointers, two of which are in use, the other of which is ready to go if I want to add something to this list. So this is to say we can now create structures that look like this in the computer's memory just by using this new feature of pointers. What might these structures look like individually?
Well, any one of these numbers has two fields it seems. One is an integer. We'll call it number. And then there's another field here. Let's call that next by convention, but we could call it anything we want. It's just another chunk of memory that's pointing to the next element in the list. Well a couple of weeks ago, we introduced persons. And a person had a name and a number. That's not relevant today, because we're not dealing with names and numbers. We're just dealing with integers. So let me propose that we back that up and still use the same syntax as a couple of weeks ago. But instead of defining a person, let's call this rectangle a node. So this is a term of art in computer science. Node-- n-o-d-e-- just represents this rectangular concept, a chunk of memory that you're using for interesting purposes. It's sort of a node in a graph if familiar from math. But what do I want this node to store? Well, let me go ahead and store a couple of things in it. One, a number, and that's just going to be an int. And I'm going to go ahead and call it number. And then any guesses as to what the second field should be declared as? I want to call it next just because it's conventional. What should its data type be? Any thoughts? Yeah in back. AUDIENCE: A pointer. DAVID MALAN: A pointer. And a pointer to what would you say? AUDIENCE: The next number. DAVID MALAN: A pointer to the next number, and not quite the next number per se, because now we have not numbers only, we now have nodes. So those three yellow boxes, 1, 2, 3, those are now nodes, I would say. So you know what? Let's go ahead and call this node star. But you can't technically quite do this. It turns out that C, recall, takes you super literally. And notice, if you read this code top to bottom, left to right, at which point in the story does the word node come into existence? Like not until the very last line. That's where we mentioned person. This is where we mentioned node.
So, unfortunately, you can't use a node inside of a node, because it literally doesn't exist in the computer's mind until two lines later. So there's an alternative here. There is a solution. It's a little more verbose. Instead of just saying typedef struct, you actually say typedef struct node. Just add the word that you want to use. And now down here, you can say struct node star next. It's kind of like a work around. This is the way C works. But this is the exact same idea. This means that any node in the data structure, any yellow rectangular box has a number and a pointer. And that pointer by definition is a pointer to a node structure. We just have to express it more verbosely here, because the shorthand notation node doesn't exist until the bottom. It's just sort of an annoying reality of the syntax. All right, any questions on that definition of struct? Yeah. AUDIENCE: Do you have to put node twice? DAVID MALAN: Do you have to put node twice? You don't have to put node twice. You can actually use any word here you want. You can call this x or y or z. But then you're going to have to make this be struct x or struct y or struct z. By convention, I would just reuse the same term. So this is the formal name for this structure. It is a struct node. This is the nickname for the structure, more succinctly, node. And that's what typedef does. It gives you an alias from struct node to just node, just because it's easier to type elsewhere in the program. Other questions? All right, so what can we do now with this structure? Well, let's go ahead and build something up here. All right, so this is about as scary as the code gets today. We'll focus primarily on pictures and concepts hereafter. But let's take a tour through one implementation of this same idea of a linked list. How would we go about representing a linked list initially? Well, initially the list is empty. And if you want to represent something that's empty, we minimally need something. 
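The node definition described above can be sketched in C roughly as follows. The field names number and next come from the lecture; the exact layout on the slide is not reproduced here, so treat this as an illustration rather than the slide itself.

```c
// Each node stores one integer plus the address of the next node in the list.
typedef struct node
{
    int number;
    // Must be written "struct node" here: the shorter alias "node"
    // does not exist until the final line of this definition.
    struct node *next;
}
node;
```

After this definition, struct node is the formal name and node is the alias, so a variable can be declared as either struct node *n or node *n.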
So let me draw it as this empty box. And this is just a pointer to a node, I claim. So how do I implement the notion of a linked list that has no numbers in it yet? Well, why don't we just use this, which I can implement as follows. node star, and I'm going to call it list, but then set it equal to NULL. Right, if there's no numbers available-- there's no 1, there's no 2, there's no three-- I should at least have a variable that connotes there is no list. And the easiest way to do that is in the absence of a value, store 0, which has this new nickname as of last week and this, called null. So this variable here represents this picture here. And notice, there's no numbers, because the list is empty. But we do initialize it to NULL so that we don't think that there's an arrow pointing to some specific chunk of memory yet. Because there isn't yet. Now, suppose I want to go ahead and insert a number into this list. Suppose I want to insert the number 2. I can't just allocate space for 2 now. I have to allocate space for 2 and that pointer, otherwise known together as a node, per the previous slide. So how do I go about doing this? Well, in code, I can borrow the same technique that we've used a couple times now, even though it's uglier than some past approaches, malloc then an integer. How many bytes do you want? I don't know how big a node is. I could probably do the math and add up the integer and then the pointer. But, you know what, size of node is just going to answer that question for me. So this returns that chunk of memory that's big enough to store a node. And I'm going to store that just for temporarily in a variable called n, n for node, and that's going to just be a temporary variable, if you will. So, again, even though there's some new stuff going on here, this is just like before. Previously, I wanted to allocate an integer. Now, I want more than an integer. I want an actual node. And malloc returns an address, which means I must assign it to a variable. 
That's an address on the left hand side. All right, what should I always do? Slight spoiler because I clicked ahead a moment ago-- actually, we're going to forge ahead here. This is the ugliest thing we'll see. What is this second line of code doing here? What's going on here do you think? Yeah, what do you think? AUDIENCE: It's setting the number of that node to 2. DAVID MALAN: It is. It's setting the number of that node to 2. But why this crazy syntax, which we've never used before? Well, star n, we did see last week. That just means go there. The parentheses are just necessary for order of operations so that the compiler knows, OK, first go there. And then once you're there, what do you want to get access to? The number field. So use the same dot notation. So it's super ugly. But it's just doing two different things that we've seen in isolation. Go to the address in n, which is that chunk of memory. And then access the number field and set it equal to 2. Fortunately, C has some syntactic sugar, just an easier, prettier way of doing this. And it happens to look wonderfully like the actual thing we keep drawing-- this arrow notation. So if you ever see and you ever write this notation in C-- and I'm pretty sure this is the last new syntax we'll see-- this arrow, this very sort of hackish arrow where you hit a hyphen and then a greater than sign, this means the exact same thing as this. This is just annoying to type. It's ugly to look at. This is just slightly more pretty. And frankly, it's reminiscent of the pictures we've been drawing with the arrows pointing left and right. What's the next thing I want to do? After allocating this new node for the number 2, what do I want to put as well in that node? AUDIENCE: Put in the address. DAVID MALAN: Sorry, a little louder. AUDIENCE: The next address. DAVID MALAN: The address of the next node. But there is no next node yet. So what value could I use as a placeholder? AUDIENCE: Null. DAVID MALAN: Null.
And so indeed, I'm going to do this arrow notation as well. You never need to do the star and then the dots and the parentheses. Everyone just writes the code like this in the real world. So n arrow next gets null. That now gives me that picture we were drawing. But, again, sanity check, if you ever use malloc, you should always check the return value. So just to be super precise, let me go ahead and add a couple more lines of code that just check if n is not null, go ahead and do the following. Conversely, I could check if n is null and then just exit or return depending on where I'm using this code. But you don't want to touch n and use this arrow notation unless you're sure n is not null. So what have I just done? My picture now looks like this. But this, of course, is not a linked list, because there's no linkage going on. I really need to do the equivalent of pointing an arrow from this pointer to this structure. I need to implement an arrow that looks like this. So how can we go about implementing that in code? Well, let me propose that this is what it ultimately looks like. We just need to draw that arrow. How do I do that? Well, it's as simple as this. If list is a variable, and it's previously initialized to null-- it's just a place holder-- and n is my temporary variable storing the new node, it suffices to say, well, lists should not be null anymore. It should literally equal the address of that chunk of memory I just allocated. And that's how we get this picture now inside of the computer. Now, let me do a couple of more operations. Suppose I want to add to the list the number 4. How do I add the number 4? Well, the number 4 is inside of its own node. So I have to go back to code like this. I need to allocate another node that installs the number 4 there. But that's not all. You don't want to just create the node, because it's otherwise out there in no man's land, so to speak. We now need to add the arrow. 
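The steps for creating a single node (malloc, check for NULL, set number, set next) can be sketched as below. The helper name make_node is hypothetical and not from the lecture, which writes these lines inline; the sketch uses both the (*n).number form and the arrow form only to show that they mean the same thing.

```c
#include <stdlib.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

// Hypothetical helper: allocate one node holding `number`, not yet linked
// to anything. Returns NULL if malloc could not find enough memory.
node *make_node(int number)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return NULL; // never touch n->... when n is NULL
    }
    (*n).number = number; // go to the address in n, then access the number field
    n->next = NULL;       // arrow notation: same "go there, then access" idea
    return n;
}
```

With an empty list declared as node *list = NULL, linking the first node is then just list = n once n is known not to be NULL. The caller is responsible for eventually passing the node to free.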
But now it gets a little non-obvious how you update the arrows, right, because I don't want to update list to point at 4, because that's going to sort of orphan, so to speak, number 2. And it'd just kind of float away conceptually. I really want to update 2's pointer to 4. So how can I do that? Well, you know what I can do is I can kind of follow these breadcrumbs. If I declare a temporary pointer-- and I'll do it, a little extravagantly like last week, with this little pointer notation-- if I'm a variable called tmp, I can go ahead and point at the same thing that list is pointing at. And I'm going to check is this next value null? If it is, I found the end of the list. So really I can follow that arrow. Now, I know that I'm at a null pointer. So now, I just want to draw this number up here. And I accidentally advanced the screen. I want to actually draw this arrow up here. So how do we go about doing that? Well, the code there might look like this. So if all I want is to point at a node, as I just did with the big fuzzy hand, I can just initialize this pointer to equal whatever the list itself is pointing at. Then, I can do like a while loop. And it's a little weird looking, because I'm using some of my new syntax. But this is just asking the question, while the next field I'm pointing at is not NULL, go ahead and follow it. So again, this is as complicated as the syntax today will get. But this is just saying, whatever I'm pointing at, point specifically at the next field. While it is not NULL, go ahead and update yourself to point at whatever it is pointing at. So if I advance to the next slide here, this is like I'm initially pointing at 2. I see an arrow. I'm going to follow that arrow. I'm going to follow that arrow. So however big the list is I just keep moving my temporary pointer to follow these breadcrumbs until I hit NULL. So here in the story let me propose that we add another number, 5. And 5, of course, if we keep it sorted, it's got to go over here.
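Following the breadcrumbs to the end of the list and attaching a new node there might look like the sketch below. The function name append_node and the choice to return the head of the list are assumptions made for illustration; the variable name tmp and the while loop on tmp->next follow the description above.

```c
#include <stddef.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

// Hypothetical helper: attach node n (whose next is already NULL) to the
// end of the list and return the head of the list.
node *append_node(node *list, node *n)
{
    if (list == NULL)
    {
        return n; // empty list: the new node becomes the whole list
    }
    node *tmp = list; // point at the same thing list points at
    while (tmp->next != NULL)
    {
        tmp = tmp->next; // follow the arrow to the next node
    }
    tmp->next = n; // tmp is now the last node, so draw the new arrow
    return list;
}
```

Because the loop visits every node before it can attach the new one, appending this way takes time proportional to the length of the list.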
And again, they're all over my computer's memory. They're not in a perfectly straight line, because who knows where there's available space. But now that I found this, I want to go ahead and create one more pointer using code very similar to what we just saw. But now lastly, let's do one more here at the beginning and then one more in the middle and see what can go wrong. What is worrisome about 1 if we actually want to store this list in sorted order? What might I be mindful of now if the goal is to insert 1 into this linked list? Any thoughts? What do I want to do first? Well, you know what, let me go ahead and just point-- you know what, it's obviously got to go to the start of the list if I want to keep it sorted, so that the arrows eventually go from left to right. So let me go ahead and just use code like this to allocate the new node. And let me go ahead and just move that arrow like this. This is wrong even though we've not seen the code for it. But why is this wrong? Yeah. AUDIENCE: You're orphaning 2, 4, and 5. DAVID MALAN: I'm orphaning 2, 4, 5. In what sense? I mean literally in my program, the only variables and the variables I have are those you see on the board here. So if nothing is pointing at 2 anymore, it doesn't matter that 2 is pointing at 4 and 4 is pointing at 5, we have orphaned 2 and transitively 4 and 5. So those are just lost. That is a memory leak. If you recall using Valgrind and getting yelled at by Valgrind because you're leaking memory, it might be because, yes, you've forgotten to free memory. Or worse, you might have completely forgotten where the memory is that you were using. And by definition of your own code, you can never access that memory again. You've asked the computer for it, but you're never able to give it back because you have no variable remembering where it is. So we don't want to do that. We instead want to do this probably. Let's point 1 to 2 first, which is kind of redundant, right? 
Now, we have sort of conflicting beginnings of the list. But once 1 is pointing to 2, what can you next update? AUDIENCE: The list. DAVID MALAN: List to point at 1. And you can do this in code if you'd like really with just two steps. You can update the next field of your new node, which is the one representing 1 that I just allocated, and you can initialize it to point at whatever list is pointing at. So if you want this thing to point at the same thing that this thing was pointing at, you literally just say in code n arrow next equals whatever list is pointing at and then you say the list should equal n itself. And again, you'll see in section this week and in the upcoming problem set actual opportunities to apply these kinds of lines of code. But those are the kinds of thought processes that you should be mindful of. Now, 3 is the only one that's particularly annoying. And we won't look at the code for this. If we actually want to put something in sorted order in the middle of the list, let's just consider conceptually what's got to happen. We've got to allocate memory for the node. We then need to update what? We probably don't want to point 2 at 3 for the exact same reason you identified. We then orphan 4 and 5. So what should we update first conceptually? AUDIENCE: 3 to 4. DAVID MALAN: Update 3 to 4, so it is going to look like this. And now we can update 2 to 3. And I'm going to wave my hand at the code for this only because there's multiple steps now. You have to probably have some kind of loop that iterates over the existing list, finds the appropriate location using less than or greater than, trying to find the right spot. And then you have to manipulate the pointers to do that. You won't need to do something as complicated as that for the upcoming problem set 5. But it is just boiling down to some loops, some inequality checks, and then some updates of the pointers.
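The lecture deliberately skips the code for inserting in sorted order, so the following is only one possible sketch of the loops, inequality checks, and pointer updates it alludes to. The function name insert_sorted and its return-the-head convention are assumptions. Note the order of the two pointer updates in each case: the new node is pointed at its successor first, so nothing is ever orphaned.

```c
#include <stddef.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

// Hypothetical helper: insert node n into a list kept in ascending order
// and return the (possibly new) head of the list.
node *insert_sorted(node *list, node *n)
{
    // Case 1: empty list, or n belongs before the current first node.
    if (list == NULL || n->number < list->number)
    {
        n->next = list; // first point the new node at the old beginning
        return n;       // only then does the list start at the new node
    }

    // Case 2: walk until the node after tmp is missing or not smaller than n.
    node *tmp = list;
    while (tmp->next != NULL && tmp->next->number < n->number)
    {
        tmp = tmp->next;
    }
    n->next = tmp->next; // e.g. 3 points at 4 first
    tmp->next = n;       // then 2 points at 3
    return list;
}
```

Inserting 2, 4, 5, 1, 3 in that order with list = insert_sorted(list, n) each time exercises the end, beginning, and middle cases from the discussion above.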
But it's easier generally to add stuff at the end and even easier to add things at the beginning, especially if you don't care about maintaining any kind of sorted order. Phew. Any questions on that? Yeah. AUDIENCE: Back to the beginning, like the code you had, what's the difference between node with star and like a pointer n of type node? DAVID MALAN: A pointer n of type-- let me just scroll back to the code, here? AUDIENCE: Yeah. DAVID MALAN: OK, so this is malloc is going to give us a chunk of memory that's big enough to store node. Node star n gives us a pointer that is the address of a node. And therefore we're going to assign the return value of malloc to that variable, so that n effectively represents a chunk of memory that's big enough to store a node. AUDIENCE: So n is node, not a pointer? DAVID MALAN: n is a pointer to a node. n is a node star, or a pointer to a node. And what does that mean? n is the address of a node. And that should make sense, because malloc returns an address. But this is why we're now using arrow notation. n is not a node. You can't do n dot number and n dot next. You have to do the star thing and then the dot. Or more succinctly now, you do an arrow number and arrow next. Good question. All right, let's see if we can't make this a little more real with maybe one demonstration here. Let me go ahead and put on the screen the end point that we want to get to, which is that here with everything in order, I think for this we need maybe five volunteers if we could. Let me go a little farther in back. OK, one over there. Maybe two over there. Now the hands are-- OK, 3 over here if we can go in back up over there. OK, 4 being volunteered by your friend. And 5 being volunteered by your friends. Do you want to come on up? All right, come on up. Come on up. Brian's going to help me run this demonstration. If all five of you could come on over, come on over here where we have some space. 
All right, let me get you some microphones and introductions. OK, thank you. Two of them were bravely volunteered by the people sitting next to them. So props to both of you. You want to say hello and a little something about yourself. AUDIENCE: Hello. My name [? Siobhana. ?] DAVID MALAN: [? Siobhana. ?] And year or-- AUDIENCE: I'm a sophomore. DAVID MALAN: Sophomore. OK. Nice to meet you. AUDIENCE: Hi. I'm a senior. DAVID MALAN: It's nice to have you too. Yeah. AUDIENCE: Hi, I'm Athena a sophomore in FOHO. DAVID MALAN: Athena. AUDIENCE: Hi. I'm Anurag. I'm a first year at Matthews. AUDIENCE: I'm Ethan. I'm a first year at Weld. DAVID MALAN: OK, and-- AUDIENCE: I'm Sarika. I'm first year in Thayer. DAVID MALAN: Wonderful. All right, thank you all for volunteering. Let's go ahead and do this. You for the moment represent a heap of memory, if you will. So if you could maybe all back up over here just to where we have some available space. We're going to need one of you to represent the list. Siobhan was it? AUDIENCE: [? Siobhana. ?] DAVID MALAN: [? Siobhana, ?] come on up. [? Siobhana, ?] do you want to go ahead and represent list. And to represent our actual list, we have-- or Brian-- yeah, we have a name tag, hello, my name is list. So you're going to represent the rectangle on the left that represents the linked list itself. And now initially we're going to go ahead initialize you to null. So you can just go ahead and put that behind your back. So you're not pointing at anything. But you represent list. And there's nothing in the list, no numbers in the list. What was the next step? If the goal at hand is to insert 2, 4, 5, 1, 3, we want to do what first, what lower level operation to get 2 in there? What was the first line of code? AUDIENCE: malloc. DAVID MALAN: malloc. So we want to malloc a node for 2. So let's go ahead and malloc. OK, come on up. So malloc. And what's your name again? AUDIENCE: Ethan. DAVID MALAN: Ethan. OK. And what do we need to give Ethan? 
Ethan has two values or two fields. The first one is the number 2. Thank you. The next one is a pointer called next. Now, you're not pointing at anyone else. So you'll put it behind your back. And now what do we want to do with? [? Siobhana, ?] what do we have to do? AUDIENCE: Point to-- DAVID MALAN: Point to? AUDIENCE: 2. DAVID MALAN: Him, yes, number 2. OK, so this now represents the picture where we have list here, 2 here, but the null pointer as well. All right next we wanted to add 4 to the list. How do we go ahead and do this? Well, with 4, we're going to go ahead and malloc. malloc, all right. And now, Brian has a lovely number 4 for you and a pointer. What do we want to do with your pointer? AUDIENCE: Not point it. DAVID MALAN: Not point at anything. Now, it's a little more work. And I need a temporary variable. So I'll play this role. I'm going to go ahead and point at wherever Siobhana is pointing at in sort of unnaturally this way. That's OK. We couldn't get hands that point the other way physically. So we're going to point at the same thing here. You're both pointing at 2. And what am I looking for in order to decide where to put 4? AUDIENCE: If it's greater than. DAVID MALAN: If it's greater than some value. So I'm going to check. Well, 4 is greater than 2. So I'm going to keep going. And your name was Eric? AUDIENCE: Ethan. DAVID MALAN: Ethan, sorry. So, Ethan, what are you pointing at? Nothing. So that's an opportunity. There's nothing to his right. So let me go ahead and have Ethan point at-- what's you're name again? AUDIENCE: Athena. DAVID MALAN: Athena. Also, unnaturally, but that's fine. And so now does Athena need to update her pointer? No, she's good. She represents the end of the list. So her pointer can stay behind her back. All right, let's go ahead and malloc 5. You want to be our 5? So now we need a 5. So we need to hand you the number 5. And what's your name again? AUDIENCE: Sarika. DAVID MALAN: Sarika. 
All right, so Sarika's holding the number 5. She also is going to get a pointer called next. What should Sarika be pointing at? AUDIENCE: Nothing. DAVID MALAN: Nothing. And now how do I insert her into the right place? Well, I have to do the same thing. So I'm going to get involved again and be a temporary variable. I'm going to point at the same thing [? Siobhana ?] is pointing at, which is Ethan. I'm going to follow this and see, ooh, wait a minute, he's actually pointing at someone else. So I'm going to follow that. It's still number 4. So I want to keep going. Oh, wait a minute. Athena is not pointing at anyone. This is an opportunity now to have Athena point at 5 and voila. But are you going to change your pointer yet? No. Now things get a little more interesting. Could we go ahead and malloc 1? And what's your name again? AUDIENCE: Emma. DAVID MALAN: Emma. OK, Emma, we have the number 1 for you from Brian. You have a pointer, which should be initialized as well to null. And now we have a couple of steps involved here. What do we want to do first? What's your proposal? AUDIENCE: Temporary pointer. DAVID MALAN: Temporary pointer. So I'm going to point at the same thing [? Siobhana ?] is pointing at, which is Ethan here. But I see that 2 is greater than 1. So what do I actually want to do? Well, let me incorrectly for a moment-- [? Siobhana, ?] could you point at number 1? What have we just done wrong? We've orphaned everyone else. And even more visibly now, no one is pointing at Ethan or beyond, which means we've just leaked that memory, never to be recovered or freed. So we don't want to do that. Undo, Control-Z. What do we want to do instead? What's your name again? AUDIENCE: Emma. DAVID MALAN: Emma. What do you want to point at? AUDIENCE: I want to point at [? Siobhana. ?] DAVID MALAN: At the same thing [? Siobhana ?] is pointing at, which is equivalent then to Ethan. So go ahead and do that with your-- OK, sort of like Twister now. That's OK.
And then [? Siobhana, ?] what do you want to point at? Perfect. So again a bunch of steps involved, but it really is just two or three steps depending on which pointers we want to update. And then lastly, let's go ahead and malloc 3. And your name was again? AUDIENCE: Anurag. DAVID MALAN: Anurag. So then we have a 3 for you from Brian. We have a pointer for you. It's initialized initially to null. So you can put that behind your back. I'm going to point at the same thing as [? Siobhana. ?] And here we go. 1 is smaller. 2 is smaller. 4 is larger. So let's get this right. And who do we want to point at whom first? AUDIENCE: 3 points at 4. DAVID MALAN: 3 should point at 4. So go ahead and do that. And you can step a little forward just so it looks a little less awkward. And then lastly, big finale, Ethan, who do you want to point at? Number 3. And thankfully, all these steps later, we have a linked list. We have wonderful souvenirs for each of you. We just need the numbers back. Thank you to our volunteers here if we could. OK, you can keep that. You can put the numbers on the desk here. So as these folks head off the stage, let me point out now one of the shortcomings of the approach-- thank you all so much-- of what we've just done here. Even though you all had the luxury of seeing 1 and 2 and 3 and 4 and 5 and you could even sort of immediately figure out where things go, even these linked lists are implemented with chunks of memory. And just like with arrays, or equivalently lockers, the computer can only look at one piece of memory at a time, which is to say that to a computer, that same linked list kind of looks like this. It's sort of blind to the specific numbers in that linked list until it opens each of those doors of memory. So to find where 3 goes or to find where 5 goes or to find where 1 goes, all of those doors maximally might need to be opened one at a time to find that value. So with linked lists we have gained a feature.
We have gained the ability to add dynamically to the list, right. I just kept malloc-ing, malloc-ing, malloc-ing additional students and additional numbers. So the list can grow as big as we wanted it to. In that case, we had five. We could have done it 50 times or 500 to add more and more numbers. That's an upside, because we don't have to waste time inserting into an array by resizing it and moving all of the original contents. None of our volunteers had to move, technically speaking, just to insert a number 5 or number 3. They just had to point at someone else in memory or someone else on stage. So if your data structure now is a linked list that looks like this, we've paid a price for that dynamism. We've paid a price for the ability to resize our list without moving everything around that's already there. What is a downside that you might perceive of a linked list? What have we lost or given up here? AUDIENCE: We lost random access. DAVID MALAN: Sorry. Say again. AUDIENCE: We lost random access. DAVID MALAN: We lost random access. That's spot on. We've lost random access. Why? Because the way you get from the beginning of the list to the end is by following these pointers, following these arrows, these breadcrumbs, you can't just jump to the middle element, even though obviously this one here on the screen to all of us humans, that's the middle element. You don't know that if you're the computer. If the main variable that's storing this data structure is the pointer, like [? Siobhana, ?] pointing to the first element of the list, you're going to have to follow all of these arrows, frankly counting them up and then retroactively realizing, oh, there was 5. I passed the middle one earlier. You've lost random access. And what algorithm have we used wonderfully in the past when we do have random access? AUDIENCE: Binary search. DAVID MALAN: Binary search. So we've lost the ability now to do binary search as efficiently as we once were able to.
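To make the loss of random access concrete, here is a small sketch of what "jump to element i" and "is this number in the list" have to look like for a linked list. Both function names (node_at and contains) are hypothetical and not from the lecture. With an array, the first would be a single list[i]; here it is a loop that follows index arrows.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

// Hypothetical helper: return the node at 0-based position `index`
// (assumed non-negative), or NULL if the list is too short.
const node *node_at(const node *list, int index)
{
    const node *tmp = list;
    for (int i = 0; i < index && tmp != NULL; i++)
    {
        tmp = tmp->next; // one arrow per step: no way to skip ahead
    }
    return tmp;
}

// Hypothetical helper: linear search, opening one "door" at a time.
bool contains(const node *list, int number)
{
    for (const node *tmp = list; tmp != NULL; tmp = tmp->next)
    {
        if (tmp->number == number)
        {
            return true;
        }
    }
    return false;
}
```

Both loops may visit every node in the worst case, and that cost is why binary search no longer pays off on this structure even when the numbers are sorted.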
And so if we consider now the running time of linked lists, unfortunately, we've paid that price. Searching now has gone back up to linear. We no longer have logarithmic running time because of the fact that we're stitching together this data structure. And the only way to find the end of the list, the middle of the list is to follow all of these arrows. You can't just jump to one location. Meanwhile inserting into the list, in the worst case, is going to be big O of n, linear as well, because you have to walk through the whole list to actually find a spot for the given number, especially if you're trying to keep it sorted. So it would seem that even though we've gained this feature of much more dynamic insertion and we're building up something more interesting in memory, and you can imagine this just taking much less time overall, because you don't have to keep moving everything around like we did with realloc, it's unfortunately something we're paying a price for. But that was a lot. Let's go ahead and take our 5-minute break. Fruit awaits outside. And we'll be back. All right, so during break, I whipped up one final example of our list program. This one uses all of those building blocks. And let's see if we can't follow along pictorially and code-wise what it is we just built with all of these humans on stage. So here is list3.c. It's available online. So you can follow along at home afterward if you'd like. And let's just walk through the lines that are written for us in advance. One, I'm using standard I/O for printf. And I'm using stdlib for malloc and free, our new friends that give us dynamic memory. Here is the definition of a node that again has a number inside of it and a pointer, specifically a pointer to another node structure. So that's what each of our humans represented, this time now in C. What is my main program going to do?
Just for the sake of demonstration, the goal at hand is just to write a program that initializes a linked list to nothing initially, then adds a node with 1, then adds a node with 2, then add a node with 3. We'll keep it simple and not add 4 or 5 this time. So how am I going to do this? Well, on line 17, I propose that we create a variable called list and have it be the address of a node. So if I were to draw this now pictorially, it's going to be just like our demonstration a bit ago, where I have a rectangle here called list. And initially, it's not pointing to anything. So I'm just going to leave the box blank to represent NULL. So that's that line 17 right here. Now, let me go ahead and do the following. Add a number to the list as follows. Line 20 just gives me enough memory for a node. And it stores that memory's address in a variable called n. Lines 21 through 24 are just a safety check. Did anything go wrong? If so, just return 1 and stop the program. We ran out of memory for some reason. But these two lines now should look a little more familiar. This now is going to go ahead and install 1 and NULL into that structure as follows. So let's recap. This line here 20 is the same thing as allocating really a node that looks like this in memory that has two halves. One of those fields is called number, which I'll write there. The other field is called next. And then if we go back to the code, these two lines are all about just installing values in that structure. So if I go ahead to number and put the number 1, I'm not going to bother drawing anything for next, because I'm going to leave it implicitly as NULL. So that's what's going on now. What do I next want to do? Well, the last line of code here under this comment that says add number to list, I set list equal to n where n again is pointing at this new node. So that's the same thing as saying, well, list is going to go ahead and point at that new node. 
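The steps just described (allocate a node, check for NULL, install 1 and NULL, then point list at it) can be sketched as follows. This is not the lecture's list3.c verbatim: the helper name make_node is hypothetical, and it simply bundles the malloc, the safety check, and the two assignments into one function.

```c
#include <stdlib.h>

// A node has a number and a pointer to another node.
typedef struct node
{
    int number;
    struct node *next;
}
node;

// Hypothetical helper: allocate one node, returning NULL if memory runs out.
node *make_node(int number)
{
    // Ask for enough memory for one node
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return NULL;
    }

    // Install the number and an explicit NULL in the two fields
    n->number = number;
    n->next = NULL;
    return n;
}
```

With that in place, adding the first number amounts to starting with node *list = NULL, calling make_node(1), and, if the result is not NULL, setting list equal to it.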
So after those lines of code, I've created a picture in memory that effectively looks like this. Now, let's go ahead and add the number 2 to the list. It's almost the same. So here's the chunk of code that's going to go and add a second node to the list, this time containing 2. Let's do it step by step. Line 30, I'm going to reuse n as my temporary variable. So I don't have to re-declare it. It's the same n as before, but it's now going to get a different address of memory thanks to malloc. So that gives me another box like this that I'm going to go ahead and draw like that with nothing in it initially. I'm going to make sure per lines 31 to 34 that nothing went wrong. But that's just as before. And now in lines 35 and 36, I'm going to put 2 in there and NULL. So let me go over there and let me go ahead and put 2 in there. And I'm going to leave NULL blank implicitly. That's the end of the list. But now I, of course, conceptually have to link the node for 1 to the node for 2. And here's where C syntax, even though it's new, kind of finally makes sense. Notice here, I'm saying list arrow next equals n. That maps perfectly to the picture. List arrow next equals what? n. Well, n is this thing over here. So I just draw the arrow there. And so the code actually finally lines up even though it's new for today. So now I've drawn the picture as follows with 1 and 2. Let's go ahead and add a third and final node. This one containing the number 3, using these lines here. So line 40 gives me a new node with malloc. So that's going to give me a new node. I'll draw it as a rectangle over here. I'm drawing it left to right, but these things could be all over the place in memory. It doesn't matter where they end up. I'm going to go ahead and check as before that it's not NULL, just to be safe. Then I'm going to go ahead and install the number 3 and NULL in there just as before. So that means let's go ahead and draw 3. I'm going to leave that blank because it's going to be NULL. 
And then the last line, you wouldn't typically hard code this or write this explicitly in a program. This is a bit more verbose than you need to. Let me propose that you would probably use some kind of loop instead and walk through the data structure step by step as I proposed earlier. But if we really want to do this just for demonstration's sake, notice, start at list, follow an arrow and go to next. Follow another arrow and go to next. We can literally do that with our picture. So here we go. Let me start at list and follow an arrow and go to next. Follow an arrow, go to next. And now this is NULL. So what I want to update is exactly this, as with line 47, which said follow two arrows, look at two next fields interchangeably and then set it equal to n. All right, so what remains here? Well, this program's whole purpose in life was just to print a list out. Here's a way where you can actually use a for loop to iterate over a linked list. It's kind of funky because we don't have i and ints and i++ and so forth. But a for loop doesn't need to use integers or i's. Remember that before the first semicolon, you have initialization. In between the semicolons, you have a condition. And then you have an update that happens over here. So you'll get more experience with this with Problem Set 5 ultimately. But for today's purposes, high level, notice that this gives me a temporary pointer, like my big red hand earlier. That's a node star pointer. And that's why I was able to point with the big fuzzy hand. And I set that equal to list. So whatever the list was pointing at so was my temporary fuzzy hand. I'm going to follow the following loop and so long as temp does not equal NULL. So earlier when I was wearing the big fuzzy hand, I kept pointing, pointing, pointing. And I stopped once it equaled NULL. So this is saying keep doing the following until it equals NULL. What do I want to do? 
I want to just print out the integer that's inside of whatever I'm pointing at inside of its number field. So go to whatever I'm pointing at, follow the arrow, and go to the number field. That's how we get at the data inside. Once I've printed that out, for loops say that you just update a variable. So what is that variable? temp equals temp arrow next. So if my fuzzy hand is pointing at someone and I need to update it to point at temp arrow next, that means go to whatever I'm pointing at, follow the arrow. There's the next field and point at whatever the next field was pointing at. So you just keep updating what you're pointing at. That prints out the list. And then-- and we'll defer this ultimately to Problem Set 5-- we will need to free this memory. And actually you have to be a little clever about how you free memory, but I'm going to use a while loop there, which turns out to be a little cleaner, a little easier, ultimately to free all of this mess I made in my computer's memory. I kind of need to do the equivalent of freeing things, but I need to free what's behind me, not what's in front of me. Once you free memory, you should not touch it, traverse it, and so forth. But again more on that final note in P Set 5. All right, any questions on a high level on the code? It's fine if it looks quite new. We make it available so that you have a starting point when it comes to using this kind of code yourself. Any questions? All right, so someone came up during break and noted that this actually seems to be a regression in that arrays gave us the ability to resize, even though it was a little expensive because you got to copy everything from one place to another. But we had random access and therefore binary search and therefore logarithmic running time for things like searching sorted lists. We seem to have given that up. Linked lists give us dynamism where we can grow and shrink things without wasting time by moving things around. But we've lost random access. 
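Putting the pieces of this walkthrough together, here is a hedged sketch of the three loops involved: one that walks to the end of the list to add a node (the loop proposed in place of hard-coding list arrow next arrow next), the for loop that prints, and the while loop that frees. The function names append, print_list, and free_list are hypothetical and are not taken from list3.c.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

// Hypothetical helper: add number at the end of the list; false if malloc fails.
bool append(node **list, int number)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    n->number = number;
    n->next = NULL;

    // Empty list: the new node becomes the first node
    if (*list == NULL)
    {
        *list = n;
        return true;
    }

    // Otherwise follow the arrows until we find the node whose next is NULL
    node *tmp = *list;
    while (tmp->next != NULL)
    {
        tmp = tmp->next;
    }
    tmp->next = n;
    return true;
}

// Print every number: initialize tmp, test against NULL, update via next.
void print_list(const node *list)
{
    for (const node *tmp = list; tmp != NULL; tmp = tmp->next)
    {
        printf("%i\n", tmp->number);
    }
}

// Free every node, remembering the next address before freeing the current one
// so that freed memory is never touched again.
void free_list(node *list)
{
    while (list != NULL)
    {
        node *next = list->next;
        free(list);
        list = next;
    }
}
```

A caller would start with node *list = NULL, call append(&list, 1), append(&list, 2), and append(&list, 3), then print_list(list) and finally free_list(list).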
But you know what? Now that we have this ability using pointers and data structures to kind of stitch things together in memory, connect things with arrows, you know what, we can build fancier things. Most of you are probably familiar with the idea of a family tree, which is this hierarchical two dimensional structure. And indeed, that's our inspiration here. What if we don't just keep making one dimensional data structures, arrays that go left and right, linked lists that kind of go left to right? What if we actually use a vertical notion too and lay things out more interestingly. What can we gain from this? Well, let me propose that anytime we've seen an array, we can actually re-implement an array, but get the best of both worlds, the best of arrays, the best of linked lists as follows. Here is an array, back from Week 1 or even Week 0 when we were searching behind doors. And here, Week 2, when we were searching behind doors, let's go ahead and note that if we were to do binary search on this looking for some value, as before, many times you look in the middle first. And then you decide, do you go left or right? And if you go left or right, you'd look in the middle element over here or the middle element over here. And then what do you do? You go left or right, looking at the middle element over here or over here or over here or over here. You know what? Let me just kind of explode this picture because all of this is happening in one dimension. We can actually think of this as happening really in two dimensions. Let me draw my same array, 1, 2, 3, 4, 5, 6, 7, but let me represent it on different levels that's indicative of what's happening. I start in the middle. And I go left or I go right. I then go ahead and look at this element. And then I go left or I go right. So it's the same thing, but it's a two-dimensional rendition of what we've been doing for a few weeks whenever we've done binary search. Well, you know what this kind of looks like? 
It kind of looks like a linked list, albeit without the arrows. But you know what, I don't think I want to stitch this together from 1 to 2 to 3 to 4 to 5 to 6 to 7, because that's just going to be a linked list. But what if I use my new-found familiarity with pointers, use a few more of them? So I spend more space and stitch this data structure together in two dimensions conceptually. Every node represented here is a rectangle. It doesn't have to have just one pointer. There's nothing stopping me from creating a new struct, a new definition of node that has two pointers. Maybe it's called left. Maybe it's called right. Previously, we had just one we called it next. But there's nothing stopping us from creating a fancier structure that actually has two. And so we might make it look not like this as before for a linked list, but let's get rid of the next pointer. Let's make a little more room. And let's actually give myself two pointers, left and right. And I claim that this structure now in C could be used to implement the tree that I just described, the family-like tree, more properly called a binary search tree, in the following way. This is a binary search tree. One, because every node in the tree has at most two children, hence the bi in binary, meaning maximally two. It has zero children, as like these down here. Or it has maximally two children. Hence, the bi in binary search tree. It's a search tree in the sense that I have taken care with this data to sort things properly. Notice the following definition. For any node in the tree, every element to the left is smaller than it. And every element to the right is greater than it. That's a recursive definition, because watch, look at this node. Everything to the left of it is smaller. Everything to the right of it is larger. Let's look at 6. Everything to the left of it is smaller. Everything to the right of it is larger. 
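As a sketch of the definition just given, here is a node with two pointers, left and right, together with a recursive check of the ordering rule: everything to the left of a node is smaller and everything to the right is larger. The struct follows the description above; the function names is_search_tree and within are hypothetical, and the check is an illustration of the definition rather than code from the lecture.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

// A tree node: one number and two pointers instead of one.
typedef struct node
{
    int number;
    struct node *left;
    struct node *right;
}
node;

// Hypothetical helper: every number in this subtree must lie strictly
// between low and high. Going left tightens high; going right tightens low.
static bool within(const node *tree, long long low, long long high)
{
    if (tree == NULL)
    {
        return true;
    }
    if (tree->number <= low || tree->number >= high)
    {
        return false;
    }
    return within(tree->left, low, tree->number)
        && within(tree->right, tree->number, high);
}

// Hypothetical helper: true if the whole tree obeys the ordering rule.
bool is_search_tree(const node *tree)
{
    // Bounds one step outside the int range, so any int is allowed at the root
    return within(tree, (long long) INT_MIN - 1, (long long) INT_MAX + 1);
}
```

For the tree drawn from 1 through 7 with 4 at the root, 2 and 6 below it, and 1, 3, 5, 7 at the bottom, is_search_tree would report true; swapping any two numbers would break the rule somewhere.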
So it's recursive in the sense that no matter what node you look at, no matter what rectangle you look at, what I just said correctly is true of both the left child or subtree and the right child or subtree. So this is to say if you have a list of numbers, for instance, or a list of anything and you actually store them using nodes that look like this, but conceptually what you're really doing is stitching them together two dimensionally like this, guess what feature we just gain back? What have we just improved? I heard some murmuring over here. AUDIENCE: Binary search. DAVID MALAN: We've gotten back binary search. So we still have dynamism, like a linked list. We're still using pointers. And suppose we want to add the number 0 or the number 8, you could imagine 0 going over here, 8 going over here. So we could still just plug them in without having to move everything around like we would for an array. But because you're stitching things together with additional arrows wherever they are in memory, so long as you keep track of this data structure, called a tree, with one pointer to the so-called root-- the root being upside down in this world of computer science-- this is the root of this binary search tree, guess what you do if you're looking for the number 7? Well, you see 4. You know it's greater than 4. So what do you do? You move to the right, thereby ignoring the other half of this tree, just like the other half of the phone book in Week 0. Once you get to 6, you consider, I'm looking for 7. What do I know? It's got to be to the right. And so you go. The height of this tree happens to be logarithmic, for those familiar, log base 2 of n, which is to say I have 8 elements or 7 elements in this tree. But it only takes me 1, 2, 3 steps to find the value. It does not take big O of n, or a linear number of steps. 
And if you want your mind really to be blown here, it turns out this is actually the best application for recursion, which might have felt a little forced previously when we built Mario's pyramid with recursion or where you did factorial or product or sum or something like that in section recursively. It turns out that now that we have data structures that exist conceptually in two dimensions that are recursively defined-- and by recursively defined, I mean for any given node, left is smaller, right is bigger, and you can make that statement about any node in the tree-- watch what we can do in terms of implementing binary search. If I have here a function called search, whose purpose in life is to return true or false if the number 50 is in the tree. How do you search a tree? Well, it takes as input the tree. More specifically, it takes the address of the tree. More specifically, it takes the address of the root of the tree. That is when you want to search a tree, you literally just hand it the address of the very first tip top node called the root. And from there, you can get everywhere else. Just like with the list, we just need the beginning of the list. So how do I go about searching a tree? Well, let's consider the easy case first. Suppose the address you're handed is null, what should you do if you're looking for 50, but you're handed the empty address, zeros? AUDIENCE: Return false. DAVID MALAN: Probably return false, right. If I hand you no tree and I say is 50 in here, it's an easy answer. No, there's no 50, because there's no tree. So that's our base case, if you recall that nomenclature from our discussion of recursion. You hard code. You type in manually one explicit case that just gets you out of the program. Next case, if 50 is less than the tree, follow the arrow to the number field, then what do you know? 50 is less than the node you're looking at. What direction do you want to go conceptually? AUDIENCE: To the left. 
DAVID MALAN: You want to go to the left. So this line here searches the tree's left child, so to speak, in the family tree sense, the left subtree. So if we go back to the picture a moment ago, if I'm looking for 50 in that story-- or let's make it more real, if I'm looking for 1 in the current story, I see that 1 is less than the current node. So I go ahead and just search the left subtree. And notice, this is a tree. But so is this if you look at it in isolation. And so is this. And therein lies the recursive opportunity. So again here, if 50 is less than the tree's number, then go ahead and search the left. Else if 50 is greater than the tree's current number, search the right. Else logically what must be the case if the tree exists and it's not less than and it's not greater than the number you're looking at? It must equal the number you're looking for, 50, in which case we can return true. But you recall perhaps from Scratch, we don't really need that explicit case. We can just call it else instead. Any questions then on this use of code? We won't actually run this code. But this is how you can implement recursively an algorithm that is reminiscent of Week 0 searching for Mike Smith in the phone book, this time now searching a data structure that itself is recursive. All right, so what do we gain back in terms of running time, in terms of searching a binary search tree. To be clear, what's an upper bound on the running time? We're back to log n, which was the goal. And what about inserting into a binary search tree? This one we're going to defer to a higher level CS class, because it turns out you don't want to just go ahead and put 0 over there, and 8 over there, because if you keep doing that, putting smaller and smaller numbers or bigger and bigger numbers, you could imagine your tree getting very lanky, like very tall over here or maybe very tall over here and therefore not nearly as balanced as the tree we drew. 
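The recursive search walked through above can be sketched like this. One assumption differs from the slide: rather than hard-coding 50, this version takes the number to look for as a second parameter. The base case, the two recursive cases, and the final else mirror the description.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int number;
    struct node *left;
    struct node *right;
}
node;

// Return true if number is somewhere in the tree whose root is given.
bool search(const node *tree, int number)
{
    // Base case: no tree, so the number cannot be in it
    if (tree == NULL)
    {
        return false;
    }
    // Smaller than this node: only the left subtree could contain it
    else if (number < tree->number)
    {
        return search(tree->left, number);
    }
    // Larger than this node: only the right subtree could contain it
    else if (number > tree->number)
    {
        return search(tree->right, number);
    }
    // Neither smaller nor larger, so it must be equal
    else
    {
        return true;
    }
}
```

Each recursive call throws away one subtree, so on a balanced tree the number of calls grows with the height of the tree, roughly log base 2 of n, rather than with n.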
And so it turns out there are algorithms that let you keep a binary search tree balanced. So even as you add elements to it, you kind of shift things around. You don't remove them in memory. You just update the pointer, so that the data structure itself does not get terribly high. But that too is log n, which means we had arrays, which gave us binary search capabilities in logarithmic time. We then introduced the linked list, which gave us dynamism, the ability to grow and, if we want, shrink. But we sacrifice binary search. But if we spend a little more space and use not one pointer for every node, but two, we can actually tip the scales again, spend more space and save time by searching the data structure, this time using something logarithmic. All right, so what would the ideal, though, be? Every time we talk about running time, it feels like we want to be low on this list and not high. n squared was slow. Big O of 1 is constant time. That's fast. Wouldn't it be nice throughout this story if we actually found our way to a data structure that gave us constant time? Like, my god, if we could just insert something into a data structure with one step and find something in a data structure with one step, that's sort of the holy grail, so to speak, because you don't have to worry about big O of n or big O of log n. You just jump immediately to the value you want. Well, it turns out, theoretically there's something that allows you to achieve that called a hash table. But how you implement that is not necessarily obvious. And it takes some expertise. And indeed, in Problem Set 5 among the goals at hand is to implement exactly this notion of a hash table that lets you spell check a document super fast. A word processing program would be so slow if every time you wanted to check a word for whether it's spelled correctly or incorrectly, if you had to search linearly or even logarithmically a big dictionary file, it might actually be really slow to spell check a file. 
But using a hash table, we can probably do much better. A hash table is a combination of an array and linked lists inside of it. So I'm going to go ahead and just for convenience draw my array, this time vertically instead of horizontally. But it's the same thing. And it's just an artist's rendition anyway. And suppose the goal at hand is to keep track efficiently of, like, name tags. So maybe we're holding a big event. We've made some name tags in advance, which we indeed have. And we want people to be able to pick up these name tags super efficiently. It would be really annoying and pretty dumb if we just made a big stack of name tags, even if it's alphabetical, A to Z, then had everyone in the room line up and look through all of the darn name tags looking for their name. That's not a very efficient system. Fortunately, we've come prepared with some buckets, all of which are labeled, because wouldn't it be nice if you're looking for your name tag, you don't look through the whole darn list of name tags or stack? You actually just go to your bucket. And you jump instantly to your name, where hopefully you're the only person with a name that starts with some letter. And then you can just reach in and get it. Well, how do we implement this conceptually? Well, it's very common with a hash table if the inputs are things like words or names to look at the characters in those words to decide where to put those names or those name tags, if you will. So here's an array of size 26, from 0 to 25. But, you know what, it's convenient to think of this array as maybe being indexed from A through Z. So still 26 buckets, but this array is really just of size 26, 0 through 25 ultimately. And suppose the goal at hand now is to go ahead and store these name tags in advance. So this is what the staff and I would do in advance. And, Brian, if you wouldn't mind helping out with this. The goal at hand is quite simply to get the name tags ready for students to pick up. 
And so where do I want to go ahead and put the first one? So Albus is the first one whose name tag we made. I'm going to go ahead and jump immediately to bucket 0 and put Albus's name right there in one step. Meanwhile I've got Zacharias, and so even though it's taking me a bunch of steps to go over here, if this is an array, I have random access, as a human, and so I can immediately, instantly put Zacharias over there. It's a little laborious for my feet, but a computer could just jump to 0 or 25 or anything in between. All right, so Hermione-- maybe you're noticing the pattern-- so Hermione is going to be H, which is 7, which is going to be over here. Ginny is 6, which is over here. Ron is 17, which is over here. So think of each of my multiple steps taking actually one step. Fred is going to go over here. As an aside, the staff and I discussed this morning how we probably should've put the buckets closer together. But that's OK. Severus is going to go over here. Petunia is going to go over here. Draco is way over here, but doesn't matter, constant time, bucket 3. James is bucket 9. Cedric is bucket 2. Perhaps play this part in 2x speed. Luna is bucket 11. Neville bucket 13. Kingsley bucket 10. Kingsley, there we go. Minerva bucket 12. Vernon-- ironically, we don't actually need this many names to make the point we're trying to make. But Vernon-- we got a little carried away with the names we recognized. And now, the list is pretty full. All right, so that's a whole bunch of names. I filled up most of the buckets with a name tag. But-- why am I out of breath? But what's really convenient now is that if Cedric or Albus or Draco or Fred or Ginny come into the room, they can index instantly, randomly, to their bucket, get their name tag, and go. Nothing linear. They don't have to flip through the whole stack of name tags with which I actually began the story. But there's a problem ahead. 
We very deliberately ordered the name tags thus far in such a way that we don't create a problem for ourselves. But among the more famous characters we've not heard from yet is Harry. So Harry's name tag is still here. Where does this go? Well, Harry is going to go in bucket 7. But wait a minute, there's already someone there. So what do I do? If I were only using an array, Harry's kind of out of luck. Like Hermione is already in that location in the array. And we would have to decide, either Hermione goes there or Harry, but we can't just put them both. But if we implement this new data structure called a hash table using an array that's conceptually vertical, but that horizontally is a linked list, you know what, that's fine. We're just going to go ahead and link Hermione's and Harry's together. So, yes, it's going to take both of them or one of them at least two steps to find their name tag. But it's not going to take big O of n steps to find their name tag, at least if there's only two in this bucket. All right, Hagrid, dammit, so he came in the door too. So now that linked list is getting a little longer. We now have a chain, if you will, a linked list of size 3. Sirius is going to go over here in bucket 18. But Severus is already there too. Awkward. Remus is 17. Remus is going to go and link together with Ron there. George is going to go into bucket 6, which is over here. Lily is also going to collide, so to speak with Luna. And this is a collision in computer science. Anytime you have a value that you're trying to put in one place but there's something there, you need to resolve the collision somehow. So I'm proposing that we actually just link these together. Or as we're doing here, to bucketize values in computer science conceptually means to throw the value into a bucket, or physically as we've done here. Lucius finally is going to go in bucket 11 too. And lastly, Lavender goes in that same bucket. Phew. So thank you to Brian for helping choreograph that. 
So this structure that you're looking at is what is called a hash table. It is an array that you index into using what's called a hash function. A hash function is like any function that we've seen thus far, any program we've seen thus far-- something that takes input and produces output. So if we consider our original picture from Week 0 of what computer science in itself is when it comes to solving problems, a hash function for today's purposes is just this function, this process, this algorithm in between that decides, given a name tag, what bucket to put that name tag in. And quite obviously in the real world, what algorithm was I using to bucketize a name tag upon reading the name? AUDIENCE: First letters. DAVID MALAN: Looking at the first letter. Why? It's simple. It's pretty efficient. It means I can store a relatively small array of size 26 and just immediately put the name tags there. So in this case, we might have fed in Albus to that hash function. And it might return 0, representing A, if we're 0 indexing the array. Or for someone like Zacharias, we might get out 25 just because the first letter of his name is z. But this is kind of simplistic, right. And we've seen a problem. What is the problem with just looking, of course, at the first letter of the user's name? What problem arose? Yeah. AUDIENCE: There might be more than one name with the first letter. DAVID MALAN: There might be more than one name with the first letter. And you know in the extreme-- and computer scientists and software engineers often think about the extreme. What is the corner case? What could go wrong? What if by chance there's just a lot of characters in this universe whose names start with h or l, and maybe all of their names just happened to start with h or l? It doesn't matter how fancy your hash table is, it's pretty stupid if all of the name tags are stacked up in a bucket. 
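Here is a minimal sketch of the structure just demonstrated on stage: an array of 26 buckets, each of which is a linked list, indexed by a hash function that looks only at the first letter. Everything beyond that description is an assumption made for illustration: the names BUCKETS, LENGTH, table, hash, insert, lookup, and unload are hypothetical, the 45-character limit is arbitrary, new name tags are added at the front of a bucket's chain, and names that do not start with a letter are sent to bucket 0.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 26
#define LENGTH 45 // assumed maximum name length for this sketch

// Each bucket is a linked list of name tags
typedef struct node
{
    char name[LENGTH + 1];
    struct node *next;
}
node;

// The array of buckets; declared at file scope, so every pointer starts NULL
node *table[BUCKETS];

// Hash on the first letter only: A or a gives 0, Z or z gives 25
unsigned int hash(const char *name)
{
    unsigned char first = (unsigned char) name[0];
    if (!isalpha(first))
    {
        return 0; // assumption: anything else lands in bucket 0
    }
    return (unsigned int) (toupper(first) - 'A');
}

// Put a name tag in its bucket, chaining it in front of any already there
bool insert(const char *name)
{
    if (strlen(name) > LENGTH)
    {
        return false;
    }
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    strcpy(n->name, name);
    unsigned int i = hash(name);
    n->next = table[i];
    table[i] = n;
    return true;
}

// Jump straight to the right bucket, then walk only that bucket's chain
bool lookup(const char *name)
{
    for (const node *tmp = table[hash(name)]; tmp != NULL; tmp = tmp->next)
    {
        if (strcmp(tmp->name, name) == 0)
        {
            return true;
        }
    }
    return false;
}

// Free every chain in every bucket
void unload(void)
{
    for (int i = 0; i < BUCKETS; i++)
    {
        while (table[i] != NULL)
        {
            node *next = table[i]->next;
            free(table[i]);
            table[i] = next;
        }
    }
}
```

After inserting Hermione, Harry, and Hagrid, all three sit in bucket 7 as a chain of length 3, so finding any one of them takes at most three steps rather than a walk through every name tag.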
So in that sense, a hash table, even though this feels like it's pretty efficient, is, in the worst case, big O of n, when it comes to inserting and searching, because you could just get unlucky and get a huge stack of names that by nature of the class just all start with the same letter. So how can we mitigate this? How could we mitigate this? Well, you know what, rather than naively only looking at the first letter, let's leverage some probabilities here. Why don't we look not at just the first letter, but maybe the first two letters? I bet if we look at the first two letters we're not going to get as many collisions, as many people belonging to the same bucket. So Hermione, Harry, and Hagrid was a problem we identified earlier, not to mention a few other names. But that was because we were looking only at h for the hash function, only at the first letter in their name. What if instead we look at the first two, so we have a bucket for HA, HB, HC, HD, HE, HF? And so Hermione now goes in this bucket specifically. So we're going to need more buckets. And they're not pictured on the screen. And they're also not pictured here on stage. We need more than 26 buckets. Frankly, if we're looking at two letters, we need 26 times 26, like 676 buckets now. So more space, but we're hopefully going to decrease the probability of collisions. Why? Well, the next name I might put in here is Harry. He's going to end up in a different bucket this time. That's great, because it would seem that now I can get access to his name tag in constant time. Unfortunately, Hagrid is still in the story. And so we're going to have a collision with HA. So even looking at the first two letters is not ideal. So even though we have 676 buckets in this story, 26 times 26, which is a lot of buckets, we're still going to get collisions. So what would maybe the next evolution of this idea be? Well, don't look at the first letter, don't look at the first two letters. Why don't we look at the first three letters. 
Surely, that's going to drive down the probability. Unfortunately, that's going to drive up the number of cells in the array and buckets on stage to 10,000 plus buckets this time around. So that's a lot of buckets. But suppose we use not HA, but maybe HAA, HAB, HAC, HAD, HAE, HAF, HAG, dot dot dot, HAQ, HAR, HAS, dot dot dot, HEQ, HER, HES. So we have a lot of buckets and even more in between not pictured. Now we can go ahead and hash on Harry's name, Hagrid's name, Hermione's name. And this time, by design, they're going to end up in different buckets, which seems to be an improvement. And indeed, it is, because now if I go searching for Hermione or Hagrid or Harry's name tag, or they do themselves, they're going to be able to find it in constant time. But that's assuming there's not a lot of other kids with names starting with H. And so a hash table still technically is big O of n because you could just get unlucky and have a big pile up of similar inputs, all of which produce the same output, even if you're using a fancier hash function like this. And there's a trade off too. My god, we're using like almost 20,000 buckets now just to store these names to speed things up. At some point, you know, it's probably cheaper to just let Harry and Hermione and Hagrid form a line and find their name tag more slowly. So there's this trade-off of time and space. But if you have what's called an ideal hash function and you figure out some magical algorithm written in code that ensures uniqueness that no name tag will end up colliding with another, then you can achieve this holy grail of big O of one time, constant time for a hash table. So it's this sort of tension between how much space do you want to spend? How much effort do you want to spend figuring out what that ideal hash function is? So in the real world, and we'll see this in Python, most computer systems give you a best effort, such that a hash table is not big O of n usually. 
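The space-for-time trade-off in looking at one, two, or three letters can be sketched in code. This is an illustration only: the function names bucket_count and hash_prefix are hypothetical, and the scheme shown (treating the first k letters as a base-26 number, with a missing or non-letter character counted as A) is one reasonable way to get a bucket for HA, HB, and so on, not necessarily the one the lecture has in mind.

```c
#include <ctype.h>
#include <stddef.h>

// Number of buckets needed when hashing on the first k letters: 26 to the k
unsigned long bucket_count(unsigned int k)
{
    unsigned long count = 1;
    for (unsigned int i = 0; i < k; i++)
    {
        count *= 26;
    }
    return count;
}

// Hash on the first k letters by reading them as digits of a base-26 number.
// Assumption: if the name is shorter than k, or a character is not a letter,
// that position counts as 0, the same as the letter A.
unsigned long hash_prefix(const char *name, unsigned int k)
{
    unsigned long index = 0;
    size_t i = 0;
    for (unsigned int pos = 0; pos < k; pos++)
    {
        unsigned int digit = 0;
        if (name[i] != '\0')
        {
            unsigned char c = (unsigned char) name[i];
            if (isalpha(c))
            {
                digit = (unsigned int) (toupper(c) - 'A');
            }
            i++;
        }
        index = index * 26 + digit;
    }
    return index;
}
```

With k equal to 2 there are 26 times 26, or 676, buckets, and Harry and Hagrid still collide in the HA bucket. With k equal to 3 there are 26 times 26 times 26, or 17,576, buckets, and Harry, Hagrid, and Hermione land in three different ones.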
It's actually, on average, much, much, much faster, even though there's a theoretical risk that it can be slow. And more on that too in a higher level CS course where you explore data structures and algorithms more formally. So technically speaking, it feels like search could get down to big O of 1, constant time, if every name tag ends up in a unique bucket. But you could still get unlucky if there's a lot of H names or L names or the like. So technically speaking, a hash table is big O of n. But, frankly, three names in a bucket, like Hermione, Hagrid, and Harry, is much better than n names in the same bucket. So even in the real world if you get rid of this asymptotic hand waviness, that's faster. That's much faster than putting everything in a linked list or an array itself. All right, so from there, I bet we can try one other approach here. There's another data structure we want to present too, not in code, but in pictures. This one's called a trie. Short for retrieval, even though it's pronounced differently. A trie is a data structure that actually is pretty amazing. And it follows this pattern of spending one resource to save on another. A trie is going to use a lot more memory, but it is going to give us actual constant time lookup for things like names or words being inserted into the structure. So what does it look like? It's a little strong, because we need to leave room for ourselves on the board with lots of memory. A trie is a tree, each of whose nodes is essentially an array. So notice the pattern here. Computer scientists over time have been kind of clever taking this idea, this idea, mashing them together and creating some monster data structure, but that gives you some savings of time or space. So this array at the very top represents the root of this trie, which again is a tree whose nodes are arrays. And notice that the array is of size 26, for the sake of discussion, A through Z, or 0 through 25. A trie does this. 
If you want to store a name in a trie, what you do, in this case, is look at every letter in the word in question. So for Harry it would be H-a-r-r-y. We're not just looking at the first, the second, and third. We're looking at all of them. And what we do is this. Suppose the first letter in the person's name or their name tag or the word more generally is an H. You go ahead and go to that index. And if there's no child node, there's no tree yet below it, another branch, if you will, you allocate another node. And another node just means another array. And so we've drawn two arrays on the board. This now has the letter A highlighted. All of the letters are technically there, because it's of course 0 through 25. But we're only highlighting the letters we care about for the sake of this example. Here is H-a-g. So it looks like the first name tag I'm trying to install into this data structure is Hagrid. Notice now that g is inside of that array. I want to go now to r for Hagrid. That gives me another array. Now i, now d. d is the end of his name. So I'm going to just color in green, or I can use like a Boolean flag in C code that just says someone's name ends here. So notice, I've implicitly stored Hagrid's name now in this data structure by storing one node, that is one array, for every letter in his name. But there's this slight efficiency here because there's other people in this story besides Hagrid whose names are prefixes or share common prefixes. So, for instance, suppose I want to install Harry into this data structure. He is H-a-r-r-y. And so that gives me a couple more nodes. And if I go ahead now and install Hermione in this, notice now I have even more nodes in the tree. But some of them are shared. If you start at the very top and look at the H, notice that Hagrid and Harry and Hermione all share at least one node in common. Now, what's cool about this ultimately?
So what is the running time of searching for someone in this data structure if there's n people already in it? Right now n equals 3 because there's three people in it, even though there's a lot of nodes. But what's the running time for searching this data structure to see has Harry picked up his name tag already? Has Hermione picked up hers? Has Hagrid picked up his? Well, how many steps does it take to find Harry or Hermione or Hagrid in this data structure? For Harry, it's H-a-r-r-y. So it's five steps maximally. For Hagrid it's H-a-g-r-i-d. It's six steps maximally. And H-e-r-m-i-o-n-e, eight steps total. And it's probably the case that if we read through the books, there is going to be some upper bound on the length of someone's name. I don't know what it is. It's probably 20 characters. Maybe 30 if it's crazy long. But there is some fixed value. Anytime you have a fixed value, that's what you by definition in CS and in math call a constant. If it's 20, it's 30, it doesn't matter. But it's fixed. People's names aren't growing every year in length. There's some hard upper bound. And so technically, if it only takes you five steps or six steps or eight steps to find Harry or Hagrid or Hermione, that is technically constant time or, as we've said, big O of 1. So we can actually then achieve, truly for searching this data structure, for inserting into this data structure, truly what we call big O of k where k is some constant. But big O of a constant is the same thing, asymptotically, per our discussion in Week 3, as big O of 1. These are effectively constant time, because to find Harry, you look only at H-a-r-r-y. It doesn't matter if there's 1 million other characters in that trie already. It doesn't matter if there's Hermione and Hagrid and everyone else from the seven books in the data structure, because the only nodes you're looking at are the ones representing H-a-r-r-y. And that's a powerful thing.
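As a rough illustration of the trie just described, here is a minimal sketch in C. It is not the lecture's own code: the names `trie_create`, `trie_insert`, `trie_search`, and `trie_unload` are hypothetical, and the sketch assumes names contain only the letters A through Z (compared case-insensitively) in an ASCII-like character set. Each node is an array of 26 pointers plus the Boolean flag that marks where someone's name ends.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET 26

// One trie node: 26 child pointers (A through Z) plus an end-of-name flag.
typedef struct node
{
    bool is_name;
    struct node *children[ALPHABET];
} node;

// Allocates an empty node; returns NULL if memory runs out.
node *trie_create(void)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return NULL;
    }
    n->is_name = false;
    for (int i = 0; i < ALPHABET; i++)
    {
        n->children[i] = NULL;
    }
    return n;
}

// Inserts a name, allocating one node per letter where none exists yet.
// Returns false on a non-letter character or an allocation failure.
bool trie_insert(node *root, const char *name)
{
    node *cursor = root;
    for (int i = 0; name[i] != '\0'; i++)
    {
        if (!isalpha((unsigned char) name[i]))
        {
            return false;
        }
        // Assumes letters are contiguous, as in ASCII
        int index = tolower((unsigned char) name[i]) - 'a';
        if (cursor->children[index] == NULL)
        {
            cursor->children[index] = trie_create();
            if (cursor->children[index] == NULL)
            {
                return false;
            }
        }
        cursor = cursor->children[index];
    }
    cursor->is_name = true;
    return true;
}

// Follows one pointer per letter, so the number of steps depends only on
// the length of the name, not on how many names are stored.
bool trie_search(const node *root, const char *name)
{
    const node *cursor = root;
    for (int i = 0; name[i] != '\0'; i++)
    {
        if (!isalpha((unsigned char) name[i]))
        {
            return false;
        }
        int index = tolower((unsigned char) name[i]) - 'a';
        if (cursor->children[index] == NULL)
        {
            return false;
        }
        cursor = cursor->children[index];
    }
    return cursor->is_name;
}

// Frees every node below and including n.
void trie_unload(node *n)
{
    if (n == NULL)
    {
        return;
    }
    for (int i = 0; i < ALPHABET; i++)
    {
        trie_unload(n->children[i]);
    }
    free(n);
}
```

With Hagrid, Harry, and Hermione inserted, searching for Harry visits only the nodes for H-a-r-r-y, while searching for "Har" fails because no name ends at that node.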
Every other algorithm we've discussed thus far, certainly for searching and sorting, has somehow been slowed down by how many other names or numbers are in the data structure. That is not the case for this one here. However, there is a price being paid. What appears to be the price that we're paying to gain that really low running time? AUDIENCE: Memory. DAVID MALAN: Memory. I mean, my god, it barely fits on the slide. And this is just three names. You're spending a 26-element array's worth of memory to store one character. Now there's some optimizations. Over time, if you insert a lot of names, some of these nodes will be shared. But this is a very wide, very dense data structure, so to speak, because it's using so much memory to give you that super amazing running time of theoretically constant time. So again this theme of trade-offs is going to persist in the remaining weeks of the semester where to gain one resource, we're going to have to spend another. So that there is a trie. So it turns out that now that we have arrays and linked lists and trees and tries and hash tables and yet other data structures out there, we can actually implement what are called abstract data structures, using any of those as building blocks. What we've kind of done today verbally and pictorially is invent more of those pink puzzle pieces in Scratch, those custom puzzle pieces. Now we have as building blocks arrays and linked lists and trees and hash tables that we can use to solve other problems. And one of the problems out there in the real world is something called a queue. A queue, certainly in certain cultures, immediately comes to mind. What's a queue in the real world or an example thereof? AUDIENCE: A line. DAVID MALAN: So a line, right, lining up at a store or a restaurant or a takeout place. So a queue actually has a technical meaning in computer science too. It's a data structure that is FIFO, first in, first out.
A queue, by definition should have people hopefully pleasantly lining up one person in front of the other. And it maintains this FIFO property, first in, first out, such that if I'm at the front of the line I am going to be served my food first and then the person behind me and then the person behind them. It'd be really obnoxious if you walked up to Tasty Burger, placed your order, and whoever showed up most recently got their food first. That would be an opposite data structure. That's LIFO. Last in, first out. Not fair in the real world. So you might hope then that the software that companies like Tasty Burger use when they type in your order to the system actually send those orders to the team working in back cooking the food in a queue fashion, because it'd be pretty obnoxious too if people behind you were getting their food first. So hopefully in software, you're implementing that real world notion of a queue as well. Printing, if you still print on campus sometimes, papers and whatnot on printers, they're often shared printers on campus. And so they have what are called printer queues. You might go to Command-P or Control-P print, but then hopefully, in fairness, if there's 10 people who are trying to print to the same Harvard printer, they are printed in the order in which they were requested. It would be pretty obnoxious, again, if the order were flipped. Well, it turns out with queues in computer science, there's two fundamental operations, even though we humans don't really think in these terms, enqueue and dequeue. To enqueue means to get in line. To dequeue means to get out of line, hopefully once you've been served your printouts or your food or whatnot. Using today's principles, arrays, linked lists, you could probably imagine conceptually using them as building blocks to implement this notion of a queue. 
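To make enqueue and dequeue concrete, here is a minimal sketch in C of a queue built on a linked list. It is an illustration under assumptions rather than any particular system's code: the type names `qnode` and `queue`, the use of an `int` order number as the stored value, and the function signatures are all hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

// One person (or order) waiting in line.
typedef struct qnode
{
    int order;
    struct qnode *next;
} qnode;

// The queue remembers both ends so that joining and leaving are each one step.
typedef struct
{
    qnode *front;
    qnode *back;
} queue;

// Gets in line at the back. Returns false if memory runs out.
bool enqueue(queue *q, int order)
{
    qnode *n = malloc(sizeof(qnode));
    if (n == NULL)
    {
        return false;
    }
    n->order = order;
    n->next = NULL;
    if (q->back == NULL)
    {
        // Line was empty, so the newcomer is also at the front
        q->front = n;
    }
    else
    {
        q->back->next = n;
    }
    q->back = n;
    return true;
}

// Serves whoever is at the front (first in, first out).
// Returns false if the line is empty.
bool dequeue(queue *q, int *order)
{
    if (q->front == NULL)
    {
        return false;
    }
    qnode *n = q->front;
    *order = n->order;
    q->front = n->next;
    if (q->front == NULL)
    {
        q->back = NULL;
    }
    free(n);
    return true;
}
```

Starting from an empty queue (both pointers `NULL`), enqueuing orders 1, 2, and 3 and then dequeuing three times hands them back in that same order, which is the FIFO property.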
The software that Tasty Burger or any fast food place uses probably has implemented in code some lines that are using an array that's maybe being dynamically resized or better yet a linked list that's growing and shrinking as people are placing orders and getting orders. So there's this one-to-one mapping between some of today's ideas and even the real world as well. There's kind of the opposite data structure that I referred to a moment ago. And these are generally known as stacks. Stacks in the real world might be in the dining hall, right? Like, here is a stack of trays. And they have this fundamentally different property, such that if the staff go ahead and clean the trays and put them right here, it would be pretty obnoxious if to get your tray you had to go through in a FIFO fashion and get the first tray they put down and take that out first. No one does that, realistically. If you've got a big stack of trays in the dining hall, you probably enforce a LIFO order, last in, first out. So if this was the most recently installed or cleaned tray, you're probably, as the human, just going to take the top one, even though that's not really fair to the trays below. But it doesn't matter in this particular case. So a stack gives you the opposite property. And where else might you see these? Well, your Gmail inbox. If you use Gmail, your inbox, most likely by default, is configured as a stack. Where do your most recent emails end up? AUDIENCE: At the top. DAVID MALAN: At the top. Now, this is wonderful because it's a feature in that you always see your newest mail. Kind of a downside, though, to your friends who've emailed you an hour ago and whose emails are now down here or on page 2 of your email. So there's trade-offs here too. Stacks might have desirable properties, like just get your tray quickly, see your most recent email. But if you're like me, as soon as email falls onto page 2, you might never get back to it if the stack of trays never actually gets exhausted.
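The last-in, first-out behavior of a stack can be sketched in C with a fixed-size array. This is only an illustration under assumptions: the `stack` type, the capacity of 10, and the use of an `int` to identify each tray are hypothetical, and the function names `push` and `pop` follow the usual convention for adding to and removing from the top.

```c
#include <stdbool.h>

#define CAPACITY 10

// A stack of trays: size counts how many are currently on the pile.
typedef struct
{
    int trays[CAPACITY];
    int size;
} stack;

// Puts a tray on top. Returns false if the stack is already full.
bool push(stack *s, int tray)
{
    if (s->size == CAPACITY)
    {
        return false;
    }
    s->trays[s->size] = tray;
    s->size++;
    return true;
}

// Takes the most recently added tray off the top (last in, first out).
// Returns false if the stack is empty.
bool pop(stack *s, int *tray)
{
    if (s->size == 0)
    {
        return false;
    }
    s->size--;
    *tray = s->trays[s->size];
    return true;
}
```

Pushing trays 1, 2, and 3 and then popping three times returns 3, 2, and 1. If new trays keep arriving before the pile is emptied, tray 1 at the bottom is never reached.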
Frankly, there might be in some dining hall on campus some tray way down here that has never been used in years, because they keep refilling the stack, and we keep taking from the top. So that would be a bad property for a lot of contexts, but not necessarily all. So it turns out there's another data structure too-- oh, and the operations there are not called enqueue and dequeue. By convention they're called push and pop, where this means pushing an element onto the stack. Even if it's very gentle, that's pushing. Popping means removing the top element. So it's just nomenclature meaning adding and removing elements. But there's one other data structure we'll give mention to today. And that's known as a dictionary. And we'll see this again in a couple of weeks when we look at Python. A dictionary is the abstraction that you can get on top of a hash table. This hash table literally involved physical buckets and in code would involve arrays and linked lists. That's like low level plumbing. A dictionary more generally in computer science is a data structure that has keys and values, words and values, words and page numbers, anything that maps one thing to another. Physical dictionaries in the human world, like an English dictionary, have lots of words. And if a word is correctly spelled in your document, it will be in that dictionary. And if you have a typo, a misspelling, in your document, it will not be in that dictionary. So wouldn't it be nice if you could actually implement a dictionary using maybe a hash table, but a smart hash table that has plenty of buckets, so that you can answer a question, is this a word, is this a word, super fast without having a whole stack of name tags or, in this case, English words all in the same bucket. And, in fact, that's the challenge for Problem Set 5. We're going to give you a big text file with 140,000 plus English words.
And the goal for you is to implement a hash table with your choice of number of buckets, your choice of hash functions, and implement this notion of an array with linked lists that stores those 140,000 plus words. Dictionaries, though, do exist in the real world. And taken last night at like 9:00 PM before Sweetgreen closed in Harvard Square was this photo. If you've ever ordered a salad at Sweetgreen, they have a pretty clever optimized system so as to pick up your salad. If you order on their app in advance, they go ahead and put your salad under D for David, for instance, or B for Brian and so forth. So that when you go into the store, you don't have to look through big O of n other salads. You can jump immediately to the B section, the D section, or any other section and get your salad. Now, in the extreme case, maybe Harry and Hermione and Hagrid all order at the same time. So there's just a big stack at the H's. So it's technically still big O of n. But if you assume a nice uniform distribution of names, this probably does work out pretty well, especially since the salads, by design, aren't there very long. But let's use our final minutes together to take a look at one final visual, one made by some of our other friends online who put together an animation that tells the story of the differences between stacks and queues personified as follows. [VIDEO PLAYBACK] NARRATOR: Once upon a time, there was a guy named Jack. When it came to making friends, Jack did not have the knack. So Jack went to talk to the most popular guy he knew. He went up to Lou and asked, what do I do? Lou saw that his friend was really distressed. Well, Lou began, just look how you're dressed. Don't you have any clothes with a different look? Yes, said Jack, I sure do. Come to my house and I'll show them to you. So they went off to Jack's. And Jack showed Lou the box where he kept all his shirts and his pants and his socks. Lou said, I see you have all your clothes in a pile.
Why don't you wear some others once in a while? Jack said, well, when I remove clothes and socks, I wash them and put them away in the box. Then comes the next morning and up I hop. I go to the box and get my clothes off the top. Lou quickly realized the problem with Jack. He kept clothes, CDs, and books in a stack. When he reached for something to read or to wear, he chose the top book or underwear. Then when he was done, he would put it right back. Back it would go, on top of the stack. I know the solution, said a triumphant Lou. You need to learn to start using a queue. Lou took Jack's clothes and hung them in a closet. And when he had emptied the box, he just tossed it. Then he said, now Jack, at the end of the day, put your clothes on the left when you put them away. Then tomorrow morning, when you see the sun shine, get your clothes from the right, from the end of the line. Don't you see, said Lou, it will be so nice. You'll wear everything once before you wear something twice. And with everything in queues in his closet and shelf, Jack started to feel quite sure of himself, all thanks to Lou and his wonderful queue. [END PLAYBACK] DAVID MALAN: All right, that's it for CS50. We'll see you next time.
CS50 2019 - Lecture 5 - Data Structures