SPEAKER 1: All right, this is CS50 and this is week five. And let's take a look at where we left off last time. You may recall this guy here, Binky from our friends at Stanford. And we used Binky to start talking about pointers. What is a pointer? So, a pointer is just an address, the location of some piece of data in memory, because recall at the end of the day your computer just has a few pieces of hardware inside of it, one of which is RAM or Random Access Memory. And in RAM you have the ability to store bunches and bunches of bytes, or kilobytes, or megabytes, or gigabytes, depending on how much memory you have. And if you assume that no matter how much RAM you have you can enumerate the bytes-- this is byte 0, this is byte 1, this is byte 2, and so forth-- you can give each of the bytes of your computer's memory an address, and those addresses are simply called pointers. And now in C we have the ability to use pointers both to go to any location in memory that we want and even to dynamically allocate memory in case we don't necessarily know a priori how much memory we might need for a program. Now, in terms of your computer's RAM, recall that we divided the world into this picture here whereby if this rectangular region, arbitrarily, represents your computer's memory, here is how the computer divvies it up when you're actually using a program. At the bottom of your computer's area of memory, you have the so-called stack. And recall that the stack is where any time you call a function, it gets a slice of memory-- a frame of memory, if you will-- for all of its local variables, all of its arguments, and anything else that it might need. On top of that might go another slice or frame of memory if that first function calls another. And if that second function in turn calls another function, you might have a third frame on the stack. Of course, this doesn't end well if you keep calling function after function after function after function. And so, hopefully you don't accidentally induce some kind of infinite loop such that these frames pile on top of each other infinitely many times, because eventually they'll run the risk of hitting the heap. Now, the heap is the same type of physical memory. You're just using it in a slightly different way. The heap is used any time you want to dynamically allocate memory, when you don't know in advance how many bytes you need but you do know once the program is running how many you now want. You can ask the operating system, via functions like malloc, for some number of bytes, and those bytes are allocated from the heap. So, those bytes, too, have addresses or numbers. And so, the operating system, by way of malloc, just figures out which of those bytes are not yet being used so that you can now put whatever piece of data you have in that particular place. Now, beyond that appear things like initialized data and uninitialized data. That's where things like global variables, initialized or not, that might be outside of your main function end up. And then above that is the so-called text segment, which is the zeros and ones that actually compose your program. So when you double click an icon on Windows or Mac OS to run a program, or you type dot slash something in the Linux command line environment in order to run a program, the bits that compose your program are loaded also into memory, up into this region here.
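As a quick refresher on the syntax for all of this, here is a minimal sketch of taking and following an address in C, using nothing beyond last week's operators:

    #include <stdio.h>

    int main(void)
    {
        int n = 50;    // an int living somewhere in memory
        int *p = &n;   // p is a pointer: it stores the address of n

        printf("%p\n", (void *) p);  // print the address itself
        printf("%i\n", *p);          // dereference: go to that address and print 50
    }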
So, at the end of the day, you have access to just pretty generic memory, but we use it in these different ways. And it allows us to ultimately solve problems that we might not have been able to in the past. Recall for instance this example here, deliberately shown in red because it was buggy. This does not work. Now, logically, it does do the swap that we intend whereby a goes into b and b goes into a. And we achieve that result by way of this temporary variable so that we have a temporary placeholder into which to store one of those values while doing the swap. But it had no permanent impact on the two variables that were passed into it. And that was because by default in C any time you pass arguments to a function, those arguments are passed, so to speak, by value. You get copies of those values being passed into a function. And so, if main, for instance, has two variables, x and y-- as it did last time-- and you pass x and y into a function like this one here, swap, x and y are going to get copied as a and b respectively. So you might perfectly, logically, correctly swap a and b, but you're having no permanent impact on x and y themselves. But what if, per this green version here, we reimplement swap to be a little more complicated looking, but at the end of the day actually correct? Notice now we've declared a and b not to be integers but to be pointers to integers, the addresses of integers. And that's what's implied by the star that we're putting right there before the variable's name. Meanwhile, inside of the body of this function, we still have three lines of code. And we're still using a temporary variable, and that in itself is not a pointer. It's just an integer as before, but notice we're using this star notation again, albeit for a different purpose, to actually dereference these pointers. Recall that int star a and int star b means give me a variable that can store the address of an integer. That's declaring a pointer. Meanwhile, if you just say star a without declaring something to the left of it with a data type like int, you're saying go to the address that is in a. So if a is an address, star a is what's at that address, which of course per its declaration is going to be an integer. Similarly, star b means go to the address in b, and star a means go to the address in a, and we put the former into the latter, ultimately putting the value of temp at the address in b-- so absolutely more complicated at first glance, but if you consider again the first principles of what's going on here, all we are doing is moving things around in memory. And we can do that now because we have the ability to express the locations, the numeric locations, of where things are in memory. But nicely enough, we, the programmer, don't have to care where things are in memory. We can access things symbolically, as we're doing here with a and b. So even though we might have seen on the screen or you might see while debugging actual addresses of memory, rarely does that actually matter in practice. We can deal with everything we've learned thus far symbolically. Now, last time we also took a look at the world of forensics, and we took a look at how images are implemented, and specifically file formats like BMP, and JPEG, and GIF, and yet others. And we glanced into her eyes here as we tried to enhance this image, but of course, there was only a finite amount of information. So, what you see is what you get in terms of any kind of glint of a suspect in her eyes.
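In code form, that corrected, green version of swap looks like this, together with a small main to drive it, per last time:

    #include <stdio.h>

    void swap(int *a, int *b);

    int main(void)
    {
        int x = 1;
        int y = 2;
        swap(&x, &y);                    // pass the addresses of x and y
        printf("x: %i, y: %i\n", x, y);  // now prints x: 2, y: 1
    }

    // swap the two ints at the addresses in a and b
    void swap(int *a, int *b)
    {
        int temp = *a;  // go to the address in a; save that value
        *a = *b;        // put the value at b's address at a's address
        *b = temp;      // put the saved value at b's address
    }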
But we took that look in part so that we could also introduce another feature of C that allows us to declare our own data types, indeed our own data structures. For instance, we proposed that if you wanted to write a program that stores a student, you could actually declare your own student data type inside of which is a name and inside of which is a dorm, and anything else that you might actually want. Meanwhile, this syntax here gives us a new data type called student so that if we want to write a program that implements students, we can actually wrap related information together, like name and dorm, without having to maintain a whole bunch of strings for just names and a whole bunch of strings for just dorms. We can actually encapsulate things all inside of one structure. And indeed encapsulation is another principle of computer science that you'll see throughout programming and throughout the field itself. So, what do we now do this time? So, today we introduce more sophisticated ingredients with which we can solve problems, and we revisit a problem from the past that we thought we had rather knocked off and solved. So, this might represent a whole bunch of names, a whole bunch of numbers, a whole bunch of telephone numbers in a phone book, back to back to back to back, stored in this case in the form of an array, the simplest of data structures, so to speak, that we've discussed thus far. And an array, again, is a contiguous block of memory whose elements typically are of the same data type-- integers, or strings, or the like-- and they are by definition back to back to back to back, which allows you random access. Which means you can jump to any of these locations instantly just by using, in C, that square bracket notation or, as we saw last time, using pointer arithmetic, actually using the star operator and maybe adding some number to an address to get at some subsequent address. But it turns out there's a few problems with this fundamental approach. Nice and simple as it is, it would seem that we rather paint ourselves into a corner with this approach. This array has 1, 2, 3, 4, 5, 6 total elements, at least as depicted here. So that's fine if you want to insert a number, and then another number, and then four more numbers. But what if you want to then insert a seventh number, not to mention an eighth number or a ninth number or the like? Well, where do you put them? Well, you might think, well, that's fine. I'm just going to go put the seventh number over here, or the eighth number over here, or the ninth number over there. But you can't just blindly do that. This memory is being managed not by you per se but by malloc and by the computer itself, and this memory over here, while it might physically exist, might be used by some other part of your program altogether. It doesn't necessarily belong to you unless you've asked for it. And the problem with an array, as we've seen, is that typically you declare its size in advance, as with the square bracket notation, and say give me six integers or give me six something or others, but that's it. You have to decide in advance. You can't just grow it thereafter as you can in some programming languages. You've rather painted yourself into a corner. But with malloc and other functions like we saw last time, you can actually allocate more memory.
Unfortunately, it might end up in another location in your computer's memory, so you might have to do some copying to take the original six elements and move them elsewhere just to make room for more. And there is a function for that, something called realloc, or reallocate. And indeed it can do exactly that. It can give you a bigger chunk of memory and reallocate what was previously there into a larger size. But you have to do a little bit of work. You have to invoke it in order to achieve that. You can't just blindly keep adding things at the end of this array. Now, unfortunately, while a solution, that might not be very efficient. Even if you can allocate a bigger chunk of memory that's bigger than six because you have more numbers, for instance, to store, what if that takes a bit of time? And indeed it's going to. If you allocate more integers somewhere else in memory, you still have to copy those original values, and now it just feels like you're wasting time. Now, instead of just inserting things into the list, you might have to copy it into a bigger space, reallocate things, grow. It's a lot more work. And all of that discussion of running time and performance comes back into play, because if that whole copying process and reallocating is costing you time, your algorithm or your program ultimately might not really be as fast as you might want it. So, what could we do instead? What could we do instead in order to solve this problem dynamically, so to speak, that being the operative word? And luckily enough, last week we learned that there is dynamic memory allocation in C by way of that function malloc. And we also learned that there are ways of representing structures in C that you don't necessarily get with the language itself, because they're not primitives. They're not built in. In other words, let me propose this as a solution to our problem. This is a list of, let's see, five numbers it would seem: 9, 17, 22, 26, and 34. Pretty arbitrary right now, but you might imagine very simply drawing those same numbers-- 9, 17, 22, 26, 34-- in the form of an array, and they're clearly deliberately sorted. But again, what if you wanted to grow that array or even shrink that array dynamically over time? Well, let me propose that we not draw those numbers back to back to back to back, literally next to each other, but allow ourselves potentially a little bit of space. So nine might be here in my computer's memory, and 17 here, and 22 here, or over here, or over here. In other words, what if I relax the constraint that my numbers, or my data types more generally, have to be stored contiguously, back to back to back to back, in memory and instead allow them to be anywhere-- indeed, anywhere a function like malloc wants to give me more memory, that's fine. If it wants to give me memory up here in my computer, I'll deal with that. If it wants to give me extra memory over here, that's fine. I'll deal with it, because I'll use these conceptual arrows to stitch together my data structure this time. And now, where have we seen these kinds of arrows before? What feature of C allows us to connect one thing to another, whereby, a la Chutes and Ladders, we get from one place to another? Well, that's exactly what we saw last time, which was pointers. While we've drawn these here, per the snippet from a textbook, using arrows, those are really just pointers. And what does each of these rectangles represent?
Well, clearly a number in the top half of the rectangle, but I claim that the bottom half of these rectangles-- let's consider that bottom rectangle to just be another piece of data, specifically a pointer. But not quite an int star, because it seems to be pointing not just to the number but to this whole rectangle, so I need some new terminology. I need some kind of structure to contain an integer and this pointer. And for that, I think I'm going to need a struct. And indeed let me propose that to solve this problem we give ourselves this building block as a new C data type called a node. You can call it anything you want, but the convention would be to call something like this-- a puzzle piece or a building block in a data structure-- a node. Let me propose that we define it as follows. I'm using that same syntax from last time with which we declared a student data type, but here I'm saying inside of this data structure, this node, shall be an int. And that's pretty straightforward. Just like a student might have a name and a dorm, this node will have an int called n, arbitrarily. And then the only piece of detail that's a little bit new now is the second line: struct node star next. Now, what does that mean? It's pretty verbose, but struct node is just recursively, if you will, referring to this same type of data structure. Star means this is going to be a pointer, the address of one such thing, and next is just an arbitrary but pretty reasonable name to give to such a pointer. So this line here, struct node star next, is the incantation in C with which you declare one of those arrows that will point from one node, one rectangle, to another node, another rectangle. And the fact that we have a little bit of additional verbiage up here, typedef struct node, is because again C is a language that is read top to bottom, left to right, so words have to exist before you actually use them. So, whereas last time when we declared a student, we didn't actually mention struct student or anything like that-- we just said typedef, open curly brace-- today, when declaring a node, we actually have to have some additional syntax here, just called struct node. And technically this word could be anything, but I'll leave it as node for consistency. And that allows me, inside of this definition, to specify that the second data member is going to be a pointer to exactly that kind of data structure. But typedef, just to be clear, allows me to type a smaller name for this data structure here, which I will simply call node at this point. So, what can we actually do with this kind of data structure now? And, indeed, let's give this data structure a name. Let's start calling it a linked list. Previously, we had arrays, but now we have linked lists, both of which at the end of the day are types of lists, but linked lists, as the name suggests, are linked or threaded together using pointers. Now, when you have a linked list, what might be some operations, some algorithms, that you might want to run on them? Well, if you've got a linked list of, say, numbers, for the sake of discussion, you might want to insert a new number into that list. You might want to delete a number from that list, and you might want to search that list. And that allows us to then consider how we might implement each of these kinds of things. But it turns out, while fairly simple intuitively, we're going to have to be a little careful now by way of our pointers.
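Concretely, that building block, exactly as just described, looks like this in C:

    // a node: one integer plus a pointer to the next such node
    typedef struct node
    {
        int n;
        struct node *next;
    }
    node;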
So, let's more formally declare a linked list to look something like this. It's a collection of nodes that are linked together with pointers, as represented by these arrows here, but we're going to need some special pointer, at least at the beginning of the list. Let's just call it first. It doesn't itself store an actual integer. first is just a pointer to the start of the list. And by way of that pointer can we access the first actual node in the list. From there can we get at the second, from there can we get at the third, and the fourth, and the fifth, and any number of others. And this syntax over here might just represent null. Because you don't want to have that last pointer just pointing off into no man's land, that will have to be a null pointer, so that if we check for that with a condition we know, OK, we're at the end of the list. So, let's pause for just a moment and consider these three algorithms-- insert, delete, and search-- and consider what's going to be involved. Well, how would you go about searching for an element of this list? Suppose I wanted to find the number 22. What do you do? Well, me, I, the human, can just look at this and be like, all right, 22 is right there. But a computer can't do that. A computer, as every time we've had this discussion, can only look at one thing at a time. But moreover the computer this time is even more constrained, because it can't just use our old friend binary search or divide and conquer, because how do you get to the middle of a linked list? Well, you have to find your way there. The only thing you have in a linked list from the outset is one pointer, called first or whatever it is, but one pointer that leads you to the beginning of the list, the first node in the list. So, if you want to get to the second node in the list, you can't just go to bracket one, or bracket two, or bracket three to get at any number of other elements in the list. You have to follow these bread crumbs, if you will. You have to follow these arrows, or these addresses, to go from one node's address to the other to the other. And so, we've paid a price already. And we'll see that there is still an advantage here, but what's the running time? What's an upper bound on the running time of search for a linked list, even if it is sorted? Any thoughts? Is it constant time, like big O of 1? Is it log of n? Is it n, n squared? What's the running time going to be? Well, they're sorted, and that was this magical ingredient, this assumption we've been allowed to make in the past which was helpful, but that assumed that we had random access. In C, we had square bracket notation, so that using some simple arithmetic we could jump roughly to the middle, and then the next middle, and the next middle, looking for Mike Smith or whatever element it is we're looking for. Unfortunately here, one price we have already paid by taking this step toward linked lists is linear time. Big O of n would seem to be the running time of searching a linked list, because the only way you can start is at the beginning, and the only way you can get through the list is by following these arrows. And if there's n nodes in the list, you're going to need as many as n steps to find something like 22, or 26, or 34, or any element at all. Well, that's not all that great. What about insert? What's an upper bound on the running time of insert? Well, here too it depends. Suppose that we don't care about keeping the list sorted. That's kind of a nice advantage, so I can be a little lazy here.
So, what's the running time going to be if I want to insert a new number, like the number 50, into this list, but I don't care about keeping it sorted? Well, instinctively, where would you put this element? Where would you put it? You might be inclined-- you kind of want to put it over here, because it's the biggest element. But again, if you don't care about keeping it sorted, where is the fastest, the quickest and dirtiest, place to put it? I would propose let's just put it at the front of the list. Let's take this first pointer, point it at the new number 50 that we've somehow added to the picture, as by calling malloc, asking malloc for a new node. And then have 50, in turn, point to the number 9, and then 9 can point to 17, and 22, and so forth. What if we want to insert another number, 42, and we don't care about where it goes? Well, why don't we just put it at the beginning of the list? Then we have the first pointer pointing at 42, which in turn should point at 50, which in turn can point at 9, then 17, then 22, and so forth. So, if we're just lazy about this, we can actually achieve a great running time for insert: constant time. Unfortunately, if we want to keep things sorted, then we're going to have to incur a linear time cost again, right? Because if we have to insert 42 or 50, worst case they might belong all the way at the end of the list, and that's big O of n steps. And delete, too, unfortunately, whether the list is sorted or unsorted, is also, like search, going to be big O of n, because you don't necessarily know when you're searching for a number to delete if it's going to be at the beginning, the middle, or the end. So, in the worst case, it might indeed be at the end. You know what? Why don't we, instead of walking through this verbally, see if we can't get some volunteers? Can we get seven volunteers to play-- wow, to play the role of numbers here? 1, 2, 3, 4, 5, 6, and yes, 7, come on up. All right, so I have here some printouts for all seven of you that represent exactly the nodes that we have here on the screen. Let's meet our first contestant. What is your name? AUDIENCE: Scully. SPEAKER 1: Scully, nice to see you. So, you shall literally be first and represent our first pointer. So, if you want to come and stand roughly over here. And then what is your name? AUDIENCE: Maria. SPEAKER 1: Maria, nice to see you. And you can be the number 9, right next to our first contestant. And your name? AUDIENCE: Sarah. SPEAKER 1: Sarah, nice to see you. You shall be the number 17. And your name? AUDIENCE: Satoshi. SPEAKER 1: Satoshi, nice to see you. You shall be 20. And your name? AUDIENCE: Mosof. SPEAKER 1: Mosof, nice to see you. And you shall be 22. And your name? AUDIENCE: Jed. SPEAKER 1: Jed, nice to see you-- 29, formerly 26. And your name? AUDIENCE: Erin. SPEAKER 1: Erin, nice to see you. You shall be 34. All right, so what we have here is seven elements, six of which are very similar to one another, one of which is fundamentally different. So, Scully here represents first, and indeed her sheet of paper is horizontal to suggest that she is not herself a node. She is just going to be the pointer to a node in the list. Everyone else's nodes are vertical, as have been the rectangles we've been drawing on the screen, because each of these guys represents a number as well as a next pointer. Now of course, you're only seeing in front of you the number.
So we're going to go ahead and, if you wouldn't mind, use your left hand to represent the arrow that we've long had on the screen, to point to the person next to you. And Erin, you're a bit of an anomaly, because you need to have a null pointer at the end of the list, so that you're not just pointing aimlessly. And pointing to the ground seems fine, so literally pointing to the ground we'll interpret as null. So, Scully, you are the only thing keeping this list together, so to speak. So, you, too, need to point, with your one pointer, to Maria there. So, here we have a linked list. And just to make this clear, could everyone separate from each other by a step or two in any direction? Notice that the list is still-- it's almost identical to before. Can some of you take a step forward, a step back, but still point at the other person? So, now we're capturing a little more accurately the fact that these nodes and these pointers can be anywhere in memory, so long as they're linking themselves together by way of these pointers. All right, so suppose now that I want to insert an element like 55, which happens to belong at the end of this list. Let me go ahead and malloc Christian, if I could. So we have asked malloc for a chunk of memory equivalent to the size of one integer and one pointer. That is going to be represented with this rectangle here. Nice to see you. AUDIENCE: Nice to see you. SPEAKER 1: You shall be 55. And I'll play the role of the temporary pointer and predecessor pointer, just using my left and right hands, to try to figure out where Christian belongs. So, just like we might want to search the list, inserting is fundamentally rather the same. The only thing I have access to at the outset of this algorithm is my first pointer, and it's only by way of Scully that I can even access the rest of the list. I cannot jump to the middle, jump to the end. I can only start at the beginning and literally follow the pointers. So, let me go ahead and do that. Let me go ahead and point at whatever Scully is pointing at, which happens to be Maria, which is the number 9. 55, of course, is bigger than that, and I do want to keep the list sorted for today's purposes. So I'm going to very carefully follow the next pointer: 17. Follow the next pointer: 20. Follow the next pointer: 22. Follow the next pointer: 29. Follow the next pointer: 34. Follow the next pointer: ah dammit, null. And so this is why it is important with some of these algorithms to have a predecessor pointer, a second pointer-- really, my left hand-- so that maybe my left hand can still point at Erin while my right hand realizes, ah, null, so that I still have access to the last node in the list, so that Christian-- if you could come over here-- I'm going to go ahead and tell Erin quite simply to point at Christian. Good, and let's just say, for the students' sake, come on over here, but technically we could have left Christian there and just had Erin pointing at him. It's just going to get a little confusing before long, so we'll just cheat and move you right over here. But now we have a linked list that has one additional member. Suppose now that we want to make another insertion-- pardon. Let me go ahead and propose that we insert, say, the number 5. Well, the number 5, of course, belongs at the beginning of the list. So you know what? I need to malloc. Can I malloc Jordan off camera, perhaps? So malloc, a very slow return value. OK, we're going to store in your node your n value, 5, here.
His pointer is undefined right now, because he's not actually pointing at anything. And so where does he ultimately belong? Well, he belongs at the start of the list. So, let me deliberately make a mistake. Let me go ahead and update Scully to point at Jordan, thereby putting Jordan effectively at the front of the list. Unfortunately, whom should Jordan now point at? Technically, it should be Maria, but this is code. The only thing we can do is copy pointers in memory, and if Scully's left hand is no longer pointing at Maria, I have literally orphaned the entirety of this list. I have leaked 1, 2, 3, 4, 5, 6, 7 chunks of memory, seven nodes, because I got my order of operations out of order. Indeed, I should have done what-- let's undo, control Z. And now let me go ahead and do what? Jordan should point at the exact same thing Scully is pointing at, which has no downside. Even though it feels redundant, we've not lost any information. And now that Jordan is pointing at Maria, Scully's pointer can be pointed at Jordan. And now, even though the list looks a little weird, this is a key feature of the linked list. These nodes could have been malloc'd from anywhere. So indeed, even though we initially kept everyone physically sorted left to right-- and you've all cleaned the list up even since-- that's OK. The point is that all of these nodes are linked together. So, thank you so much to our volunteers. You can keep these pieces of paper, and later on we'll have some stress balls for you. But that's the key idea here behind a linked list. Thank you. So, of course, there are some more complicated operations that we might have to deal with. For instance, if we want to insert into the middle of the list, that's going to be a little more of a burden on me, the programmer, keeping track of where things have to go. But nicely enough, there's only these three cases-- the beginning of the list, the end of the list, and the middle of the list, because middle of the list doesn't have to mean literally the middle, just anywhere that's not the beginning or the end. Of course, we should be careful to make sure that we handle the empty list scenario, which is equivalent to putting something at both the beginning of the list and the end of the list. But that would be perhaps a special case we could deal with separately. Of course, there are other operations like inserting-- or rather, removing from the tail of the list, removing from the head of the list, and removing in the middle. And that would be the opposite of malloc, if you will. And in those cases, we have to take care to call our friend free to free those bytes of memory, give them back to the operating system, so that we don't leak memory. But there, too, I'm probably going to have to be careful as to what order I change my pointers and free nodes in. Because what you don't want to do, and what unfortunately you might very well accidentally do at some point, is free a pointer and then try to access that pointer or change it, even after you've told the operating system I'm done with this address. That can give you what's called a segmentation fault, which is just one of the ways in which you can deduce that kind of mistake.
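Here, incidentally, is what that insert-at-the-front choreography might look like in code-- a minimal sketch, using the node type from before and prepend as a hypothetical name:

    #include <stdlib.h>

    // insert value at the front of the list; returns the new first pointer
    node *prepend(node *first, int value)
    {
        // malloc a new node (Jordan, as it were)
        node *n = malloc(sizeof(node));
        if (n == NULL)
        {
            return first;  // out of memory, so leave the list as is
        }
        n->n = value;

        // order of operations matters: first point the new node at whatever
        // first is pointing at, so that the rest of the list isn't orphaned
        n->next = first;

        // only then should first point at the new node
        return n;
    }

A caller would then say something like first = prepend(first, 5);.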
So, let's actually implement one of these methods. And we'll pluck off one that allows us to actually take a look at the syntax with which we can manipulate pointers. Let's go ahead and implement a function called search, for instance, where search, I propose, just returns a bool, true or false: this number n is in the given list. And now, why have I said node star list? Well, at the end of the day, a linked list is just a whole bunch of nodes. But the first of those nodes, that we keep calling first, is of what data type? If you have a pointer, a variable, that's pointing to a linked list, that means it's storing the address of a node, otherwise known as a node star. So, this would be the syntax with which you can pass to a function something like a linked list. You simply have to pass it a pointer to the first element in that list. And if I want to go ahead now and implement this, let me go ahead and propose the following. Let me go ahead here and give myself a temporary value, so node star pointer, we'll call it, PTR. And that's going to equal the start of the list. So, I'm just creating another box of memory and I'm storing inside of it the same address that I was passed in, just so that I have a temporary variable that I can use to update. After this, let me go ahead and say while that pointer is not equal to null-- because recall that null is this special sentinel value that means end of the list. So inside of this loop, what do I want to do? I'm going to go ahead and say if pointer-- and now I have to get at the number inside of it. So, if I recall from last time, we only spent a little bit of time on the student example, but we said something like student dot name or student dot dorm. And in this case I'm inclined to say pointer dot n, where n is the number, the integer, that's inside. But pointer this time is not a struct, per se. It's the address of a node. It's the address of a struct. And so, perhaps the most intuitive piece of syntax in C, at least retrospectively now, is that if you want to access a piece of data that's inside of a node and you have a pointer to that node, much like our arrows in the pictures imply, you literally draw an arrow, using a hyphen and then a right angle bracket. So, now if we do this in C-- whoops, let me finish my thought. If pointer arrow n equals equals the n we're looking for, let me go ahead in here and say return true. Or else, let me go ahead and not return false, because I don't want to just check one element and then blindly say false. I instead want to say pointer should get pointer arrow next. And then only after that loop is all complete should I say something like nope, return false. So, what's actually going on here? The function declaration, again, took in two arguments-- one, an int n that we're looking for; two, a pointer to a node, otherwise known as a linked list. And per the pictures we've been drawing, you can access any other element in that linked list by way of the first element in that list, as suggested here. So, now I'm just giving myself a temporary variable called pointer, but I can call it anything I want. And I'm declaring it as node star, so that it can store the address of a node as well, and then I'm just initializing it to be the exact value that was passed in. So, I don't want to accidentally break the list that I was passed. I don't want to change the value of my parameter unnecessarily and complicate things. I just really want a temporary variable, much like i in the world of loops, that allows me to constantly iterate through something and update it as I go, all the while checking this condition: while pointer is not null.
If pointer is null, that means I'm at the end of the list, or, maybe more curiously, I was passed null, in which case there is no list. And that's a valid scenario that could happen, even though it's a bit strange. But if pointer is null, I don't want to proceed further inside of this loop. But so long as pointer is not null, let me go ahead and do this. Let me follow that pointer and go inside that node and say, is your n value equal, equal to the n value that I've been asked to search for? And if so, return true. I don't want to just return false now, because otherwise I'd only ever be checking the first element in a linked list. So, I now want to do the equivalent in spirit of i plus plus. But I'm not using i's. I don't need to use pointer arithmetic here, and indeed it won't work, because I have this thing stitched together. It's not an array of contiguous memory. I want to say that my current temporary value, pointer, should get whatever pointer arrow next is, and then let this loop continue. If it's not null, check again whether its n value equals what I'm looking for, and repeat, repeat, repeat until pointer equals null. So, let's make this more concrete. Let me go ahead and just draw a temporary picture here. And let me suppose here that what I have been passed is something like the following. Let's do a very simple linked list that has maybe the number one, and has the number two, and has the number three. And, again, I've drawn gaps between these nodes, because they could be anywhere in memory. So, technically they don't need to be left to right like this, but that'll keep us sane. And if this is indeed a correct linked list, there are pointers in each of those fields that point to the next node in the list, and that slash I drew in the last one just means null. You can draw it however you want. But for a linked list to work, we need to know the beginning of this thing. So, we'll call this first, and that of course has to point to the first element in my linked list. So, here's the state of our world. It's a linked list quite similar in spirit to the other one. It's a little shorter, just for the sake of discussion. And now, let's consider my function search, which again takes two arguments. So, the first argument is of type int, called n. And suppose I'm searching for the number three. The second argument is a node star, so the address of a node, called list. So, what does that mean? When this function search is called, let's suppose that we currently have the value n, which is going to be 3, because that's arbitrarily the number I've decided to search for. And then this other value pointer is going to be initialized-- sorry, not that. List is going to be whatever pointer is passed in as the second argument to this function. So, let's suppose that this linked list, this sample linked list at the top, is indeed what is passed in to this function. So, I've passed in 3, because I want to search for 3. So what goes here? If I pass this sample linked list into my search function, what is the value of list? Well, list, if I'm passed this first pointer, is really going to be the pointer to the first element in the list. That's all we're saying. Node star list just means give me the address of a linked list, which means give me the address of the first node in the linked list, which means that initially, when I call search, my picture-- my stack frame, if you will, in terms of my local arguments-- is going to look like this. All right, so with that said, how does this code work?
Recall that in this code we have this while loop that just sits in a loop checking whether the current node's n equals equals the one we're looking for, and if not, it updates our pointer. So, we need one more local variable called pointer that's initialized to the start of this list. So, this will be a pointer, and it's initialized to the same thing that my second argument is initialized to. So, this now is the state of our world once one line of code has executed, that very first one in there. So, now let's implement the loop. While pointer does not equal null-- so here's pointer. Does it equal null? No, because if it did, we would just draw a slash or some other piece of syntax. But it's clearly pointing at something that exists. So, this node here has some valid address. Pointer is pointing at it, so it's not null. So, what do I do inside of my code? I check if the n value inside of pointer, PTR, equals equals the number I'm looking for, and if so, return true; otherwise, I update pointer. So let's check. Let's follow the arrow, PTR, and look at the value n. Recall that the top of these boxes is n. The bottom of them is called next-- n, next; n, next. So, I follow this pointer. I'm looking at the box called n. Does 3 equal 1? No, obviously not. So I update-- pointer gets pointer next. So, to be clear, pointer gets pointer next, that second to last line of actual code. So, what does that mean I need to do? That means I need to update pointer to be equal to pointer next. What is pointer next? Well, here's pointer and here's next. We were looking a moment ago at n. Now I'm looking at next. So, pointer next means that I should update whatever is inside this box here on the screen to be equal to pointer next, which is this field. This field is pointing at that, so that line of code has the effect of updating PTR to simply point at the second node. So, what happens next? I seem to still be in that loop, and I say, well, pointer does not equal null-- and it doesn't. It's pointing at that second node. If pointer arrow n equals equals n-- but no, that's not the case, because I'm looking for three. I'm pointing at two, so that is again false. So, again, I don't return true. I instead update pointer to equal pointer next. So, what has to happen here, at the risk of deleting my handiwork again, is now pointer gets pointer next, which is this field, which is equivalent to pointing at this node here. And so, now I'm still inside that loop. While pointer not equal to null-- it's not null. If pointer arrow n equals equals n-- well, let's follow that logic. If pointer, follow the arrow, n equals equals n: three, which is the one I'm looking for. Return true. And so, how then does this function ultimately behave? It would seem in this case to return true, because I have eventually found that number three. What would happen, by contrast, if I were looking not for three but for four with this code? In other words, what if I'm not looking for three and I want to go one step further? Well, one step further is going to update PTR to equal null, that slash in my last node. And that means, code wise, I'm going to break out of that loop, because pointer now does equal null. And so, by default, that very last line of code returns false: not found. So, complicated at first glance, and it certainly looks more complicated than things we've written before, but again, if you go back to basics, what does each of these lines mean? Consider that there's no magic here.
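Here, in fact, is the whole function in one place, just as we've been tracing it:

    #include <stdbool.h>
    #include <stddef.h>

    // returns true if n is in the linked list, else false
    bool search(int n, node *list)
    {
        // temporary pointer, initialized to the start of the list
        node *ptr = list;

        // walk the list until we fall off the end
        while (ptr != NULL)
        {
            if (ptr->n == n)
            {
                return true;
            }
            ptr = ptr->next;  // the spirit of i++
        }
        return false;
    }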
This first line means give me a variable that's a pointer to a node-- in other words, the address of a node-- and assign it whatever I was passed in. While pointer not equals null: we've seen null before. It's this special zero value, and I'm just making sure that the pointer I'm using, PTR, does not equal that special value. And then inside of this loop I'm using one piece of new syntax, this arrow notation, which, just like the picture suggests, means go there, and then look at the field called n, and check if it equals the n you're looking for, and if so, return true. Otherwise, update yourself, much like i plus plus, but specifically update pointer to be whatever the value is when you follow the arrow in that next field. So, this of course is just search. We've not actually changed the list, but imagine, if you will, that you could now implement insert and delete not simply by following these pointers but by actually changing the value of next in a node to the left, a node to the right, or a new node altogether. So, who cares? Why did we add all of this complexity? We had arrays, which were working really well for a whole bunch of weeks, and now we've claimed that arrays are not so good. We want to use linked lists instead. But why might we want to use linked lists? Well, linked lists give us dynamism. We can call malloc and give ourselves more, and more, and more nodes and grow our list of numbers, even if we don't know in advance how many such numbers we need. And we can shrink them, similarly, so we don't have to allocate a massive array unnecessarily. We can shrink our data structure based on how many numbers we actually need. But we're paying a price. Search is a little bit slower, delete is a little bit slower. Insert would be slower if we insist on keeping things sorted, so we've paid this price. And indeed, this is thematic. In CS50, and in computer science more generally, there are often going to be these trade-offs of time, or space, or just complexity, or your human time. Any number of resources can be in scarce supply. And, indeed, we've seen by way of linked lists that we're solving one problem while introducing another. It's like that silly situation you might see memes of, where you put your hand around a hose that has a leak, and all of a sudden another leak springs up over there. We're just moving the problem elsewhere, but maybe that leak is less problematic to us than this one here. So, again, it's this theme of trade-offs. Now, this here is Mather Dining Hall. And this, of course, is a whole bunch of trays, where you might go and get some food-- lunch, breakfast, or dinner, or the like-- and you pick up a tray and you put food on it. But what's interesting about trays, at Mather and at a lot of cafeterias, is that trays are stacked one on top of the other. And it turns out, now that we have this second building block with which to create data structures-- we're not just using arrays anymore; we now have pointers and this general idea of linking nodes together in our toolkit-- we can now start to imagine more interesting data structures that solve problems in slightly different ways. For instance, suppose that I wanted to implement this paradigm of stacking things one on top of the other like this. Indeed, this is a data structure called a stack, and it generally has two operations associated with it: push, to push something onto the stack, and pop, to take something off of the stack.
And this is perhaps a useful data structure if you just want to store numbers or something else in really just the most efficient way for you, without regard really to fairness. So, for instance, if this table here is my initial data structure and it's empty and I have a piece of information that I want to store, I'm just going to go ahead and put it right there. And now suppose I want to push another number onto the stack. I'm just going to go ahead and simply put it on top, third number on top, fourth number on top. But I've now committed to a certain property, if you will, a certain property whereby the last tray in has to be the first tray out, otherwise known in computer science as LIFO-- last in, first out-- because if I want to get that first number, I mean, I've created a mess for myself. I have to lift these all up or move them just to get at it. That just seems stupid. Intuitively, the easiest thing to grab is probably going to be the top, but that's not necessarily the first element I put in. But that's OK. This might still be a valid data structure-- and indeed later in the term, when we introduce web programming and we look at languages like HTML, there's actually a number of applications where it's actually super useful to have a data structure where you can just stack stuff on top of each other in order to tuck some data away for subsequent use. And, indeed, when we talked about memory in our own computer, stacks clearly have some value. We talked about main and then swap, and then maybe other functions. There are many contexts, one of which we've seen already, where in life you actually want to stack things on top of each other so as to keep track of really, what did I do most recently? Because that's the next problem I'm going to deal with, or the next frame-- in the case of memory-- that I'm going to pop off of the stack. So, how might we implement this data structure? Let me propose that if we want to define our own data type called a stack that implements that idea of cafeteria trays or frames in memory, let me go ahead and typedef a structure called stack, inside of which are two data members. Suppose that, for the sake of discussion, I'm not going to try to store trays or something that doesn't really exist computationally, but rather numbers, just integers. Inside of this structure called a stack is going to be an array called numbers, of type int, whose length is CAPACITY. So CAPACITY, let's assume, is hash defined elsewhere to be some constant. So, maybe it's 10, maybe it's 1,000, but it's some constant integer elsewhere that limits, ultimately, the capacity of the stack to some integral value. And then size-- this seems weird. I have capacity here and then size here. Well, there's a semantic distinction here. Just because you have a stack ready to go, as I did a moment ago-- just because I had this stack ready to go, empty initially-- it's going to have some capacity. Realistically, I can only go as high as the ceiling, or until the things fall over. But there's also a size here, and the size is currently zero, but the capacity might be like 1,000 or whatever. So, that's the difference there. And the size now is 4, and the capacity is like 1,000 minus 4 at this point-- or rather, capacity is still 1,000, because that's the total possible size, not the actual size. But what if that's a limit? What if I don't want to restrict myself to some fixed, finite number of trays or a fixed number of numbers? Well, I could instead declare my stack as being a pointer.
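In code, those two designs-- the fixed-capacity one just described and the pointer-based one about to be-- might look like this; a sketch, assuming CAPACITY is hash defined as described:

    #define CAPACITY 1000

    // a fixed-capacity stack: at most CAPACITY ints, ever
    typedef struct
    {
        int numbers[CAPACITY];
        int size;
    }
    stack;

    // a dynamic stack: numbers now points at however much memory
    // we ask malloc for, e.g. numbers = malloc(5000 * sizeof(int));
    typedef struct
    {
        int *numbers;
        int size;
    }
    dynamic_stack;  // hypothetical name, just to distinguish the two here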
Now, this pointer initially has no value, so let's assume that it's probably going to be initialized to null in my code, but that alone is not going to be useful. Why would I declare numbers now not to be an array, which felt very straightforward? Normally we draw arrays left to right, but with the trays, just imagine turning an array vertically and thinking of the numbers as stacked on top of each other. But this is just a pointer to a number. Why might I want to implement a stack as just a pointer to an int? That seems wrong. I want lots of numbers, not one number. So, what could I do? Well, what if, in this world, to implement a stack I invoke our friend malloc and I say to malloc, malloc, give me enough memory for 2,000 numbers or 5,000 numbers? What is malloc going to return? Well, by definition, we know malloc is going to return the address of a chunk of memory, and that chunk of memory is going to be of whatever size I ask malloc for, and the address of the first of those bytes is really just equivalent to the address of one integer. And so long as I, the programmer, remember that I asked malloc for 2,000 integers or for 5,000 integers, I know implicitly the end of that chunk of memory, and malloc just needs to tell me the beginning. So, it's perfectly fine to implement the stack by way of a single pointer, because all I need to know is, hey, malloc, where should I put my first integer? Because I know, via pointer arithmetic, per last week, that I can put my next integer four bytes later, four bytes later, four bytes later. And I'm deliberately going up this time, but it really is just an array, where you can think of the array as left to right. So, this would be a way of giving ourselves a data structure called a stack that is not fixed from the outset, like this previous version, to some specific capacity. Now, we are limited only by how much physical memory or virtual memory my computer actually has. So, suppose Apple or someone similar implemented the lines outside their stores for the release of the iPhone as a stack. So, it's weird maybe to think of people stacking on top of each other, but maybe you could imagine Apple funneling everyone into the glass store here in Manhattan, and then whoever is the last one in gets their phone first. Because why? They're closest to the exit. So, you have all these people show up super early in the morning or days before, you pile them all into the store, saying, everyone, hey, please go into the corner there. Please get into the store. And then as soon as 9:00 AM rolls around and it's time to give out the iPhones, just for logistical convenience you realize, all right, why don't we just give the person who came in last their phone first, because they're closest to the exit, and get them out-- last in, first out? Good design, bad design? It's correct insofar as everyone's going to get an iPhone if supply is there, and that's never going to be the case. So, it's not necessarily very equitable or fair, and indeed the humans are not going to be very pleased with Apple if they used a LIFO data structure, or a stack. What would these fans of Apple hardware prefer that Apple use? We call it a line. If you go to the UK, they call it a queue, which is actually a perfect answer, because there's this other data structure in the world called a queue, which is exactly what you would hope the Apple store line would be: a line whereby it's first in, first out.
So, the first person there, three days before at 5:00 AM, gets his or her phone first, and the one person who comes in at 9:01 AM doesn't get their phone, because they're at the last position in the queue, or the list. And a queue, nicely enough, might just have at least two operations-- enqueue and dequeue-- whereby enqueue means get into the line and dequeue means get out of the line, but these happen at different places. For instance, if there's a whole bunch of people lined up here on the stage, closest over there is the front of the list. I get here last. I enqueue myself at the end of this data structure, but you dequeue someone from the beginning of the list. By contrast, when we had a stack, when you push someone onto the stack, you pop him or her off first, by nature of it being a LIFO data structure. So, how might we implement a queue? It's actually slightly more complicated-- 50% more pieces of information you need to keep track of, namely the front of the list. But you can still do it in an array. So, suppose that we do use an array, and let me go ahead and draw this as follows. Suppose that, like hopscotch, we draw the queue for an Apple Store like an array, like this. And here is the door of the Apple store, so you want to be at location zero, ideally-- 1, 2, 3, 4, 5, 6-- so this is how many people can fit into our queue in this case. So, suppose that Alice wants to buy an iPhone and she gets to the store first. Where should she go to keep things fair? This is the queue, so we don't want to put her into the corner, so to speak, as in our first example. We want to put her at the front of the list. So, Alice belongs right there, pretty straightforward. Now, Bob arrives, and he comes in slightly after Alice, so he gets to get behind Alice in line. And so Bob is there, and maybe Charlie arrives thereafter, and then so forth. David maybe comes in fourth, and beyond. So, that's how people would queue up, so to speak. And now, when it's time for Apple to open this door and start selling iPhones, what happens? We want to take Alice out of the list first. We want to dequeue Alice. So, we need to start remembering some information, because it's not sufficient now to just remove-- whoops-- it's not sufficient just to remove Alice like this, because suppose that we do keep adding other people: person d, e-- whoops, that's not an f. OK, don't know what happened there. Person f is here, g, h. Suppose that Alice has bought her phone and left the store, and then Bob does the same. He goes ahead and leaves the store, and then Charlie leaves the store. Where do I put person i, who maybe shows up a little late? It would seem that I want to put them at the end of the queue, which makes good sense, but right now d, e, f, g, and h are still in the queue, and this is an array, I proposed. My data structure's an array, so I can't just move d to the front of the line easily. I have to actually shift him, or move him, and this might conjure up some memory of our searching and sorting examples, where, when we had our humans on stage, we actually had to physically move people, like in insertion sort, to make room for those elements. And that's fine. We can absolutely say, hey, David, come over here please, and person e, come over here please, and that's obviously what the Apple store does. But when you're implementing this idea in memory, you can't just ask the numbers themselves, or the elements themselves, to do that moving.
You need to do it for them, and that's going to cost you time, and that's going to be a price that you have to pay. But I propose that we can be clever here. We do not need to incur the cost of moving d, e, f, g, h to where Alice, Bob, and Charlie previously were. Where can we put person i instead? I mean, there's obviously room in the line, so maybe why don't we just put person i here? But, again, we don't want to piss off everyone who's already in the line. So, if this now is an array, we have to be mindful of the fact that the front of this list has to be remembered separately. This data member here, front, really should store not 0 in perpetuity but, over time, 0, 1, 2, 3. It should store the current front of the list. And I need another variable, presumably, called size, that keeps track of how many elements are in the list, which in this case is going to be six. So, with a queue, if I'm implementing it using an array, there's some added complexity if I want to avoid the inefficiency of moving all of these elements and incurring the kind of running times we saw when we talked about searching and sorting the other day. There's no reason mechanically to move d, e, f, g, h anywhere in the array. We can, in constant time, maybe with our old friend the modulo operator that you might have used in [INAUDIBLE], just figure out where i, and j, and everyone else should go, so long as we keep track separately, with a queue, of what the front of the list is. And why is this 3? Well, if I continue numbering the array like this, as we've often done, you can now see that d is the head of the list, or the front of the list. And so, we should remember his location there as 3. But, again, what happens if j, k, and then l show up? There is no room for l in this world, not to mention m, n, o. So what if we solve this problem as before, by changing the array from having some fixed capacity to having no pre-determined capacity-- just use a pointer, so that we can use malloc to dynamically allocate a big chunk of memory, remember its capacity ultimately, but also remember the front and the size of this data structure? So, the same idea there might apply. So, at the end of the day, what have we done? We've taken these new building blocks, pointers, and this notion of linking things together using pointers, much like a linked list, and we've looked back at our data structure called an array. And using these now, we can implement what are generally called abstract data types. A queue and a stack do not have as low-level a meaning as an array does, which is a very technical concept, or as a linked list does, which is also a very technical concept-- it's a node with pointers linking things together. A stack is like a stack of cafeteria trays, and a queue is something like people lining up outside of an Apple Store. These are abstract data types that can be implemented, clearly, underneath the hood in at least a couple of different ways. You can use an array and keep things simple, a la weeks two and beyond in the class, but you're going to paint yourself into a corner by fixing their size, as I did a moment ago, by declaring this queue, and before it a stack, to be of a fixed capacity. But now that we have pointers, and malloc, and dynamic memory allocation, and this spirit of linked lists, we can change those arrays to be pointers instead and just remember where things are underneath the hood. And nicely enough, a stack and a queue don't even need me to stitch things together like a linked list. I just need malloc in order to allocate those.
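Putting those pieces together, here's a sketch of such an array-based queue-- the struct plus the wrap-around arithmetic, with enqueue and dequeue as hypothetical helpers and CAPACITY again assumed to be hash defined:

    #include <stdbool.h>

    #define CAPACITY 7

    // a queue of at most CAPACITY people (chars here: 'a' for Alice, etc.)
    typedef struct
    {
        char people[CAPACITY];
        int front;  // index of the current front of the line
        int size;   // how many people are in line right now
    }
    queue;

    // get into the line; returns false if the line is full
    bool enqueue(queue *q, char person)
    {
        if (q->size == CAPACITY)
        {
            return false;
        }
        // modulo wraps around past the end of the array, so person i can
        // reuse the slot that Alice vacated, without shifting anyone
        q->people[(q->front + q->size) % CAPACITY] = person;
        q->size++;
        return true;
    }

    // get out of the line; caller should first check that size > 0
    char dequeue(queue *q)
    {
        char person = q->people[q->front];
        q->front = (q->front + 1) % CAPACITY;
        q->size--;
        return person;
    }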
So, let's tie these two topics together, if you would, and compare and contrast them by way of a wonderful animation that a fellow educator made and posted online, which we've abbreviated here to give us a sense of the difference between stacks and queues. [AUDIO PLAYBACK] [MUSIC PLAYING] -Once upon a time, there was a guy named Jack. When it came to making friends, Jack did not have the knack. So, Jack went to talk to the most popular guy he knew. He went to Lou and asked, what do I do? Lou saw that his friend was really distressed. Well, began Lou, just look how you're dressed. Don't you have any clothes with a different look? Yes, said Jack. I sure do. Come to my house and I'll show them to you. So, they went off to Jack's, and Jack showed Lou the box where he kept all his shirts, and his pants, and his socks. Lou said, I see you have all your clothes in a pile. Why don't you wear some others once in a while? Jack said, well, when I remove clothes and socks, I wash them and put them away in the box. Then comes the next morning, and up I hop. I go to the box and get my clothes off the top. Lou quickly realized the problem with Jack. He kept clothes, CDs, and books in a stack. When he reached for something to read or to wear, he chose the top book or underwear. Then when he was done he would put it right back; back it would go, on top of the stack. I know the solution, said a triumphant Lou. You need to learn to start using a queue. Lou took Jack's clothes and hung them in a closet, and when he had emptied the box, he just tossed it. Then he said, now, Jack, at the end of the day, put your clothes on the left when you put them away. Then tomorrow morning, when you see the sunshine, get your clothes from the right, from the end of the line. Don't you see, said Lou. It will be so nice. You'll wear everything once before you wear something twice. And with everything in queues in his closet and shelf, Jack started to feel quite sure of himself, all thanks to Lou and his wonderful queue. [END PLAYBACK] SPEAKER 1: All right, so let's take a look at another data type, this one known as a tree. Because now that we have the ability to stitch data structures together, much like a linked list, we have the ability to stitch things together not just left to right or top to bottom conceptually, but in any number of directions. And indeed, there's nothing stopping us from having one node link, by way of multiple pointers, to multiple nodes. So, for instance, this picture here from a textbook is a tree structure. And it's very much like the family trees that you might have drawn in grade school or the like. But in this case, you have just one root node-- the node at the top of the data structure, so to speak, from which everything else descends. And that node is said to have children. For instance, 2 and 3 are children of node number 1 here. And then there are other semantics in this world of trees in computer science. Much like family trees, anything that does not have children-- like 5, 6, and 7, or 8 and 9-- would be called leaves of the tree, because, like the leaves at the end of the branches, there is nothing beyond them. So, nicely enough, we borrow a lot of the language from family trees and actual trees in order to discuss this data structure known as a tree. But why in the world would we want to lay out data in a tree structure? At first glance, we just seem to be doing things because we can.
Because, for instance, suppose we had these numbers-- 22, 33, 44, 55, 66, 77, and 88. They're clearly sorted. And suppose that I wanted to lay these out in a data structure and be able to search them efficiently, assuming the whole time that they are indeed sorted. Well, if we wanted to do that, we have our old friend, arrays, from weeks ago. And we also have our old algorithm from Mike Smith, our binary search algorithm-- divide and conquer. And we can find numbers in this data structure super, super fast, in logarithmic time, big O of log n. So, we've solved that problem. But it turns out we don't necessarily have to use an array laying out data from left to right, because, again, one of the prices we pay for using arrays, as we've realized today, is this finiteness. At the end of the day, the size of an array is fixed. You have to decide in advance how big your array is going to be. So, what if you want to add more numbers to it? What if you want to remove numbers, for efficiency, and not waste so much memory? You can't really do that with an array. You can, but you have to jump through some hoops. You have to reallocate the array, as with a function like realloc, if you indeed used malloc in the first place to allocate it. But then you have to copy the old array into the new array. So, it's all possible-- nothing's impossible once you have a keyboard at your disposal-- but it's a lot of work, and it's more time, and it's therefore expensive, both in terms of your time and the computer's time. But could we achieve the beauty of divide and conquer and binary search from week zero without the constraints that arrays impose? And today, the solution to all of our array problems seems to be linked lists, or more generally pointers, so that, one, we can dynamically allocate more memory with malloc when we need it and then use pointers to thread or stitch together that node with any existing nodes. So, indeed, let me propose this variant on a tree structure that the world calls binary search trees, or BSTs. Binary in this case means two, and this just means that every node in this tree is going to have at most two children-- 0, 1, or 2. And, in this case, binary search tree means that for every node in the tree, its left child is less than it and its right child is greater than it. And that's a recursive definition. You can look at the root of this tree and ask that same question. 55? Is it greater than its left child? Yep. Is it less than its right child? Yep. That is the beginning, it would seem, of a binary search tree. But it's recursive insofar as this is indeed a binary search tree only if those answers are the same for every other node in the tree. 33, is its left child smaller? Yep. Is its right child bigger? Yep. How about over here, 77? Left child smaller? Yep. Right child bigger? Yep, indeed. How about the leaves of the tree? Is 22 greater than its left child? I mean, yeah-- there is no child, so yes, that's a fair statement. It certainly doesn't violate our guiding principle. Is it less than its right child, if any? Yes, there just isn't any. And so this is a binary search tree. And indeed, if you took scissors and snipped off any branch of this tree, you would have another binary search tree, albeit smaller. It's recursive, and that definition applies to every one of the nodes.
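For contrast, before implementing that idea with pointers, recall what divide and conquer looks like over a sorted array-- a minimal sketch, with names of my own choosing:

    #include <stdbool.h>

    // Binary search over a sorted array: O(log n) lookups, but the
    // array's size n is fixed-- the very constraint we want to escape.
    bool binary_search(const int values[], int n, int target)
    {
        int lo = 0, hi = n - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (values[mid] == target)
            {
                return true;
            }
            else if (values[mid] < target)
            {
                lo = mid + 1;   // throw away the left half
            }
            else
            {
                hi = mid - 1;   // throw away the right half
            }
        }
        return false;
    }

A binary search tree aims to buy us this same logarithmic search while still letting the data structure grow and shrink.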
But what's beautiful here now is that if we implement this binary search tree, similar in spirit to how we implemented linked lists-- using not arrays but pointers, and not one pointer but two, whereby every node in this tree apparently has up to two pointers-- let's call them not next but left and right, just to be intuitive. Well, if every node has a left and a right pointer, now you can conceptually attach yourself to another node over there and another node over there, and they too can do the same. So, we have the syntax already, with our pointers, with which to implement this. But why would we? Well, one, if we're using pointers now and not an array, I can very, very easily allocate more nodes for this tree. I can insert 99 or 11 really easily, because I just call malloc like I did before, I put the number 99 or 11 inside of that node, and then I start from the root of the tree, much like I start from the first element in a linked list, and I just search for its destined location, going left or right based on the size of that value. And what's nice, too, here is: notice how short the tree is. This is not a linked list. It's not a long list, whether vertically or horizontally. This tree is very shallow. And indeed, I claim that if we've got n elements in this tree, its height, it turns out, is log of n, give or take one or so. But that's compelling, because how do I search this tree? Suppose I'm trying to answer the question, is 44 in my tree? How do I answer that? Well, we humans can obviously just look and say, yes, 44 is in it. That's not how a computer works. We have to start from what we're given, which in this case is going to be, as the arrow suggests, a pointer to the tree itself-- a pointer to its first node. And I look: is this the number 44? Obviously not. 55 is greater than 44, so I'm going to go down to the left child and ask that same question. 33, is this 44? Obviously not, but it's less than it, so I'm going to go down to the right child. Is this 44? Yes. And simply by looking at three nodes have I whittled this problem down to my yes/no answer. And indeed, you can think of it again with scissors. I'm looking at 55 at the beginning of this story. Is 44 55? No, 44 is less. Well, you know what? I can effectively snip off the right half of that tree, much like I tore that phone book in week zero, throwing half of the problem away. Here I can throw essentially half of the tree away and search only what remains, and then repeat that process again, and again, and again, whittling the tree down by half every time. So, therein lies our logarithmic running time. Therein lies the height of the tree-- so long as I am good about keeping the tree balanced. There's a danger. Suppose that I go ahead and start building this tree myself in code, and I'm a little sloppy about doing that. And I go ahead and I insert, for instance, let's say the number 33. And it's the first node in my tree, so I'm going to put it right up here at the top. And now suppose that the next number that just happens to get inserted into this tree is 44. Well, where does it go? Well, the root has no children yet, and 44 is bigger, so it should probably go over here. So, yeah, I'll draw 44 there. Now, suppose that the inputs to this problem are such that 55 is inserted next. Where does it go? All right, 55, it's bigger, so it should go over here. And then 66 is inserted next. All right, it goes over here-- never mind that.
So, what's happening to my binary search tree? Well, first of all, is it a binary search tree? It is, because this node is bigger than its left child, if any-- there just isn't any-- and it's less than its right child. How about here, 44? It's bigger than its left child, if any-- because there is none-- and it's smaller than its right child. The same thing is true for 55, and the same thing is true for 66. So, this is a binary search tree, and yet somehow, what does it look like? It looks like a linked list, right? It's at a weird angle-- I've been drawing everything horizontally, but that's a meaningless artistic detail. It devolves potentially into a linked list. And so, binary search trees, if they are balanced, so to speak-- if they are built in the right order, or built with the right insertion algorithm, such that they do have this balanced, logarithmic height-- do afford us the same logarithmic running time that the phone book example did and our binary search of an array did. But we have to do a little bit more work in order to make sure that these trees are balanced. And we won't go into detail as to the algorithmics of keeping a tree balanced. But realize, again, there's going to be this trade-off. Yes, you can use a binary search tree, or trees more generally, to store numbers. Yes, they can allow you to achieve that same binary, logarithmic running time that we've gotten so used to with arrays, and they also give us dynamism, such that we can keep adding or even removing nodes. But it turns out we're going to have to think a lot harder about how to keep these things balanced. And indeed, it's in higher-level CS courses-- courses on data structures and algorithms-- that you'll explore concepts along exactly those lines. How would you go about implementing insert and delete into a tree so that you do maintain this balance? And there are yet more variants of these kinds of trees that you'll encounter accordingly. But for our purposes, let's consider how you would implement the tree itself, independent of how you might implement those actual algorithms. Let me propose this type of node. Again, node is just a very generic term in programming for a container for one or more other things, and this time those things are an integer-- we'll call it n, but it could be called anything-- and two pointers. And instead of next, I'm going to just, by convention, call them left and right. And as before, notice that I do need to declare struct node up here, or some word up here, by convention typedef struct node, because C reads things top to bottom, left to right. So, if I want to refer to a node inside of a node, I need to have that vocabulary, per this first line, even though later on I just want to call this whole darn thing a node. And so, that's the distinction. This actually has the side effect of creating a data type with two different names. One is called struct node, and you can literally in your code write struct node something, struct node something. It just feels unnecessarily verbose, so typedef allows you to simplify this as just node, which refers to the same structure. But the longer name is necessary for this innermost implementation detail.
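In code, that declaration might look like this, per the description above:

    typedef struct node
    {
        int n;                  // the integer this node stores
        struct node *left;      // pointer to the left child, if any
        struct node *right;     // pointer to the right child, if any
    }
    node;

So, now that we have the ability, with a structure, to represent this thing, what can we actually do with it? Well, here is where recursion from a few weeks ago actually gets really compelling.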
When we introduced that sigma example a while ago and talked in the abstract about recursion, frankly, it was kind of hard to justify it, unless you actually have a problem that lends itself to recursion in a way that makes it sensible to use recursion and not just iteration-- for loops, while loops, do-while, and the like. And here we actually have a perfect incarnation of that. What does it mean to search a binary search tree? Well, suppose I'm searching for a number, n, and I'm being given a pointer to the root of the tree, which I'll call tree. So, just like when I was searching a linked list, I'm given two things: the number I'm looking for and a pointer to the first thing in the data structure-- the first thing in a linked list, or the first thing in a tree. And in this case, we would call that first thing in a tree the root, generally speaking. So, the first thing I had better do in my search function is check: wait a minute, if tree equals equals null, don't do anything. Do not risk touching any pointers, because, as you may have gleaned already, or soon will with some of CS50's problems, you will quite probably cause a segmentation fault-- a memory-related problem in your program such that it just crashes. It literally says segmentation fault on the screen if you touch memory that you should not. And you should not touch null values. You should not go to null values. You should not do star of any value that itself might be null. And so, if tree equals equals null is super, super important here, because I want to make sure to immediately say, well, if you hand me null, that's like handing me no tree whatsoever, so my answer is obviously false. N can't be in a non-existent tree. But we need that condition up top, because the next case is the following. Else if n-- the value we're looking for-- is less than the value of n in this node. Tree, recall, doesn't refer to the whole thing, per se, in this context; it refers to the current node that we've been passed, which at the beginning of the story is the root of the tree. So, if the number n in the root of the tree is greater than the number we're looking for, we want to go to the left and search the left subtree; else, we want to go to the right and search the right subtree. So, what's the syntax here? If the n we're looking for, like 44, is less than the value at the current node, 55, then what do we want to do? We want to call search, still searching for the same number n, but searching on the left subtree. And how do you pass in a pointer to the left subtree? Well, you have in tree a pointer to the root node. Tree arrow left just means go to my left child, and pass that in instead-- pass its address in instead. Meanwhile, if the number you're looking for is actually greater than the value in the current node, search the right subtree; else, return true. Because if the tree is not null-- if there is actually a tree-- and the number you're looking for is not less than the current node and it's not greater than the current node, it must be the current node, so you can return true. But there's one important detail here. I didn't just call search. I called return search in each of these two middle cases. Why is that? Well, this is where recursion gets potentially a little mind-bending. Recursion is the act of a function calling itself.
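Assembled, that function might look like this sketch in C, using the node type from a moment ago:

    #include <stdbool.h>
    #include <stddef.h>

    bool search(int n, node *tree)
    {
        // Base case: a null tree can't possibly contain n.
        if (tree == NULL)
        {
            return false;
        }
        // If n is smaller than this node's value, defer to the left subtree.
        else if (n < tree->n)
        {
            return search(n, tree->left);
        }
        // If n is bigger than this node's value, defer to the right subtree.
        else if (n > tree->n)
        {
            return search(n, tree->right);
        }
        // Otherwise n is neither smaller nor bigger: we've found it.
        else
        {
            return true;
        }
    }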
Now, recursion in and of itself sounds bad, because if a function calls itself, why wouldn't it call itself again, and again, and again, infinitely many times, such that you get a stack overflow, where all of those frames on the stack hit the heap and bad things happen? But no, recursion works beautifully, so long as every time you recurse-- every time a function calls itself-- it takes a smaller bite of the problem. Or, put another way, it throws away half of the problem, as in this case, and looks only at the remaining half. Because if you keep shrinking, shrinking, shrinking the problem, you will eventually hit a base case, where either there is no more tree or you're looking right at the node that you want to find. And so, by returning search on tree arrow left, or returning search on tree arrow right, you're deferring the answer. When you, the search function, are called and asked, is the number 44 in this tree, you might not know, because the node you're looking at at the beginning of the story was, again, 55. But you know who does know? I bet my left child will know the answer if I just ask it-- by passing to search a pointer to my left child, along with that same number 44. So, saying return search is like saying, I don't know-- ask my left child. Or, I don't know-- ask my right child, and let me return as my answer whatever my child's answer is instead. So, you could write this same function using iteration, but you can solve it arguably much more elegantly using recursion, because a data structure like this-- a binary search tree, which again is recursively defined, each node conceptually identical to, if numerically different from, the others-- allows us to apply this recursive algorithm to that particular data structure. Now, let's look at a more concrete incarnation of trees, one that allows us to do something pretty neat and pretty real-world. Indeed, this is another problem born of a real-world domain: compression. We talked a couple weeks ago about encryption, the art of concealing or scrambling information. Compression, meanwhile, is the art of taking something that's this big and making it smaller, ideally without losing any information. It's pretty easy to take a 10-page essay that was supposed to be a five-page essay and just remove paragraphs or sentences from it, but that changes the meaning of the paper and makes it a worse paper, even though you're compressing it by making it smaller. No, what most students would typically do, if you've written 10 pages and it needs to fit into five, is really, really shrink the font size or the margins. Or, maybe more realistically, you write a five-page paper that's supposed to be a 10-page paper, and so you increase the font size or increase the margins so as to expand, or decompress, the essay. So, similarly here: what if we wanted to compress text, but losslessly, in a way that we don't lose any information by just throwing away characters, or paragraphs, or pages? We want to use the system with which we're familiar from week zero. So, ASCII, again, is just this code, this mapping of letters to numbers. Capital A is 65, and that's some pattern of 8 bits-- 7 historically, but really 8 bits in practice. So, every one of the characters in the English alphabet, at least here, takes up 8 bits. Now, that sounds fine.
That allows us to express as many as 256 possible characters, which is more than enough for English characters plus some punctuation and so forth. But it seems wasteful. I type A, E, and I, maybe O and U, pretty often-- I use the vowels often. B and D, I feel like I use those a lot. I don't really type Q all that much, or Z all that much. So, there are certain letters that I just don't type that often, and indeed, if we analyzed a dictionary, we probably wouldn't see them as frequently as other letters. Indeed, if you've ever played or watched Wheel of Fortune, certainly all the contestants on that show know which are the most popular letters in English words. And it seems silly, and perhaps inefficient-- certainly for a computer scientist-- that we are not somehow embracing the fact that some letters are more commonly used than others, and yet we are just blindly using 8 bits, the same amount of memory, for every darn letter in our alphabet. Why? If you keep writing a certain letter again and again, why not use fewer bits for the more popular letters and more bits for the less popular letters, so that at least you're optimizing for the common case, so to speak? Well, it turns out that someone named Huffman, years ago, did figure this out and introduced what's generally known as Huffman coding. And, at first glance, it's a little similar in spirit to something some of you might have grown up learning a little something about, called Morse code, but it's better in a couple of ways. Morse code is typically transmitted with electrical or audible signals. It has dots and dashes, where a dot is a quick beep and a dash is a slightly longer beep, and you can use series of dots and dashes, as per this chart here, to represent letters of the alphabet and some numbers. The one problem, though, as efficient as this seems-- and by efficient, I mean, look at E. Mr. Morse realized that E is super popular, so he used literally the shortest symbol for it: just a dot, a simple blip, to represent an E. And meanwhile, as I kind of imagined, Z is not that common, so dash, dash, dot, dot is longer than just a single dot. And Y, maybe even less popular still, is longer than E-- so we optimized, giving the shorter symbols to the more popular characters. Unfortunately, suppose that you receive the message dot, dot, dot, dot, dot, dot-- six dots in a row, though I technically paused in between them. Six dots-- what message did I just send you? I wanted to say hi, so I sent dot, dot, dot, dot, which is H, and then dot, dot, which is I. I should not have paused between them, because the whole point of Morse code is to do this as quickly as possible, even though you probably do want to pause to resolve ambiguity-- and indeed, that's the problem. I wanted to send you hi, H-I, but maybe I just sent you E, E, E, E, E, E, six Es in a row, because those, too, were just dots. So, in other words, Morse code is not immediately decodable when you're hearing or seeing the dots and dashes come over the wire, so to speak, because there are these ambiguities, unless the transmitter does indeed pause, as I accidentally did there, to give you a moment to catch your breath and realize, oh, that was an H, and that's an I, as opposed to E, E, E, E, E, E. So, it's not necessarily the best system insofar as some letters share prefixes with other letters.
In other words, I, dot dot, has a common prefix with E: both of them start with a single dot. It just so happens that I is a little longer, and that can lead potentially to ambiguity, and it certainly means that the transmitter should probably slow down. So, the whole system is meant to be super fast, super efficient, but you probably should pause between certain letters so that the recipient doesn't get confused as to the message you're actually sending. Well, thankfully, Huffman coding-- which, as we'll see in a moment, is based on trees-- does not have that ambiguity. It is immediately decodable. And suppose, for the sake of discussion, as per this example here, you just have a whole bunch of text that you want to transmit. This text is meaningless-- there's no pattern in these As, Bs, Cs, Ds, and Es-- but if you go through and count them up, each of these letters occurs with some frequency in this text. So, it's meant to be representative of an essay, or a message, or whatever it is you want to send to someone. Indeed, if you count up all of the As, Bs, Cs, Ds, and Es and divide by the total number of letters, it turns out that 20% of the characters in that random string are As, 10% are Bs, 10% are Cs, 15% are Ds, and 45% are Es-- so it's roughly consistent with what I'm claiming, which is that E is pretty popular. So, intuitively, it would be really nice if I had an algorithm that came up with some representation of bits that's not just 8 bits for every darn letter, but rather a few bits for the popular letters and more bits for the less popular letters, so that I optimize, again, for the common case, so to speak. So, by this logic, E, hopefully, should have a pretty short encoding in binary, and A, and B, and C, and D should have slightly longer encodings, so that, again, if I'm using E a lot, I send as few bits as possible. But I need this algorithm to be repeatable. I don't want to just arbitrarily come up with something and then have to tell you in advance that, hey, we're using this David Malan system for binary. We want an algorithmic process here. And what's nice about trees is that they're one way of seeing and solving exactly that. So, Huffman proposed this. If you have a forest of nodes, so to speak-- a whole bunch of trees, each of size one, no children-- think of them as each having a weight, or a frequency. So, I've drawn five circles here, per this snippet from a popular textbook, that have 10%, 10%, 15%, 20%, and 45% respectively in each of those nodes. And I've labeled the leaves B, C, D, A, E, deliberately in that order from left to right, because it will make my tree look prettier; technically the lines could cross, and it's not a big deal in reality. We just need to be consistent. So, Huffman proposed this: in order to figure out the so-called Huffman tree for this particular text-- in order to figure out what to encode its letters as with zeros and ones-- go ahead and take the two smallest nodes and combine them with a new root node. In other words, B and C were both 10%. Those are the smallest nodes. Let's go ahead and combine them with a new root node and add together their weights, so 10% plus 10% is 20%. And then, arbitrarily but consistently, label the left child's edge, or arrow, as a 0 and the right child's edge as a 1. Meanwhile, repeat. So, now look for the two smallest nodes. And I see a 20%-- but ignore the children now. Only look at the roots of these trees. There are now four trees: one of which has children, three of which don't.
So, now look at the smallest roots, and you can go left to right here. There's a 20%, there's a 15%, there's a 20%, and a 45%. So, there's a tie, and I'm not sure which 20% to go with; you just have to come up with some rule and be consistent. I'm going to go with the ones on the left, and so I'm going to combine the 20% on the left with the 15%. Combine their weights-- that gives me 35% in a new root-- and again label the left branch 0 and the right branch 1. Now, it's not ambiguous. Let's combine 35% and 20% with a new root that's 55%. Call its left branch 0, its right branch 1. And now combine 55% and 45% with a new root-- 100%-- whose left branch is 0 and whose right branch is 1. So, why did I just do this? Well, now I have built up the so-called Huffman tree for this input text, and Huffman proposed the following: to figure out what patterns of zeros and ones to use to represent A, B, C, D, and E, simply follow the paths from the root to each of those leaves. So, what is the encoding for A? Start at the root and then look for A-- 0, 1. So, 0, 1 shall be my binary encoding for A in this world. How about B? 0, 0, 0, 0 shall be my binary encoding for B. How about C? 0, 0, 0, 1 shall be my encoding for C. How about D? 0, 0, 1. And, beautifully, how about E? Just 1. So, to summarize, what has just happened? E was the most popular letter; B and C were the least popular letters. And if we summarize these, you'll see that, indeed, B and C got pretty long encodings, but E got the shortest encoding. And it turns out, mathematically, you will now have a system for encoding letters of the alphabet that is optimal-- that is, you will use as few bits as possible, because you are biasing things toward short representations for popular letters and longer representations for less popular letters. And mathematically, this gives you the most efficient encoding for the original text without losing any information. In other words, now, if Huffman wanted to send a secret message to someone in class or over the internet, he and that recipient simply have to agree on this scheme in advance and then use these encodings to transmit those messages. Because when someone receives 0, 1, or 0, 0, 0, 0, or 0, 0, 0, 1 from Mr. Huffman, they can use that same look-up table, if you will, and say, oh, he just sent me an A, or, oh, he just sent me a B or a C. So, you have to know what tree Huffman built up. And, indeed, what typically happens in actual computers is, when you use Huffman coding to compress some body of text like we just have here, you store the compressed text-- your As, Bs, Cs, Ds, Es, and other letters using these new encodings-- but you somehow have to embed in that file, in the compressed file, the tree itself, or this cheat sheet of encodings. So, with compression-- maybe you're compressing a Microsoft Word file, or a dot TXT file, or any other type of file-- you have to store not just the compressed text, using these shorter representations-- not 8-bit ASCII-- but also, somewhere, maybe at the beginning of the file or at the end of the file, somewhere where someone else can find it, this mapping, or the tree itself, in some digital form. And so, it's possible, by this logic, that you might try to compress a really small file and that file could actually become bigger, because you're storing a tree inside the file with which to recover the original information.
Or, better yet, most actual compression programs will realize, wait a minute, if compressing this file is actually going to make it bigger, let's just not compress it at all and leave it alone, untouched. So, what if you take a compressed file and compress it again, and compress it again, and compress it again? A dangerous line of thinking is, well, I could just keep compressing that file again, and again, and again, and maybe compress my big essay, or my big video file, or big music file down to just one bit. Right? That's the logical extreme: just keep compressing, compressing, compressing. But, of course, that can't possibly make sense, because if you compress some file down to just a single bit, 0 or 1, you've clearly thrown away information and can't possibly recover it all. So, at some point, you hit a lower bound on the size of the file, beyond which you would need to start throwing actual information away. At some point, the file just has so much entropy-- appears to be so random-- that there really is no pattern left to leverage to compress it further. And so, there generally is some maximum amount of compression you can apply to something. So, how would we represent this? Let's whip out a C struct here. This time, each of the nodes in a Huffman tree needs a little something different. They need, at least in the leaves, some kind of character to remember the symbol. Now, technically, only the leaves need to know what symbols they are, so it's a little redundant to have this in every node, but we can keep things simple and use the same type of node for everything. Float frequency-- I could use an integer and treat it exactly as a percentage, or I can use a float, as the nodes were drawn, with 0.1, and 0.45, and so forth, and I'll call that frequency. And then each of those nodes needs, potentially, a left child and a right child. And, again, I'll call these things nodes. So, this node is getting a little more involved, but it still allows me to represent it ultimately in C.
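Per that description, the struct might look like this sketch:

    typedef struct node
    {
        char symbol;            // the letter; meaningful only in leaves
        float frequency;        // e.g., 0.45 for E
        struct node *left;      // the 0 branch
        struct node *right;     // the 1 branch
    }
    node;

And now, it's time to pursue, lastly, the holy grail of data structures, if you will. Thus far, we've been solving problems, creating new problems in the process, and trying to solve those in turn. And the problems we've been exploring this week are things like dynamism-- wanting to be able to grow or shrink our data structure. Malloc and pointers give us that flexibility but might cost us a bit more time, because we have to keep things sorted differently or we have to follow all of those pointers. And so, a lot of the operations we've been discussing today have linear running times-- searching, inserting, or deleting, potentially, as in a linked list. Better still would be something logarithmic, like a balanced binary search tree, still preserving that nice binary aspect from week zero. But the holy grail of a data structure is operations in big O of 1, so to speak-- constant time. If you are searching, or inserting, or deleting, and somehow changing a data structure, wouldn't it be amazing if every darn operation took just one step, or maybe two steps, or three steps-- a constant number of steps? Now, it might be a little naive for us to expect that we can store an arbitrary amount of data in some fancy way such that we get constant time, but maybe, just maybe, if we're clever, we can get close to that. So, let's introduce a step toward that. It turns out there exist in this world things called hash tables.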
And a hash table can be implemented in any number of ways, but you can think of it really as just an array. So, for instance, this might be a way of representing a hash table called table, whose first location is bracket 0 and whose last location is bracket n minus 1, for however long it is. And I've just left the contents as blanks-- it could store numbers, names, letters, anything we want. But a hash table has this nice theoretical property that, if well designed and thought through, you can maybe, just maybe, get constant lookup time out of it. And let's do a simple example of a hash table. Hash tables are often nicely thought of as buckets, so we borrowed these from the loading dock outside just a moment ago, and we've attached, thanks to Arturo, some signs to them. This is going to be Z, so I'll just put this over here. This is going to be C, so I'll put this over here, and B here, and A. And we thought we might get chased away by the folks on the loading dock, so we didn't bother getting D through Y, so we'll just pretend that we have 26 such buckets here. And suppose that the goal at hand is-- say it's the end of an exam, and we've got the old blue books that students in some class might use for writing essays. And it's time for the students to come submit their blue books. Now, we could just collect them all and make a big mess, as would generally be the case, or we can be a little more methodical, to at least make our jobs easier. Now, at the end of the day, what's going to be interesting about hash tables is that there's going to be this distinction between actual benefits and theoretical benefits, or lack thereof. So, we'll come to that in just a moment, but here are A, B, C, D, and Z. And you know what? There are so many people in the room after an exam, I just want them to at least make my life 1/26 as difficult by putting all the As over there, all the Bs here, all the Cs here, and all the Zs here, so that I don't have a massive mountain of As through Zs that I have to sift through individually. It would just be nice if they do the first pass of bucketizing the values based on the first letter of their last names. In other words, my hash function-- my algorithm-- is going to be for each student to consider his or her last name, look at the first letter therein, and put his or her exam in the appropriate bucket. So, here is, for instance, someone with the letter C. I'm going to put that blue book in here. Here's someone with the letter A-- that one's going to go here. Letter Z? This one's going to go over here. Letter B? This is going to go over here. C, and B, and Z, and all the letters of the alphabet in between. So, hashing really has this visual and conceptual equivalence of putting something in this bucket, putting something in that bucket, and ultimately bucketizing all of your elements. And you can think of this, frankly, as just an array-- but it's not an array with just one spot per location. It looks like I can stack multiple numbers, or multiple blue books, at each location. So, we're going to have to come back to that, because this clearly can't be a simple array-- normally, a location in an array is full the moment you put one value in it. But this hashing is the interesting part.
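That first-letter bucketizing is easy to express as a tiny hash function in C-- a sketch of my own, assuming alphabetic names:

    #include <ctype.h>

    // Map a last name to a bucket, 0 through 25, by its first letter.
    unsigned int hash(const char *name)
    {
        return toupper((unsigned char) name[0]) - 'A';
    }

So, hash("Alice") yields 0 and hash("Zoe") yields 25-- deterministic, repeatable, and constant time.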
The juicy ingredient today is this: if I take into account, as input, what it is I'm trying to store and use some piece of that information to decide where to put it, that's an algorithm, because I can repeat that process, so long as it's not random. You go over here, you go over here. That's amazing. Wow-- OK, pushing my luck. So, I'm not just randomly putting things places. I'm actually giving some thought as to where I'm putting things, and that makes the algorithm deterministic, repeatable, predictable, so that if you insert something now, you can absolutely find it, if present, later. Unfortunately, if our hash table does look like this-- just a simple array from bracket 0 to bracket n minus 1, dot, dot, dot in between, an array of integers or strings or whatever-- once you put something here, or here, or here, that's it. There is no more room to put another element there, however wide I might have drawn this table. If there's an int there, that's it. So, what could you do? Suppose that you do have an array structure like this, and that limitation is unacceptable. Maybe this table is where you're supposed to take attendance, putting in people's names. So, you say, oh, Alice is here today. Let me go ahead and hash on Alice's name and put her where the As should go. Oh, Zoe is here, Z-O-E, so we'll put her down there. And then who else? Alex is here. Dammit, Alex-- no room for you in our hash table, because Alice is already there. This is stupid. If we have data we want to insert into this data structure, it would seem that I have 24 available spots into which I could put Alex, and yet I'm just stubbornly trying to put him where only the As belong. So, in this kind of scenario, I need to put Alex somewhere, and I clearly have space. You know what? Let me just probe the array, looking for the first available spot. OK, Alex, you're just going to go here. And if someone else, like Erin, appears, fine-- you're just going to go over here. So, you try to put the A names where you want them to go, but if there's already someone there, you probe deeper into the data structure, looking for the first available slot. This is a general technique in programming called linear probing, whereby, if you hash to some location-- like the letter A-- and there's a collision, something already there, you probe further in the data structure, looking for some place you can put it. So, you get close to constant-time decision-making: put A here, put Z here. And because this is an array, you have random access with your square bracket notation. But if you have lots of As and not too many Zs, or Bs, or Ds, it's possible this approach could devolve back into linear time. So, in the ideal, where we have one A, one B, one Z, and everything in between, that's constant time. We have our holy grail, constant-time operations for a data structure-- but not if we want to support insertion of other elements, even those that hash to the same location.
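Here is what linear probing might look like in C-- a sketch, reusing the hash function from above, with a table of string pointers of my own devising:

    #include <stdbool.h>
    #include <stddef.h>

    #define TABLE_SIZE 26

    const char *table[TABLE_SIZE];   // NULL means the slot is free

    bool insert(const char *name)
    {
        unsigned int start = hash(name);
        for (int i = 0; i < TABLE_SIZE; i++)
        {
            // Start at the hashed location and scan forward, wrapping around.
            unsigned int index = (start + i) % TABLE_SIZE;
            if (table[index] == NULL)
            {
                table[index] = name;
                return true;
            }
        }
        return false;   // the table is completely full
    }

So, what's the fix, more generally? Well, if the problem is that we've already used this space for Alice, and we need to put someone else here, why don't we just create some more space dynamically? We have malloc now. We have dynamic memory allocation. Why don't we just extend our data structure laterally, horizontally-- artistically here-- so that, yes, you try to go to that first location.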
But if there are multiple people that are meant to go there-- multiple values-- go ahead and just link them together, thereby merging the idea of a hash table and a linked list in a data structure that might look like this. So, this is an example, somewhat arbitrary, of the 31 days of a month. And if you actually hash on people's birth dates, as I think this author did, you can think of your hash table still as an array. But that array does not store strings, and it does not store integers. It only stores pointers, 31 in this case-- some of which might be null, per the diagonal slash-- and those pointers in turn point to the beginnings of linked lists. So, if multiple people were born on the fourth of some month, you would put J. Adams and W. Floyd in a linked list at that location. If Aaron, and Alex, and Alice, and other students with names starting with A all belong at that first location in my previous table, that's fine. Just string them together with a linked list. Much like with these buckets, at the end of the day, I'm still creating piles, and I still have to go through them all, ultimately. But each of these piles is 1/26 the size of what it would have been if everyone just came up at the end of the exam and piled all their books in one heap. So, whereas these data structures are still devolving, if you will, into linear-time operations in the worst case-- if these lists just get really long and stringy-- at least in actuality they might be as good as 1/31 or 1/26 as long.
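This merger of hash table and linked list-- often called chaining-- might be sketched in C as follows; the struct and names here are mine, for illustration:

    #include <stdbool.h>
    #include <stdlib.h>

    typedef struct person
    {
        const char *name;
        struct person *next;
    }
    person;

    person *table[26];   // 26 list heads, initially all NULL

    // Prepend to the appropriate list: constant time per insertion,
    // using the hash function sketched earlier.
    bool insert(const char *name)
    {
        person *p = malloc(sizeof(person));
        if (p == NULL)
        {
            return false;
        }
        p->name = name;
        unsigned int index = hash(name);
        p->next = table[index];
        table[index] = p;
        return true;
    }

And so, now there's this dichotomy, in this week five, between asymptotic running time-- the theoretical running time that we've really been belaboring-- and actual running time. Just because something is n squared does not mean it's bad. If there are only a few elements, n squared is great; it's going to happen super fast if your computer is 1 gigahertz, or 2 gigahertz, or faster these days. N squared in and of itself isn't bad. It just gets really bad when your data gets large. And in practice, even n squared divided by 2 is actually better than n squared. So, a couple weeks ago, when I was saying don't worry about the lower-order terms or the constants-- focus only on n squared, not on n or on anything you're dividing by-- that's fine theoretically, but in actuality you're going to feel that kind of difference. So, here's one last data structure, this one called a trie-- T-R-I-E, short, somehow, for retrieval, but pronounced try. And this one is cool, because it's really like a weird offspring of today's data structures: it's a tree, each of whose nodes is an array. And a trie is really good for storing words, like the words in a dictionary. Indeed, one of the problems ahead of you in CS50 is going to be to implement a spell-checker, which effectively means building a dictionary in memory, and you'll be challenged to spell-check words as fast as you can, storing as many as 100,000 English words somehow in your computer's memory and answering questions of the form, is this a word, is this a word, is this a word? That's, after all, what spell-checking is. So, a trie is kind of interesting in that-- and this is an excerpt of an artist's rendition thereof-- the root node here is this rectangle, and that of course looks like an array. And notice what's implicit in this.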
If this is location A and this is location Z, the author here has just decided to show you only those letters that matter for the sake of discussion. But the fact that the M location here is not blank means there's a pointer there. Indeed, what are these arrays? They are arrays of pointers to other nodes. So, M is not null, and it leads to this node; and notice that A is not null, and it leads to this node, and then this node, and then this node. And this is where the artist is just taking some liberties: this tree would be monstrously wide, because all of these arrays are so darn wide, so he or she is just showing you the elements that we care about-- M, A, X, W, E, L, L-- and then some special sentinel symbol, delta, though it could be anything. This is how, using a trie, a programmer could store the name Maxwell, M-A-X-W-E-L-L: by simply leaving little bread crumbs, if you will, from one node to another, such that each of those elements in the array is a pointer to another node. And if you keep following these pointers, following the bread crumbs, you eventually find yourself at this special sentinel value-- and actually, it wouldn't be null; it would be like a Boolean saying true, this is a word. Just by storing a single yes or no at this location way down here, you implicitly reveal that M-A-X-W-E-L-L was in fact a word. Let's follow another. Turing: T-U-R-I-N-G, check, Boolean true. Turing is in this dictionary as well. So, if the bread crumbs lead to null, that word is not in here. Apparently there are no names starting with A through L, none from U through Z, and none for some of the letters in between, because those pointers are implicitly, and pictorially, null. But let's consider, then: what is the running time of inserting or looking up a name in a trie? Thus far, pretty much all of the data structures we've talked about have had pretty slow running times, linear in the worst case. So, if we used an array to store people's names, or we used a linked list to store people's names, in the worst case we had linear running time, unless maybe we sorted things, but even that costs us some time. So, linear, or maybe logarithmic, was the best we could do. And even with a hash table, whereby maybe we store Maxwell at the M location in our table, he might still be in a linked list with a whole bunch of other M people. That, again, can devolve into something linear-- a linear linked list. But what about a trie? To answer the question, is Maxwell in the trie, what do we do? We start at the root, and we follow the pointer that represents M, and then we follow the pointer there that represents A, then X, W, E, L, L, and we look, at the end of that series of steps, for a true/false value. And if it's true, yes, Maxwell is here. What about Turing? Well, we start at the pointer that represents T, then U, R, I, N, G, then check. Oh, true-- Turing is in there. Let's look for David. No, false-- there's not even a pointer there. David is not in this dictionary. So, how many steps did each of those take? To tell whether Maxwell was in the dictionary was M-A-X-W-E-L-L and then a look at the Boolean, so that was eight steps. And to look up Turing was T-U-R-I-N-G and then that Boolean-- seven steps. Those numbers have nothing to do with how many words are already in the trie.
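In C, a trie node and its lookup might be sketched like this-- a simplified version of my own, assuming purely alphabetic words (a real spell-checker might add a 27th slot for apostrophes):

    #include <stdbool.h>
    #include <stddef.h>
    #include <ctype.h>

    typedef struct node
    {
        bool is_word;               // the Boolean way down at the end
        struct node *children[26];  // one pointer per letter, A through Z
    }
    node;

    // Follow one pointer per letter; the cost depends only on the
    // word's length, not on how many words the trie already holds.
    bool check(const node *root, const char *word)
    {
        const node *cursor = root;
        for (int i = 0; word[i] != '\0'; i++)
        {
            int index = tolower((unsigned char) word[i]) - 'a';
            if (cursor->children[index] == NULL)
            {
                return false;   // the bread crumbs run out
            }
            cursor = cursor->children[index];
        }
        return cursor->is_word;
    }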
There are only a dozen or so words here, but there might be thousands of actual words in this dictionary, and we're still going to find Alan Turing by way of T-U-R-I-N-G plus a Boolean-- seven steps-- and Maxwell by way of M-A-X-W-E-L-L plus a Boolean-- eight steps. It doesn't matter how many other data elements are in this trie. And that's what's powerful, because there is an upper bound on the number of letters in an English word-- which is kind of true. I rarely type words that are longer than, I don't know, 10 characters, 15 characters. Longer words might exist, but no one actually says or types them. The maximum length of a word in English is surely some constant, because there is one word that's the longest. That's a constant value, which means inserting a name, or searching for a name, or removing a name from a trie does depend on the length of the name, but it does not depend on how many pieces of data are already in the data structure. And as such, it is constant time. So, now, in C, we have a whole bunch of new syntax with which to represent data structures-- namely, actual structs-- and we have pointers, and we have malloc, with which we can build more interesting shapes, if you will, in memory. And we now have a number of abstract data types, and actual data structures we can build using these ingredients, with which we can now solve problems that are going to demand all the more resources-- all the more time, all the more space-- in which case efficiency and good design are going to be ever more important. All this and more next time. AUDIENCE: She thought she was doing the right thing. [AUDIO PLAYBACK] -Tell me more. -David was sure it had to be the muppet, something called muppet mode, but the pressure was too much. -This is Mario in muppet mode. Take 23. [HONKING] -What's happening? I thought this is what you always wanted-- to star in the walkthrough videos, to have all of YouTube's eyes watching you. [HONKING] -Yes, you are. You have to be. Now stand up straight, tuck in your shirt, look into the camera! Take it again from the top. [END PLAYBACK]