Subtitles section Play video Print subtitles [MUSIC PLAYING] DAVID MALAN: All right. This is CS50 and this is lecture five. And you'll recall perhaps that in this most current week, you've probably been tackling a little something like this. And if you don't actually have this experience from childhood, what we're referring to in this problem set is these kind of glasses that aren't 3-D glasses because it's both red eyes or sometimes it's just a big piece of plastic. But with this can you actually look up and see what the answer is supposed to be. And so this is the allusion to which we're referring and the goal ultimately in problem set five's who done it is to actually figure out how to implement that kind of red filter. But to do that, you first have to understand this thing, which at first glance, admittedly, looks pretty complicated. But if you dived into the problem already, you'll probably have wrapped your mind around at least a few of these fields like the size of the image or the width of the image and the height of the image, which should be a little more reasonably straightforward. But to implement this, you've had to deal with something called a struct or a structure. And so in C, we have this feature, recall. And we didn't really play with this that much last time. But you've seen it now in forensics, or you soon will. And here we have the definition of a student. So when C was invented decades ago, they didn't foresee the need for a student data type. They had int and char and float and double. But we can invent our own data types much like in Scratch. We can make our own puzzle pieces as follows. Typedef to define a type, struct to say here comes a structure. And what is the structure known as student? Well in this case, I arbitrarily decided that a student would just have a name and a dorm and both of those would be strains. And you can imagine putting other things in there like ID numbers, phone numbers, email addresses, or whatnot so that you can actually combine all of this information together. So let's just take a quick look at how you might use code like this. Here is a file called struct.h. It's common, but not necessary to declare your structures inside of a file that also starts with .h so that we can share it across multiple programs just like with other libraries' header files. And here I've taken those training wheels off as before where, string is actually just a white lie for char star. But this is really the same data structure and it's in a file called struct.h. So let's take a quick look now at a program that actually uses this in struct0.C. So let's take a look at what we've done here. In struct0.C we have some header files up top. But we also include this header file so that we have access to this new custom data type. And then in main we do a few things. We first go ahead and ask the user for an integer called enrollment. So hopefully they'll give us a positive number. If we then do get back a number as expected in line 13, what do we do in English here? How would you just describe what line 13 is doing at this point in the term? Anyone, yeah? AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, give us an array of students of size enrollment. So even though on line 11 and prior we didn't know how many students we needed, we did get on line 12 that answer, the enrollment. And so on line 13 we declare an array using a variable saying, give me this many elements, this many students in my array to store things. And then we proceed, in the lines below, as follows. We start iterating over the enrollment from zero on up to enrollment. And we prompt the user on each iteration for a student's name and dorm. And the right hand side of those two lines of code is pretty familiar. You're just calling, get string. But on the left hand side, we do have a slightly new piece of syntax. We have students bracket I, which gives you the i-th students in the array. But what piece of syntax perhaps jumps out at you? Especially if you've never programmed before? And we've not used this symbol just yet in this context. What looks different? Yeah. AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, so the .name. So you can probably infer that .name and .dorm is somehow accessing the student's name and the student's dorm. And that's literally all that's happening. This dot operator tells the computer, go inside of the student's structure at that i-th location and store the string that's coming back from get string in that variable. And similarly, do the same for dorm. So it's like taking a struct and then looking inside of it and going very specifically to one of the elements therein. We've never needed this .operator before because in the past, any array, any variable we've had has just been a string or an int or float. We haven't had anything to dive deeper into. So that's all that's going on there. We've encapsulated, so to speak, inside of a student structure, name and dorm. And then this last part is actually just a printing out of that same information. We're just printing out, so-and-so is in such-and-such a dorm by passing in those two strings using our familiar percent S. Now this program at the end of the day is kind of pointless because you prompt the user for some number of students' names and dorms, you print them out, and then you throw them away forever. And that's not all that useful of a program long term. And so we have in our second version of this program, struct 1.C a new trick, too. That's a teaser as well for the direction we're going in this problem set, next problem set, and beyond, where we're actually using files where files on a computer are just a whole bunch of bits, zeros and ones. Those zeroes and ones follow some pattern. But we have yet to see a mechanism for actually saving files. But here's how we can do it. So above this line here, 21 and above, same program. Just get a bunch of students from the user, per their name and dorm. Then here line 24 we see something that's a little new, though you have seen this in the forensics problem so far. We call a function called F open, meaning file open. That takes two arguments, according to its documentation. The name of the file you want open and then the second argument is how you want to open the file. And even if you've never seen this before, what might the W there represent? AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, right. So it's read and write are two of the most common operations for a computer. R would be read, W would be write. And this kind of makes sense if the goal now is to save these students to a file so that the program is actually useful if you run it again and again. So here we have a new data type on the left. It's all caps, which is a bit of an anomaly, even in C. But file, star, file just says, hey, give me a variable of type file that can store the address of a file, so to speak. And technically, that's not the address of the file on disk. That's the address in RAM once you've opened the file. But for now, just assume that this is an abstraction for the file. And it's just called literally file. So what is line 25's purpose in life? Even though we've never written this code before now. Yeah, what do you think? AUDIENCE: If the file exists. DAVID MALAN: If the file exists, or more generally, if the file was successfully opened. Because there could be bunches of things that go wrong. One, as you've implied, the file might not exist. Two, maybe you don't have permission so it exists but you can't open it. Three, maybe there's not enough memory in the computer to open yet one more file. So any number of things could go wrong. And recall, what is the special sentinel value that typically represents errors now that we're in the world of pointers? Null, so N-U-L-L in all caps is sort of the opposite of having a valid address. It means no such address. So we use it to check for errors. So that's all line 25 is doing. And then the rest of this is almost identical to the previous program, except we're not using print F. What are we apparently using? F print F and take a guess as to what the F in print F stands for? Yeah, so file print F. So it works almost the same except it takes one more argument. The very first argument now is not a format string like it's been for ages now. It's instead a variable that references an open file and then the rest of this is all the same. So what's really cool about F print F is that you don't just print to the screen. You actually print to whatever file you've opened so that then, on the very last line here, when I call F close, that's equivalent to saving the file. And then the program just quits. And so what's neat about this in the end is if I go ahead and scroll up here, and let me go into my source five directory. Let me go ahead and make struct one. Structs.h not found. What did I do? Well, I just screwed up before class and misnamed this file so that didn't just happen. So now I've compiled the program. All right, so now let me go ahead and run this ./struct1 enrollment will be three. And we'll say that it'll be Maria who is in Cabot House and Brian who is in Winthrop. And say David was, say, in Mather. Enter and nothing seems to happen. But if I type LS now and look inside this directory, notice that I have a file called students.csv deliberately named because if you've ever used Excel or Numbers, a very common file format is what's called the CSV, comma separated values format. And this is sort of like a very simplistic database. If I open this you'll see that indeed, the contents of this file are separated by commas. And if I were to actually open this file up in Excel, each of these columns would open up visually in exactly that. So what I did with my printdef, if I go back to structs1.c, notice as I consciously included that comma there, to create this sort of faux database format. And just for good measure, let me see if I go to download from the IDE's file manager and I go ahead and open up students.csv. And then if the program cooperates here, we have Microsoft Excel. And now I've made myself a tiny little spreadsheet. Now using c-code. Now we're going to find pretty quickly that this is not all that useful to make CSV files. Because the more and more rows we add to these files, the slower and slower it's going to get to search them. And so before long, as we transition next week and be onto web programming, we're actually going to replace spreadsheets or CSVs like this and actually replace them with something more powerful, namely databases. So that's a teaser then of what's to come. But where did we begin this conversation? It all kind of keeps coming back to what's inside of our computer, which we can continue abstracting away. You don't have to understand how this hardware works. But we previously had said that you can at least think about chopping up your computer's memory into a grid so that you can just number of the bytes. So that you have specific locations otherwise known as addresses or pointers. Last time we clarified that not all memory is treated equal inside of the computer. Rather, different chunks of memory are used differently. So the top portion, so to speak, but there is no notion of top in reality. This is just an artist's rendition. So the top of your computer's memory might be the heap, whereby you store certain types of values and then down here is the so-called stack where you store other types of values. And if we zoom out there's actually different layers still of memory. So let's actually tease apart what's going on here. If, when you run a program, you have access to a gigabyte of RAM or two gigabytes, and indeed, that's what your Mac or PC does. No matter how much RAM you have, the computer typically gives you the illusion of having access to all of it. And so this might be two gigabytes, then, of memory. Well, one of the first things that happens is that the zeros and ones that compose your program, whether it's called A.out or Caesar or Vigenere or Structs One, those zeros and ones are loaded way up top here in your computer's memory. So the text segment in memory is a weird name for the zeros and ones of your actual program. It's not ASCII text. It's like literally zeros and ones of your compiled program. Below that are what are generally called initialized data or uninitialized data. And this essentially just means any global variables you have in your program are stored here or here. If you gave them values at the top of your program, they're initialized by definition. And if you didn't, they're uninitialized. So the compiler just kind of lays those out a little bit differently. At the very bottom are something called environment variables, which we don't use too much but you use them in a few weeks for web programming. You'll often store things like user names and passwords or other values that you don't want to save in your code. But you want the Mac or PC or server to have somehow access to. But these are the ones we'll talk about the most, stack and heap. And we saw a couple of examples of each of these, though, briefly. What did we use the stack for or claim it's used for last time? AUDIENCE: [INAUDIBLE] DAVID MALAN: Say again? AUDIENCE: [INAUDIBLE] DAVID MALAN: Int min void, and more generally, functions when they are called. Main is, of course, the go to one for all programs' beginnings. But it might call another function like swap. And swap itself might call something else, maybe printdef gets called. So every time you call a function, it gets a slice of or a frame of memory. And they go up and up and up as those functions get called. And this was ultimately illuminating, at least theoretically, as to why this program was broken. The program we looked at last time, the swap and no swap programs, we claim that this implementation was wrong. And yet I think when Kate came up and we did the example with switching the Gatorade flavors, this is pretty much an interpretation of that into code. And it's correct in one sense and it's incorrect in another. In what sense was this code actually correct? In the no swap program. Because we did walk through it briefly with the debugger. AUDIENCE: [INAUDIBLE] DAVID MALAN: OK, we needed temporary container or variable in which to store one of the values or one of the Gatorade flavors. And by the time we got to this third and final line in this function, what could you say about A and B? AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, they had in fact been swapped. And I saw that, I think, by plugging in, I think I struggled with the debugger so I used E print def at the last minute just to see what had happened after that very third line. So it works. This function does swap A and B. But it did not swap what? AUDIENCE: [INAUDIBLE] DAVID MALAN: X and Y, which were the variables in main. And so recall the story we told last time, was that if we focus only on your computer stack, that sort of bottom portion of memory, when main is called, it gets a chunk of memory down here at the bottom because it's the very first function to be called. And it had variables, recall, called X and Y whose values were one and two. When main called swap, the other function we just saw, it had values called A, B, and also temp that initially were one and two. And eventually became two and one. But that picture kind of answers the whole question. The reason X and Y didn't change is because you literally change in that red version, just A and B. So we solved this problem, recall, last time with what new feature of C? AUDIENCE: [INAUDIBLE] DAVID MALAN: Pointers, so addresses. Rather than just hand the function, the values you want to swap, give the function a road map, so to speak, to those values so that the function can go to the values you actually care about and move them wherever they are. And it's a strange looking syntax at first. It looks like multiplication all over the place. But it had two different uses. If you have the star or asterisk up here and a data type like int next to it, this is saying, hey computer. Give me a variable called A. But that's not going to store an int. It's going to store what? AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, the memory address of an int and just as B is going to store the address of another integer. So those are kind of placeholders for the addresses. So down the road is the Science Center. So if the address of the Science Center is 1 Oxford Street, Cambridge, Mass 02138, that is it's postal address and it uniquely identifies that building. Similarly inside of a computer do values have unique addresses. They're just much simpler. They're numbers, they don't have streets and zip codes and all of that. But it's the same exact idea. So here we still have a variable temp of type int. So give me a temporary variable just like Kate had the extra cup that was initially empty. *A, though, without a data type to the left of it, was like saying what? *A in this context is a sort of different English sentence than this one. This means give me a pointer to an int or declare for me a variable that will store the address of an int. But this says, go to that address. So it means A is in address, *A means go to that address so you can get at the value, which probably in the story, is one. *A means the same thing. Again, go to that address. And then, by the way, go to that address B. Whatever that value is, put it where this finger is already pointing. And then *B means go to that address and put whatever was in the temporary cup of Gatorade, or in this case, the value one. So pointers, though kind of a very convoluted way of fixing this solve the problem fundamentally, because now rather than passing one and two, we instead passed in here the address of X and the address of Y that allowed the computer to then go to those locations in memory and actually do what it is we wanted it to do. So long story short, this is how the computer stack is used. When you call a function, it gets a new slice of memory on top of whatever function called it. As soon as that function is done executing, this memory effectively goes away. It's technically still there, because it's hardware. It's not going to physically disappear. But now whatever next function main calls, maybe printdef, maybe something else, it will reuse this memory in any number of ways that it wants for its own local values and parameters. So given that definition, the fact that the stack, kind of like trays in the cafeteria grow up and down and up and down as a program calls functions and those functions return, where do garbage values actually come from? AUDIENCE: [INAUDIBLE] DAVID MALAN: Old things that were in that memory. So I kind of made this sweeping claim in the past that you shouldn't trust variables if you yourself haven't put values in them, because they have so-called garbage values. But that actually has a very precise meaning. That means if you have called a function and it needs a variable that just happens to be here, and this is like a minute into your program's running, a whole bunch of different functions might have used and unused, used and unused that portion of memory. So they're just going to have zeros and ones lingering there in some pattern. The computer could be really defensive and it could just change these all to zeros bits all the time. And that could have been a reasonable design decision, but long story short, C does not do that. It would have just been time consuming, especially years ago when computers were slower. The language was younger. And it just wasn't compelling to do that. So you just get garbage values, which I have typically just written as question marks. But that's why. There are garbage values there because they're your own previous values, or those of some functions. So the heap, meanwhile, was fundamentally different. So the heap is this upper portion of memory that is in some sense conceptually above the stack. And it's up here. And that's different in the sense that it's more for long term storage. The stack is for short term storage, just to use locally when a function is executing. But suppose your program is to run for a while, or suppose you want a function to allocate memory that does stick around and does not just immediately become garbage values. In fact, think about GetString. GetString is a function we wrote and its purpose in life is to get a string from the user, which is a whole bunch of characters. And consider this. If GetString is called and therefore gets a slice of memory on the stack, and I type in Maria's name, M-A-R-I-A and then that gets a secret /0. Where is the M-A-R-I-A and /0 stored? Can it be, by this definition, stored on the stack? Why? AUDIENCE: [INAUDIBLE] DAVID MALAN: Exactly. So if we allocated space on the stack, which we could do with an array, and then return the address of that string, that would be valid in some sense, because you would have allocated the memory. And I'll go ahead and draw it like this. If we had a function, GetString being called and GetString's being called by main, that might mean that GetString has this chunk of memory here. And if Maria or I go ahead and type in her name, that's like allocating space for M-A-R-I-A /0. And you can think of this as just being a whole bunch of bytes in that frame. So they do exist and they literally are stored in memory. And I could return the address of this first byte, whatever this is. Maybe this is byte 10 or 100 or whatnot. And I could return that address to main as GetString's return value. But as soon as I do that, yeah, exactly. The memory doesn't technically go anywhere. But it's no longer trustworthy. All of that is now garbage value. So you might get lucky. And if you try to print GetString's return value you might see Maria but maybe briefly, because the next time you call a line of code or somehow that memory's reused, Maria's name might get overwritten with some other values because her name becomes, by definition, a garbage value. And you don't know when it's actually going to get reused. So that's not safe. So this is why the heap exists. If you need to keep your memory around for a while, like GetString is supposed to do, turns out you can allocate it just elsewhere that won't disappear until you yourself free it. And so that's when we introduced last time a couple of new functions, malloc for memory allocation and it turns out there's an opposite of it, free, which you'll need to use for future problem sets dealing with memory management in order to undo the allocation here. Otherwise you end up having what's called a memory leak. And the computer might slow down, run out of memory, because you're not giving it back. And as an aside, it turns out there's a couple of cousins of malloc, calloc and realloc. Calloc is kind of cool in that the C means clear. So calloc is identical to malloc, but it zeros the entire chunk of memory for you. So if you just want to initialize to have no garbage values whatsoever, you can use calloc instead of doing it yourself with a four loop or something like that. Realloc, we're going to see, is a more powerful function that allows you to take a chunk of memory and somehow grow it. But we'll see what that actually means in a moment. But with this power comes great responsibility. And we saw that things can go horribly wrong for binkie when you misuse memory addresses. And recall that we looked briefly at this program by way of Nick's video from Stanford. And let's see what these lines of code actually represent here. So here and here I'm declaring two variables, X and Y, that are going to store what, generally speaking? Addresses of integers. So that's all that's happening there. This now, was a new line of code last time. Where it's saying, call malloc, so allocate some amount of memory. How much memory do you want? Whatever the size of an int is. Odds are it's going to be four, maybe eight, some power of two or some multiple of two here, or a multiple of four. So here we get back what? A chunk of memory or specifically, its address and we store that in X. Meanwhile this line says, go to that address and put the special number 42 there. This next line blindly says, go to the address in Y and put the unlucky number 13 there. But that's where binkie had an accident because what was inside of Y at that moment in time? AUDIENCE: [INAUDIBLE] DAVID MALAN: Memory it wasn't allowed to touch. And why? Be more precise. AUDIENCE: [INAUDIBLE] DAVID MALAN: So close. That's going explain the symptom ultimately but? AUDIENCE: [INAUDIBLE] DAVID MALAN: Exactly. I didn't ask for any memory whatsoever. So by default, even though this looks funky, int*y just means give me space for an address. So that means, give me four or eight bytes of memory in the computer's RAM. But what is inside of a variable by default, have we claimed? Just garbage values. So there's going to be a number there, and numbers are our addresses. So it's going to look like there's an address there. So it is technically correct to say, go there. But it's like following a map where you have no idea where you're going. You might sort of walk off the edge of wherever you are. And that's when bad things happen. And so the visualization that Nick put together with claymation was this. If you have this and it turns out it doesn't matter if the star is on the left or on the right here. But we have conventionally put it over onto the right side, next to the variable. When you do int*X and int*Y, that's like saying, give me a chunk of memory or clay for each of these variables. And he just kind of circled the little arrowhead there on the string because there's memory for it. It's just not pointing anywhere specific. It's a garbage value at this moment in time. The next chapter in this story was this. Allocate space for an int, drawn in white clay here. And Nick, because of the assignment, said X, which is again is a pointer, is now going to point to that chunk of memory. So it's no longer a garbage value. It points somewhere specific. That is why Nick was then able to say, go there. Follow the arrow, and put the number 42 there. But the next line of code, this one went horribly wrong because Y was not pointing anywhere. Nick tried to say, go there and put 13 but there is nowhere so the computer crashes. A segmentation fault, meaning that you touch the segment of memory that you should not have because you tried to go somewhere where it was just some garbage value. And the fix, recall, might be this, or a solution. If we instead kind of rewind and fix binkie. And say Y equals X, that's not allocating extra space. That's just saying, have Y point at the same chunk of memory as X. Because again, X and Y are just addresses. So if the address is 100 in memory, now X is 100, Y is 100. They're both pointing at the same chunk of white clay. So if he then did *Y, gets 13, that says, go there. Update the number. And now 42 became 13. Very similar in spirit, in fact, to our capitalization example, when we pointed to strings last time, at the same chunk of memory. So any questions on the stack or as depicted here, by binkie, the heap? Malloc allocates memory from the heap. But anytime you declare local variables or arrays inside of a function, that ends up on the stack. And thus far malloc is the only tool, or calloc or realloc, that gives us access to this new portion of memory depicted in white clay here and sort of depicted in our diagram up above. Any questions? Yeah. AUDIENCE: [INAUDIBLE] DAVID MALAN: Good question. So keep in mind that X and Y are both the same data types. They're both pointers to addresses. So as such, if you're going to set one equal to the other, you have to just store the value that's in one inside of the other. Otherwise it would be trying to put an integer inside of a pointer, which isn't quite correct even though they're technically both numbers. So this is just saying, whatever address is in X, put that same address in Y it says nothing about going to that address. It's like making a photocopy of a map but not actually following that map yet. Until this line, which then says, go follow the copy of the map. And it turns out it leads you to the same location. Yeah? AUDIENCE: [INAUDIBLE] DAVID MALAN: So if you return, in C you can only return one value. So you're kind of in a bad spot because you ideally want to return two values, right? Both A and B you want to return B and A. There are ways you can work around that and kind of sort of return two values and see. But in short, it's much easier said than done. Python will actually make that much easier. But the short of it in C is you can't return multiple values. And that ties our hands in this case. OK, so odds are, related to all of this, you've heard about this website, which is enormously useful when you're trying to learn some new language, when you're out in the real world trying to solve some problem because this is this wonderful community of people who post questions and answers and ideally link to canonical references so you can kind of understand why some answer is the way it is. But this actually has a very technical meaning, Stack Overflow. And stack, of course, is now a familiar concept. And you can imagine something like this picture eventually going very wrong if you call so many functions that you just kind of run out of memory. Or not even run out per se. What are you going to hit eventually if you keep calling function after function after function? The heap. And then bad things are going to happen. If your stack frames and your local variables are so numerous that you start overwriting the heap memory, now values that you allocated with malloc might themselves get clobbered. So this is not the best design decision but it is the reality. And it's mitigated only by using a compiler that might help you notice this or by actually using more memory than you actually need. Now, Stack Overflow actually has a very technical meaning here, as does heap overflow. Stack Overflow means you overflow the stack. You just call so many darn functions that you just touch memory you shouldn't. Heap overflow would be the opposite. You keep calling malloc and malloc and malloc and the heap overflows the stack. Because for better or for worse, the stack is growing this way and the heap is growing this way. And eventually they're going to strike each other. So this is a more general way of saying buffer overflows. A buffer it's just a chunk of memory that stores data or values. We people in the real world might have heard of this in the context of video, YouTube, or various video players. If you've ever seen the expression buffering, dot dot dot. It's the most infuriating thing. Something's about to happen in the movie or show you're watching. And the damn thing start buffering, buffering, buffering. What does that mean? Well, the video player, YouTube or something else, has a chunk of memory, which you can think of as an array. And loaded into that array are the zeros and ones that compose the movie or TV show you're watching. And those were downloaded over the internet. And what happens is, hopefully your internet speed is faster than the movie's own playback. So that even though you might be at the minute 10 in the video, hopefully your computer has downloaded minute 10 through 11 so that you have this built up buffer of bytes that you have a whole minute where you can watch, even if your internet connection goes out. But when your video is buffering, it means you have this array of memory and you've kind of looked at or watched all of the bytes in it. And the buffer is now empty. But the opposite can happen, too. If you try downloading more bytes then you have memory for, you might try putting minutes of the video someplace they don't belong in your computer. Or if you call too many functions or if you call malloc too many times, you might overflow the chunk of memory that's been allocated. So buffers are all over the place. And indeed, a string as we know it is just a buffer. It's an array of memory and hopefully you will only put as many characters in that string as can fit in that chunk of memory. So what kinds of things can go wrong? This is a bit of a contrived example, but it comes with a couple of visuals just to paint a picture of how adversaries can hack into systems that are written in languages like C. So here's a quick program. We're going to include string.h. And down here we have int main that takes command line arguments. Notice this function does not do any error checking at all. It's pretty stupid. It just calls a function foo and passes an argv[1]. So the idea here is that this is a program that if you take a command line argument, a word after the program's name, just gets blindly passed into the foo program. OK. So next, what does foo do? It accepts as input a string, a.k.a. char* and we're just calling it bar. And then it allocates an array on the stack called C of size 12. And then even if you've never seen this function before, you can maybe kind of infer from its name, mem copy, like memory copy. So it turns out this is going to copy into this memory whatever is in this memory up to this many bytes. So if I type in Maria as the command line argument, M-A-R-I-A is five. So that means the length I typed in is going to be five. And this is going to copy five bytes from bar into C that's it. Now it's meant to be just a monster. This program is pretty useless at the end of the day. But it's kind of distilled a thread into the fewest lines of code. What does this actually look like or what's happening? We've called the function. We've allocated 12 bytes. We've copied those five bytes into those 12 bytes. So all is well in this story. But what actually happens in memory? So here's a picture of the stack kind of zoomed in and nicely colorized. So stack is going this way. Heap is growing this way. And this is just showing you technically how things are laid out on the stack. I keep kind of simplifying the world by just drawing things as X and Y and A and B. But they actually follow a precise order. So specifically, if we have a local variable called bar, which we did for this function, it goes right there at the bottom. If you then declare an array called C, it goes right up there on top. And these are sized proportional. This is four bytes. This is going to be 12 bytes. So it all is kind of proportional size. And then it turns out, and we won't go into too much detail, but if you like this stuff, CS 61 and other classes will explore it, it turns out another thing that has always been tucked away on the stack secretly is what's called a return address. So when main calls swap, it's like handing the keys to the car off to someone else. Like swap, go do your thing. But main kind of has to tell swap or any function it calls, how to get back to its chunk of memory so that execution can resume with main as soon as swap is done executing. And it's not its stack memory, per se. Recall that top portion of memory that I described as the text segment? All the zeros and ones that compose your program? It turns out that main, when it calls swap or some other function, it tucks its own return address, the address of the appropriate zeros and ones in that text segment, into four bytes here, or maybe eight bytes, the address to which swap should hand the keys back to you, so to speak. Otherwise it's like main handing the keys off to another function and then it never hears from it again so main's other lines of code never get executed. So long story short, there needs to be an address or a map tucked away on the stack so that swap can hand control back to main. But what happens here when you actually use this memory? Well, it turns out that if we just number the bytes on the stack, and that was a size 12, the first one is zero, and the last one is 11. So zero through 11 gives us 12 total. So if we type in something like Maria or maybe more generally, hello, H-E-L-L-O, which is the same length, that's using six bytes technically, because the /0 and all is well. Fits comfortably in C and we've got room to breathe. But what if we don't type in Maria or hello? What if we type in a very long sentence that's more than 12 characters? Where are they going to end up? If you type in a longer string at the command line in argv[1], notice the code is flawed. You're going to check the length of the word that the user typed in, copy all of its bytes from bar into C. But what if the length of the string you typed in is 13? What are you going to do? You're going to copy 13 bytes from bar into C, thereby filling all of these 12 bytes plus one more that you shouldn't be touching. And if the string is even longer than 13, if the adversary really typed a long sentence or phrase or word or whatnot, you're going to really exceed the boundaries of that buffer or that array. So what does this look Like Well, if you type in a much longer word, like A-A-A-A-A-A-A, you could end up overwriting these 12 bytes, also these four bytes, also these green bytes, whatever they are, and most importantly, even the red bytes that I described as the return address. Now A-A-A-A-A really isn't going to cause anyone any trouble because it's just a sequence of random ASCII characters. But characters at the end of the day are just numbers, and numbers are just bits, and programs are just bits. So the threat here is that if you're a pretty sophisticated adversary, someone who really knows programming, you could technically write a program that does something really bad like delete all the files on a hard drive. Or send spam to everyone in your contacts. Or anything like that because at the end of the day the program that he or she has just written is just zeros and ones. If you then convert that program zeros and ones to the corresponding, even if weird-looking ASCII values, you could technically type a program at the command line in argv[1] just by typing out the funky characters on the keys that are not going to make sense to a human reading it. But those ASCII characters in the context of a program are going to be interpreted as code. And if you're really good, and frankly, it's not so much that you're really good. If by a lot of trial and error, you happen to overwrite the return address in a program, you can trick the computer into not returning back to main, but to jumping to the very input you passed into the program. So A here implies attack, like attack code. So if you're really clever, you can pass in an appropriate pattern of zeros and ones, convert it to ASCII so the human can type them in at the prompt, overwrite this return address, and trick the computer program to return from this function not to main, but to like, this byte up here. And maybe this byte coupled with all of these others means delete all the files on this user's hard drive, send spam to everyone in their contacts. This is called a buffer overflow exploit, and it's incredibly shockingly common even these days. C is not commonly used for a lot of programs but it still is everywhere. And there's other languages, too, like C++ that lend itself to this. And even though this is still a little arcane and you don't need to worry too much about the addresses on the screen, the fundamental threat here is that if you do not check the boundaries of your arrays and the amount of memory you've allocated and you touch memory you should not, very bad things can happen. You're effectively giving control to anyone on the internet who can use your program because he or she can be clever enough to inject their own zeros and ones into your program for execution. OK. So dear God, this is scary in a computer science sense. So what can we do to defend against this beyond just not writing bugs, which is never going to happen, right? Even the most advanced, best programmers still make bugs, especially as the software gets more and more complicated. We have eprintf and we have help50 and we have debug50 but there's other tools, too, like Valgrind which happens to be a tool for detecting memory leaks in a program and other memory-related issues. So let me actually go ahead and open this program, memory.c. And it looks like this. And let's see if we can't tease apart what is buggy about this program. So here's the program here. So, include standard lib.h, function f, function main, main calls f and returns 0. Nothing really interesting going on there. So what's in f? F on the right hand side allocates space for 10 integers, I think. Malloc returns the address of that chunk of memory and stores it in X and then line 8 is the bug, I think. What's wrong with line 8? Let me go here first. AUDIENCE: [INAUDIBLE] DAVID MALAN: Exactly, because we start counting at zero and because we're asking the computer for space for 10 ints, we get them back. But that's going to be accessible via [0] through, as you say, [9]. So [10] is like writing one int past the buffer, a.k.a. A buffer overflow. Now it might be hard to see this, especially when the program isn't as relatively short as this but is buried in a dozen lines, 100 lines, thousands of lines of code. But tools can help. So if I go ahead and make this program with make memory. And then I go ahead and execute Valgrind ./memory enter, the one downside of this program is that its output is just completely overwhelming. But let's see if we can't tease apart some recognizable terms. So all of this on the left is a bit of a distraction. This is just copyright information. So the interesting stuff seems to happen here. I see invalid right of size 4. Not quite sure what all of this means, but I do see that it's somehow related to uh-oh, line 8 of memory.c line 13, line eight, line 13, a lot of bugs in the same places, it seems. And then here, address such and such as zero bytes after a block of sys 40 alloc, it's a little hard to wrap my mind around this. So as always, let's at least initially just run this through help50 and see if it can help tease this apart. So we see the same output. It recognizes something familiar in yellow here. Invalid right of size four and it highlights the lines. And our TF-like feedback is, looks like you're trying to modify four bytes of memory that isn't yours, question mark? Did you try to store something beyond the bounds of an array? Take a closer look at line 8 of memory.c. So that's kind of a mouthful but it's just because we have practiced reading stuff that's pretty arcane like this. So we've extracted all of the salient details like line eight of memory.c. So line eight of memory.c, as we noted, is already the dangerous line. So what might it mean to have an invalid right of size four? Well, it turns out an int in the IEDE is how many bytes? Four, or 32 bits. So invalid right of size four just means that this int here, zero, is an int, it's four bytes, it's just invalidly being written, as you say, to the wrong location. So this is Valgrind's pretty terse way of communicating that idea. So here we have then an explanation. So how do I fix this? Well, if my intent was just to update the last area there, let me go ahead and do make memory enter, ./valgrind ./memory enter. And now this is a good thing except we've made some progress. Let me scroll up to fit more on the screen. So I got rid of that message, invalid right of size four, but this does not sound good either. 40 bytes in one block are definitely lost in lost record one of one, all right? So I need a little help with that. So let me do help50 again until I get familiar with the syntax. And it's highlighted that chunk of output. Looks like your program leaked 40 bytes of memory. Did you forget to free memory that you allocated via malloc? Take a closer look at line seven of memory.c. So let's do exactly that. So line 7 of memory.c, OK, here's where I malloc the memory. And per help50's own feedback, what have I apparently not done? Freed it. And it turns out freeing is actually pretty straightforward so long as you remember it do it. You just call free, passing in the same pointer. You don't have to remember how long it is. It's up to the operating system to remember how long it is. But now if I do make memory. And now I do again Valgrind./memory enter, heap summary, all heap blocks were freed. No leaks are possible. I see nothing particularly worrisome. And the program is bug free now. So Valgrind is another tool in your toolkit that doesn't help you find logical bugs per se. It helps you find memory-related errors, which might be logical bugs. But it helps you hone in on them and see them in a way that you as a human might not otherwise, especially if it's buried in many, many lines of code. Now you'll notice, too, real briefly, in Valgrin's output in these several examples, there are all of these funky numbers. So if I go back to the original version here just a moment ago, where it was in fact buggy in a couple of ways. And I rerun make and I rerun Valgrind, you'll see a whole bunch of things like this. At OX such and such, at OX such and such, OX, what did OX denote last time? So hexadecimal, so this is just a succinct if weird-looking way of representing numbers, generally memory addresses. And so this very specifically is saying that line 13 of memory.c happens to be using memory at this location. It's not particularly useful to us the programmers. But that's why you see it. And Valgrind is arguably a more advanced tool, which is to say that memory addresses in tools like this and even in debuggers tends to be written using hexadecimal notation like that. Of course, you've seen hexadecimal converted. Like these are the first three bytes in a JPEG, which are typically thought of using hexadecimal like this. But even though this looks new, it's the exact same idea. And I thought I'd tease perhaps with a joke that only a computer scientist can understand. OK, so that's a good one that goes around each year. So that of course is alluding to just these addresses. And now let me propose one other debugging technique and explain like, what the hell is going on here on stage today, too. So you have of course debug50, which is a tool for debugging and walking through code. And silly though this is, there is actually this thing in the world of programming, rubber duck debugging. This is, in the absence of having a TF or a CA to bounce ideas off of, this is in the absence of having a roommate around or roommate around who wants to talk to you about your code. It's recommended that if you have some bug in your program, that you keep something like this on your desk. And in the absence of roommates and friends and hopefully doors closed, you talk to the rubber duck. And I feel silly even saying this but there's a Wikipedia article on this it's a real thing. The idea here is that if you've ever been in office hours or you've been chatting with a TFer friend and just like talking about your code and talking about what it is you think your code is doing, just very often that act of saying something and hearing yourself say it can help reinforce one, what your code is in fact doing. Or if you realize verbally, wait a minute. What I just said does not seem to line up with the code, finally that light bulb goes off. And it doesn't have to be a duck. I mean, you can talk to the wall but that's a little stranger. So at least this is a personification of having someone like a colleague to talk to. So at the end of today or during break by all means, grab yourself a rubber duck the debugger and keep it on your shelf. It doesn't have to quite be this large. But this is a genuine debugging technique. Like, in the absence of understanding something, don't necessarily turn only to CS50 discourse or to office hours or to sections or the like. Literally try talking yourself through it, even if it feels a little bit silly. And if it does really feel silly, just look at him and talk to yourself in your head perhaps. But that kind of enunciation of what your code is doing or should be doing will hopefully help all the more light bulbs go off. And eventually you can just keep them on your shelf and take off that training wheel as well. Let's go ahead and take our five minute break here. Grab a duck if you'd like and we'll come back with more. All right. So we're back and we keep thinking about memory. Is this generally laid out as having addresses, but of course we've clarified that a little bit in that we have more of a canvas at our disposal now. But even then we keep talking about having things back to back to back in memory. But that simply needn't be the case. Like, what we have now with pointers and with malloc and these kinds of functions is the ability to get memory from anywhere we want and somehow stitch it together or connect these things. But how do we actually do that with the ingredients we now have? And why might we want to? So here is how we keep representing something like an array. An array, again, is just a contiguous chunk of memory where you store things literally back to back to back. But suppose that I've put six things into this array, six numbers, one, two, three, four, five, six. What happens if I try to put seven into this array? What do I have to think about or worry about? Touching memory that I'm not allowed to touch. So I'd better not put it over here. But what if seven must go in this array? Well, I don't have too many options. Like, if I fill the space I have to either overwrite some value or put it somewhere it shouldn't be, and that should never be an option because the program could or will crash. And so I could alternatively just allocate more memory. So how do I do that? Well, if I've allocated this array initially to be a size six, I could encode, just allocate a new array of size seven and then do what? AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, add all the numbers and seven. So I can take this array, I can give myself another one elsewhere in memory, copy the values from old to new, then maybe free the old. And then move on with my life because now I have enough memory. Now, that fixes the problem. And if we implemented it in code correctly, it would be by nature correct, assuming there's enough memory in the computer. But why is that arguably bad design? AUDIENCE: It's a waste of space. DAVID MALAN: It's a waste of space how? AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, even though I don't keep both around in the story I'm telling, it's temporarily pretty inefficient in that I'm using twice as much memory as I actually need only to then kind of downscale. What else? Yeah. AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, if I want to change the size of the array again, whether bigger or even smaller perhaps, if I remove items from the list, then I just have to keep allocating new memory, which is wasteful and more importantly, it's not just space inefficient. In what other sense is inefficient? Time, why? Where is the time coming from? AUDIENCE: [INAUDIBLE] DAVID MALAN: So that's an asymptotic notation. This is copying something from one array to another would be in big O of what? They go of side, yeah. So if we just genericize size as n it's like big O of N. It's a linear time operation, which isn't horrible, right? N squared is bad. Linear has never really been bad but we already know that log n is better, constant time is best. And so just wasting any amount of time doesn't feel like optimal design. And that all is a function of arrays being a fixed size and contiguous in memory, back to back to back. In fact, arguably there's one other issue that could occur. It's not so much if you have a very small array. But what if you have a huge amount of memory available but it's only in size five or six increments? Like, for whatever reasons your computer's been using some of this memory, this memory, this memory, this memory, and if you add up all the available memory there's a lot of free space but they're always separated by memory that's in use. So maybe this memory is free, then there's a bite that's in use. This memory is free, there's a bite in use. So your memory is quote unquote very fragmented. So you have lots of available memory but it's not contiguous. You cannot, in this model, allocate an array of size seven if you don't have that memory available contiguously. So not as big of a concern given enough memory, but at least something that could arise. So let's introduce the solution. Something here called a linked list. And the name kind of describes what it is. It's still a list of numbers but it's linked by way of these arrows. And we've used arrows before. What have we used arrows to represent in pictures past? Yeah, so pointers. So now that we have the expressiveness of pointers, you can kind of digitally stitch your data structures together if you spend a little bit more memory. So we've not really solved the problem you identified, which was the space use. But if you're tolerant of that and if you've got enough memory at your disposal and can afford to spend it, why don't we store for every number not just the number but also space for a pointer? So each of the boxes I've drawn here now doesn't just have a box for the number itself n. It's got really two boxes together, one for n and one for something we'll call next, which is going to be a pointer to or equivalently the address of the next node, as we'll call it, the next box in this list. Now even though we've drawn it here very prettily from left to right, technically these boxes could be anywhere in memory, specifically in the heap, we're going to see. But they don't have to be back to back. And so the fact that there are these gaps in between the nodes deliberately paints a picture that these things don't have to be back to back to back any more. They can be anywhere. And now suppose I've got these five numbers, nine through 34. Suppose I want to add another number. Where do I put it? I don't seem to have room. But based on this picture, how much you infer we're going to engineer this. AUDIENCE: [INAUDIBLE] DAVID MALAN: Yeah, so why don't we just allocate space for another and it's not going to fit on the board here but who cares? We can put it here. Put the new number in it and then just add a line, an arrow from it to that. And so this then is going to be what we call a linked list and it gives us dynamism. It gives us the ability to grow or shrink our data structure, addressing your concern, not necessarily yours. It's still a little space wasteful, but we gain benefits for both of you, your concerns time is a lot better now because we don't have to waste time copying the values. We just add to the values. And if we want to grow or subtract it we only do as much work as we're trying to add or subtract. We don't have to worry about fixing everything. But there is some complexity here. And given that we have a whole bunch of these, which would make problems at five a little easier, if you haven't quite finished who done it. Could we get for just one demo today six volunteers? Six volunteers? All right. Come on down, right in front here. All right. Well actually, come on up. Come on up and have a two and three over there, four. No one over here today. OK, OK, five. OK, six. OK, six, six. Come on up. All right. We'll save this till the very end. But let me give you guys in the meantime, these numbers. And if you don't mind holding the numbers out I want you to go over to the left there and just order yourselves just as the picture on the screen OK, you'll be 17. Let's see, let's see, nine. I might have given out too many numbers. OK, let me free that and give you nine instead, if I may. And give you, let's see. You have 17? OK, so 17. And yeah, I'll have you go ahead and flip yourselves if you don't mind. 22, and then 26. And what do we got? 20, 34. 34. OK, and you guys will be slightly special. So who wants to be literally first? OK, so here first. And who wants to be temporary? OK, you'll be temporary, all right. Come on over here. OK, come on over here and if you guys could step a little closer. So we have 9, 17, 22, 26, 34, and give yourselves like a foot in between. And if you guys could use your left arms to represent the pointers to visualize who is linked to whom. OK, and why don't you just point, yes, very deliberately down. So what's your name? NAZLI: Nazli. DAVID MALAN: Nazli. So Nazli's left hand will be a null pointer. It's not pointing at anyone. So literally just pointed down to the ground, like ground electrically. OK. So now we just have to have some first node. So what's your name? OLIVIA: Olivia. DAVID MALAN: Olivia is a little special here in that her paper has a word and it's not just a number. She represents a distinct variable called first. Because that one catch with the linked list is that you don't remember it by way of the address of a contiguous chunk of memory. You remember a linked list by way of the address of the first node in the linked list. And what's your name? ACHMED: Achmed. DAVID MALAN: Achmed. So Achmed here happens to be the very first node in the list right now. So Olivia's left arm is going to be pointing to Achmed to represent that he is the first node in the list. OK, and what's your name? JESS: Jess. DAVID MALAN: Jess. We're going to use Jess in just a moment to complete some operations. So suppose that we actually want to insert some value into this list, like the number 55. All right, so the number 55 is going to require a little bit of cleverness here. And so I need some place to store this. I need to malloc. So OK, you've been volunteered. What's your name? STELLA: Stella. DAVID MALAN: Stella, come on up. So malloc Stella. And we will store the number 55 in Stella's node. And right now if you could just kind of point your left hand anywhere. It's kind of a garbage value. OK, thank you. And now what's your name again? JESS: Jess. DAVID MALAN: Jess. OK, so Jess is going to help us find the right space here. So we can obviously see where 55 belongs if we're keeping this sorted. But again, computers don't have that luxury. Moreover, we no longer have random access. We can't just jump to [0], [1], [2] because there are these gaps between them. And just to make this more clear, can every other of you step forward or back so that it just looks a little weird? So you can no longer index into this data structure because again, it's not an array. It's not back to back. These things could be anywhere in memory and it's only the pointers that are linking everything together. So Jess now is going to initially point at the very same thing that Olivia is pointing at, the same address or Achmed. All right, so now we have a bit of redundancy. But suppose we want to insert 55. What kind of logic, what's the pseudocode here for Jess to find the location for Stella? What should Jess do? Yeah. AUDIENCE: [INAUDIBLE] DAVID MALAN: OK, keeps going until she finds the null pointer, or more specifically, till she finds the null pointer or the right location for this number if we want to keep it sorted. So let's do that. So you're pointing at Achmed. The number nine is not greater than 55. So Stella belongs after Achmed. So what are you going to do? Good, you're going to point at Maria. So you know to point at Maria why? JESS: Because nine is less than 55. DAVID MALAN: Nine is less than 55 but also, Achmed isn't just story nine. Right, he has this next pointer that's telling Jess where the next value is to look. So his left hand is the substitute for what would have been just ++ the world of an array. So you go ahead and physically walk and let's just walk through this. So now we're pointing here at 17. It's not greater than. We point next at Arunev. OK, that's a 22. Still not the right value. We keep going. What's your name? Jeung Wan? OK, that's not the right number because he's holding 26. And now we catch you again? Nazli. Still no good and now go ahead and follow her left hand. OK, so now we know that this has got to be the right space because we haven't found numerically the right space. So if we could borrow you, Stella, all the way over here. Well, you're not technically physically moving in memory but this will just make the story better. OK, so yes, we're re-alloc-ing sort of. So what are you going to do now with Stella now that you found the right location? Leave her here, OK. But she's just kind of orphaned now. She's pointing at nowhere and no one's pointing at her. It's kind of sad. And this is actually perfect. Memory leak. OK, so let's fix. Who has the point at whom? OK, good. And now what should Stella point at? Since now she is the end of the list and she's just pointing to some garbage value. And she's pointing, to be clear, at some garbage value because when you call malloc you just get garbage values. We overrode one of those garbage values with 55 for n, but the pointer has not yet been overwritten. So what you want to do, Jess? To whom? That's OK. It's close. What should her value be if there's no one to her left? Should be null. OK, and how did we represent null before? Yeah, exactly. Null, so now we have a list and now just to fix things, your pointer, so Jess is kind of temporary. We don't really care what her value is. But who's important over here? What's your name again? Olivia was first and now do we have a list that is still linked? We do. And now, it of course took a little while to walk through this. And frankly I kind of told a lie. I haven't really made this faster because what was the running time of this algorithm? It was still a log of n. But that's because what? I was trying to maintain what property? Sorted. So suppose I relax that constraint. Suppose that I didn't care about being sorted order. Can I do better than 0 of n in order to have inserted Stella? AUDIENCE: [INAUDIBLE] DAVID MALAN: OK, Constan time, where can I put her then? AUDIENCE: [INAUDIBLE] DAVID MALAN: Exactly. So if we don't care about sorted order, I could have saved myself a huge amount of time and we could have inserted Stella here, updated Olivia's hands, and updated Stella's hands to point at Achmed. And then we're done. Constant time. And here's an example of why big 0 of one doesn't mean one step. That's constant time because it's like moving Olivia's hand and Stella's hand but not Achmed. So that's at least two steps but it's still always two steps. So if we could get a round of applause for our volunteers here? [CHEERING] You can keep both your numbers and these if you'd like. Thank you so much. May let this help you P set five. Oh, sorry. All right, thank you. So that's of course just one operation. There could have been other numbers. Like, if we were trying to insert five in sorted order, we would have gotten our constant time or maybe our omega of one because in the best case, the number might end up at the beginning. 20, had we inserted it with our humans, might have been a little more involved because we kind of have to walk the list as we did with Jess. But then she's kind of got to point at the person behind her and the person in front of her because she has to update more hands. Which is to say, that doing insertions or even deletions requires a bit of re-stitching. It's like kind of fixing clothes here if you're trying to maintain a contiguous thread through all of these data structures. But at the end of the day, even though it uses pointers, it really just boils down to getting this logic right. And in fact, let me do an example here of some code of how we might do this. I'm going to go ahead and open up list 0.c and take a look at how this here works. So in list 0.c we have the following code. We first, in main, get a positive number from the user. So I'm going to wave my hands at this because this is kind of old school now. We've been using a do while loop to get a positive number from the user. And I'm calling it capacity. So this is an example of getting it in from the user, calling it capacity. Capacity meaning the maximum possible size for a data structure. That would be a term of art there. Here now is where, per week two the class, I allocate an array for ints this many ints capacity. So that, too, is hopefully familiar. There's no pointers yet. There's nothing too fancy. I'm just allocating an array of size based on what the human just typed in. But here's where it gets a little interesting. The purpose of this program is to just prompt the user for that many numbers again and again and again. So I can type in 1, 2, 3 or 5, 17, 20, 22, and so forth. And just build up an array of numbers and memory, but I'm going to trip over the problem we identified a little bit ago. So here we go. I initialize size to zero because initially there's nothing in the structure. While size is less than my capacity, so while the current size is less than the max, so to speak, do the following. First I'm going to get a number from the user. And the goal is to now insert this number into this list. But now I'm going to do a quick sanity check. Let me check and make sure this number is or isn't in the list already, because I don't want to have duplicates. Why? Just because. I want this to be a very clean list, no duplicates. And so this loop of code here, maybe from week one, week two, week three, is just an example of iterating through an array, looking for the number, and if so, remembering that I found it with a Boolean so that I have an answer found or not found in a Boolean variable. OK. So that's all that code is doing. Still no magic. So now is where the interesting part of the story happens. So if the number was not found in the list already, here is how per week two, we add a number to the end of an array. Because if the size is initially zero, numbers [0] is where the first number should go. If size's initial is one at this point, numbers [1] is where the new number should go. And then I should increment size. But there's a problem here in that once I print out these numbers and the program ends, I can only have inputted as many numbers as are available, as I have capacity for. So it's kind of constrained. So what if I want to do a little better than this? Enlist 1.c which does now introduce new material, or at least an application of the topics this week and last? Here in line nine is technically how I can allocate an array before I know the size of it. So an array, recall, is just a chunk of memory identified by some word, like numbers or students or whatnot. But technically we've seen that there's kind of this equivalence, where if an array is just the chunk of memory, you can technically refer to an array by an address, the address of its first byte just like a string. So this on line nine is a little new. But it's kind of that idea. Give me a pointer called numbers but initialize it to null. There's no space for the number. But this pointer therefore doesn't point to any chunk of memory. It would be like Olivia standing up here awkwardly with no one to point at because we've only allocated space for the first pointer, not for everyone else on the stage. So capacity is by default zero. So here the rest of the program is pretty similar. I go ahead and infinitely prompt the user for a number. I check for errors. It turns out if you read the documentation for get int it will return a special constant called int max if the user stops for writing input. Here I'm just checking with a loop in a Boolean, whether or not this number is in the list, same as before. But here's where I start to use some new functionality. If the number was not found already in the list, and the size of the list already equals its capacity, that is if it is filled, what do I have to do conceptually now? If I've got an array whose purpose in life, as we proposed earlier, is just to grow it? I need to add space for it. So I need to add space, as we were proposing. Even though this is kind of lame in that it's a little inefficient, here's how we do it. I can simply call real alloc, passing in the array that whose memory I want to reallocate. And I just tell realloc how much space I now want. So here, this is the size of the int, four bytes. And this is how many bytes I want. So whatever the current size is, realloc give me one more byte. And then realloc gets assigned to numbers and I check if it's null or not null. And I'm keeping it a little simple. We could add some additional error checking here, but what does realloc do? Realloc is pretty cool because if you pass to realloc, a pointer to a chunk of memory that's like of this size, realloc will look in your computer's memory and if it sees a bigger chunk of memory over here, it will handle the copying of everything over to it for you. And then return to the address of the new chunk of memory and free the old for you. So does the switcheroo. It's still linear time but this is how you would use it without having to alloc and free and use a four loop like we described before. And then you can go ahead and put the number in the list as before. So the only new thing here, even though we're going through it quickly, is that this is how you call realloc. You pass in a pointer that's previously pointing to a chunk of memory or even null. That's OK. If you pass it in a pointer that's pointing null, it will give you back the address of just one byte and then the next time two bytes, three bytes, and four bytes. But with linked lists things get a little more interesting. And the syntax is going to be a little funky but let's see. Here it turns out is how we can implement each of our human volunteers in code. Each of them I called a node and node is a term of art in CS. It refers to some data structure that contains some information. Each of them was holding a number, which we called an int. And then each of them, this is kind of funky, had a left hand called next that was meant to point to someone who looked just like them structurally. So the idea here is that we don't want to just have another structure inside of a structure, otherwise you would get this sort of infinite Russian doll kind of thing going on. You instead want to say, each of these structures has a pointer to someone else who looks like them structurally, too. And that's how we get the left arm metaphor implemented in code. So that just defines a node, one of our volunteers. Meanwhile though, here's how we would implement Olivia in one line of code. So Olivia was herself a pointer to a node. She didn't have a number, right? Her sign just said first. She was not holding a number. So we don't need a whole structure for Olivia. We just need a pointer to one such node structure. But initially she was just kind of standing here so we'll just say she was null initially. So the rest of this code is presumably about malloc-ing someone like Stella from the audience, updating Olivia, using Jess to actually update pointers temporarily. So let's see what this looks like in code. So while true, I'm just going to prompt the user for numbers like before. As before, I'm going to check for errors in the same way. Here's a little different. Here's the block of code wherein I just check if my current linked list already has the number I'm trying to insert. But remember, we took away the expressiveness of square brackets. Can't do that anymore. I have to now do this with pointers. So here we go. I, with my four loop, initialize a pointer, Jess, to point at the same thing Olivia was pointing at, numbers. So again, Jess was also just a pointer. She was not holding a number. She was holding PTR, so she was just one pointer pointing at the same thing as Olivia. Here we're saying, so long as Jess is not equal to null, so as long as Jess doesn't walk off the edge of the stage, go ahead and do the following. What do I want to do? And this syntax is new. We saw at the beginning of today the dot operator, which says take a data structure like students and go into it with the dot operator. Get their name and their dorm. That was because the first demo today did not use pointers. It just used structures. Now we're using structures and pointers. And so the syntax changes just a tiny bit. When you have a pointer that is a pointer to a structure and you want to follow that pointer and go to that structure, the one piece of syntax in C that maybe actually maps to reality or concept is this arrow operator, which means follow the left hand, look at the structure, and get at that number. And so if the volunteer's number equals the number that Jess was looking for, go ahead and say found is true. Otherwise update Jess or pointer to equal whatever her left hand is pointing at. So if Jess was temporarily pointing here, she would then update herself by pointing there. And so that's all this code is doing. Jess starts to point at whatever her left hand was pointing at. She moves physically on the stage. All right, so now is where things get a little ugly. And we'll do this with a hand-wave because I think this one is better done at a slower pace on one's own. And we'll come over these kinds of things in section and beyond. Here's how I allocate space for a new node. When I said malloc Stella, it's this line of code here, 45. Malloc space for the size of a node and store it in the person that Stella embodied. Otherwise, if there is not enough memory, if something goes wrong, return one. Meanwhile, here's how we add the number to the list. So this is exactly what Jess ended up acting out. First we handed Stella her number, which is line 52 here. We technically told her to point at a garbage value, so I've improved the code since. So line 53 would be like telling Stella, point here, not here. So that's just cleaning up that omission last time. And then here we have the same kind of code again, a four loop that looks kind of funky but it's just like updating the hand as you walk through the list. And here's where the interesting part happens. At the very end of our story, Jess kind of manipulated our volunteer's arms. So if not pointer next, which is a cryptic way of saying, if pointer next equals equals null. So if Jess has found the end of the list, go ahead and update whoever she is pointing at's left hand to point to Stella, the new node. Then break out because we're done. So syntactically, this is hard and problem set five will afford us opportunities to walk through very similar code. But for now, just realize that all we're doing is instead of just using super simple arithmetic, plus one, plus one, plus one, we're just kind of following these arrows, following these arrows. And the kind of syntax we'll use for that is just this, which is not very readable at first glance. But that's why I grasp onto, if you are a more visual person, the kinds of hand manipulation and arm changes that we were doing here physically with our volunteers. And then we, again, print up [INAUDIBLE].. The last thing here I'll note, and you'll do this in problem set five, is here's how you might free a whole length list of numbers. I just kind of congratulated our volunteers and everyone left the stage, thereby being freed. But if we wanted to do this more methodically, we could use a four loop but here I chose to do a while loop, because it's a little more succinct design wise. Here was our pointer, temporary pointer pointing at numbers. And here I can say while pointer is not null because if it's null my work is done. Here I go ahead and say, update this value next to equal whoever's next in the list. Free whoever's currently in the list. And then update the next pointer. So again, don't worry too much about the lower level details here. But just take away for today that we do now have a way of implementing, in code, the higher level intuition that derived from this kind of data structure. But don't fret yet about the code itself. But we now have the ability to stitch data structures together like this. Upside of which is now we get dynamism, right? We're no longer stuck painting our ourselves into the proverbial corner with arrays by not allocating enough memory. Or conversely, wasting memory by allocating way too much just so we don't have to deal with the problem. But we pay a price with the linked list. We get dynamism and can more efficiently add a node, subtract a node, and we just have to in constant time, update those pointers. But we spend more memory for all these darn pointers. And frankly, the code is more complex. So recall from our first or second week, human time, programmer time is a valuable resource. And making something harder and more time consuming to implement might not be a price you want to pay. And so even I was just chatting with a colleague yesterday about how in graduate school I used to cut corners, especially late at night when writing code. And I would write sometimes deliberately really bad code that might take like eight hours to analyze some data set for some research project I was working on because you know what? I realized it was faster for me to write bad code, poorly designed, that takes eight hours because in those eight hours I could just go to sleep, frankly. Now I would say that was only because my advisor was not grading me on correctness and design and style. But it is a manifestation of a very actual resource that I don't recommend you cut that particular corner for now, since one of the goals of being in a class is to get better at design. But at the end of the day and in the real world, even CS50 staff and I are constantly making decisions. Well, yeah, we could improve this feature of help50 but it's going to take a week to do it. Or we can just throw in some extra line of code and get it done now. And it's a trade off. And this is what makes code good and bad. And when you start to cut these corners in the real world, you start to accumulate what the world would call technical debt. And debt tends not to be such a good thing. And that speaks to the price you're paying in the long term because it might take me and the staff longer this summer now to go back in and clean all that up. And God forbid, overnight frankly, and this happened more often than I should admit, my code was buggy and bailed out at like 2:00 AM I wake up eight hours later thinking, my data's ready. No. I should have done it right the first time so I could rerun the code again and again. So what else do we get now from this ability to have pointers in data structures? So there's this picture here from Mather's Dining Hall. The cap represent the notion of actual trays. And we've been using the stack in a very low level arcane way to talk about memory management, which isn't all that useful to us for solving problems. But the data structure is. It turns out there is a data structure in computer science called a stack. And your computer, Mac or PC, are constantly using it to manage functions and memory, but we can use it, too, for various applications. We can implement a data structure within I have two operations. They're conventionally called push and pop. Though it's like add and subtract. You can call it anything you want but most programmers would call it push and pop. Push is like adding a tray to the stack and pop is like taking one off. But just as the name implies with the stack, what's this characteristic of a stack is that it is an example of a LIFO data structure, last in first out, L-I-F-O. Now what does that mean? Well, if one of the staff from the dining hall comes by with a new tray that's just been cleaned and he or she puts it on the top of the stack, which one is a normal human being going to grab first? The last one in, right? It'd be strange and kind of difficult to get down on your knees and pull out the bottom one, even though that would be more fair, right? Like that little tray down there has been waiting the longest to be used. But it's under the weight of the whole stack, literally. But that, nonetheless, is how a stack would work. And you can implement the stack now in a couple of ways. And here's where the world gets interesting in programming, in that there is this distinction between design of data structures and low level implementation details. A stack is as I've described it, a LIFO data structure. Push and pop, last in, first out. That's it. How you implement that could be any number of ways. For instance, I could implement a stack as a C data type, custom one, that has an array of numbers for this capacity where capacity is some big constant like 100, 1,000, however many trays I want to store so long as I keep track of the size of how many trays are in it so that I can always make sure its size less than or equal to the capacity. Just to make sure I don't try to cram too many trays in there. But what's a downside of this implementation of Mather House's stack of trays? AUDIENCE: [INAUDIBLE] DAVID MALAN: Limited space, exactly. I have consciously hard coded capacity to be some fixed value. So if we buy a new tray or a whole box of trays arrive, might not fit there, right? Once I exhaust this remaining space, I need to make a new pile or I need to store them elsewhere. I'm just out of space. So maybe this is a good design decision in that it reflects reality. Or maybe it's stupid because now I can't store even more trays when they come in via shipment. So I could solve that. We know from our brief example a moment ago, you could just make your array dynamic. Don't preallocate it to be of size capacity. Just declare it to be a pointer that will eventually point to maybe space for one tray or 100 trays or 1,000 trays or maybe 1,001 trays if we realloc the space again and again. And when you start writing code that involves other people, whether it's for some school project or a personal project or just in the real world, this is where life gets more interesting, too, because so long as you and I, if my colleague kind of decide, OK. I'm going to expose push and pop as the operations. I will implement push and pop. You don't have to worry about the low level implementation details in my own design decisions. You just have to read my documentation and not care how I've implemented it because I have abstracted that away for you and given you just an API. Push and pop would be an API, application programming interface. All you need to know is that you can trust that I will implement push and pop. And you might dislike it ultimately, if I limit your space, but to understand that you need to read the documentation to know what features my implementation are providing. Now this of course is the ridiculousness that ensues every year or so, whereby people line up to buy an iPhone. Now, why would it be a bad thing if Apple used a stack when people arrive at 3:00 AM for their iPhones? Yeah. AUDIENCE: [INAUDIBLE] DAVID MALAN: Exactly. The person who came last would get their phone first, which is fantastic for that person. But it's really unfair to everyone else. So Apple of course, like most stores, if they even have this problem, have queues or lines whereby it's a FIFO data structure, first in, first out. Were the first person in line hopefully gets his or her iPhone first. So how can you implement those operations or that API? Might call it nq and dq, but add, subtract, whatever. But these are the terms of art. And we might implement it as follows. A queue might just need a little bit more information than capacity and size alone. You have to remember who's in front, potentially, just so that when that person gets out of line, you don't have to move all of your data in the array just like the humans would walk forward. That's a waste of time. Every time someone's ready to buy their phone, why does n-1 people have to take a step forward? Why not just bring the phone to them and save that inefficient use of time? Or we could do it like this, more of a dynamic data structure. And we won't do the code here. But we've seen, for instance, the example in our list zero, one, and two code, how you could start with a fixed size array, make it dynamic with malloc and realloc, and how you might further make it dynamic with a linked list, albeit with trade offs of time and space. There's this great short video I thought I'd share here, wherein in Jack learns the facts about queues and stacks which distinguishes these two data structures in a way that actually paints an even more clear picture of how they're distinct. If we can dim the lights for 60 seconds or so. [VIDEO PLAYBACK] - Once upon a time there was a guy named Jack. When it came to making friends, Jack did not have the knack. So Jack went to talk to the most popular guy he knew. He went up to Lu and asked, what do I do? Lu saw that his friend was really distressed. Well, Lu began, just look how you're dressed. Don't you have any clothes with a different look? Yes, said Jack. I sure do. Come to my house and I'll show them to you. So they went off to Jack's and Jack showed Lu the box where he kept all his shirts and his pants and his socks. Lu said, I see you have all your clothes in a pile. Why don't you wear some others once in awhile? Jack said, well, when I remove clothes and socks, I wash them and put them away in the box. Then comes the next morning and up I hop. I go to the box and get my clothes off the top. Lu quickly realized the problem with Jack. He kept clothes, CDs, and books in a stack. When he reached for something to read or to wear, he chose the top book or underwear. Then when he was done he would put it right back. Back it would go, on top of the stack. I know the solution, said a triumphant Lu. You need to learn to start using a queue. Lu took Jack's clothes and hung them in a closet. And when he had emptied the box, he just tossed it. Then he said, now Jack, at the end of day, put your clothes on the left when you put them away. Then tomorrow morning when you see the sun shine, get your clothes from the right, from the end of the line. Don't you see, said Lu, it will be so nice. You'll wear everything once before you wear something twice. And with everything in queues in his closet and shelf, Jack started to feel quite sure of himself all thanks to Lou and his wonderful queue. [END PLAYBACK] DAVID MALAN: So that isn't to say that queues are all that and stacks are a bad data structure. They actually each have their own applications. And in fact, one common use for stacks beyond memory management, as we discuss in a couple of weeks when we start exploring HTML and web programming, you'll see that HTML itself, this is the language in which web pages are written. That you'll soon be able to write if not already. This is a language that actually has a nested hierarchy to it. Who, where by a browser, might actually use a stack to analyze the HTML that composes a web page to determine, for instance, if it is correct or not. But there's so many other tools that we can now add to your toolkit. And even though we'll look at each of these just briefly, each of them derives from these very two simple principles, the ability to come up with custom data structures inside of which are pointers, or the ability to stitch one thing to another. So here's an example of what a computer scientist would call a tree. The node here we've drawn as circles just because. But the nodes in a tree are much like a family tree, where each node has zero or more children or descendants, maybe a parent or other ancestors. And so we'll call things like the first node at the very top in a data structure called the tree, the root of the tree, albeit growing downward like this like a family tree. Anything at the very bottom of the tree that only has arrows going into it will be called children or leaves of the tree. And so this might be a way to lay out data in a useful way. In fact, if you think back to when we had things like numbers like this, thus far any time we dealt with numbers or words or Mike Smiths, we would just order them from left to right in an array and then search the array either in big O of end time linearly from left to right. But we did better using what? Binary search, but for binary search it needed to be an array and it needed to be sorted. And the problem I never dealt with was we never actually added another page to the phone book. We never actually tried to add more numbers to our array. And yet today, we've kind of identify these very glaring issues with arrays, which is that you're kind of painted into a corner. If you allocate only so much space, you use it all up and then darn it, you want to add more to the array. So how can we maybe still lay out data in sorted order, still leverage something like logarithmic time and divide and conquer, but get today's benefit of the dynamism whereby we can grow the data structure and shrink it very incrementally, without having to all of a sudden reallocate the whole structure? Well, instead of laying out these numbers, which are conveniently numbered as multiples of 11 here, 22, 33, 55, what if we laid them out like this in memory? We won't look at the code for this, but think of each of these circles as a structure, a data structure, inside of which there's an int n, how many pointers apparently? Seven total on the screen. But how about within each node, like this one here? There's a number n, 55. And what else? How many pointers? Just two. Two maximally, in fact, because the leaves, it would seem, have zero by definition. And technically, if I hadn't added 22, maybe there could just be one child. This is what we call a binary tree because every node has at most two children, 0, 1, or 2. And it's technically a binary search tree because of a special property. It's very searchable because if you look at any node, its left child is smaller. And if you look at any node, its right child is bigger. And that's a recursive definition, so to speak. You can look at any node in the tree and that definition is true. Even the leaves, because it's sort of a vacuous statement to say it's greater than its left child if there is no left child. It's sort of trivially true. So what's nice about this data structure? Well, suppose I want to search for the number 22. Like our linked list, and like Olivia being the special first pointer, a binary tree in a computer's memory would just have one special pointer, called root or first or whatever you want to call it. And if you want to look for 22 just like Olivia and Jess were, you might look here. And say, hmm, 55 is greater than 22. So which way do I go? Left obviously. And here, you know, if we were doing this visually we could snip off that whole subtree. And you would see half of the problem be torn away like the phone book. 22 versus 33, of course, this is greater. So we go here and we find it. And long story short, that was not linear because we weren't searching all of the nodes. And if conceptually we were chopping the tree in half, in half, in half every time we went left or right, what should be the running time of search on a binary search tree? Log base 2 of n, or just logarithmic as we've seen. Now it's not necessarily always as prettily balanced. This is very deliberately chosen. You can get perverse situations where it just kind of devolves into a long linked list. But it still is a binary search tree. It was just poorly built. But at least if we keep a balance like this, we can gain some benefits. And here's how we would implement your proposed integer and two nodes. Instead of just calling it next, I'm going to call it more semantically usefully left and right. And notice that struct node is just a pointer called left. Struct node is a pointer called right. And that's how we implement these. And what do you think the leaves have as their values for left and right? The leaves of the tree had no children, by definition, so what's the value of left and right? Null. So they're just sort of pointing down at the floor as zero, null, values. So we're not going to write the code for this now, but we can leverage weak zero's ideas. Divide and conquer, binary search. We can leverage last week and this week's ideas of structures and dynamic memory and technically the heap in order to start to build up data structures like this, that now give us dynamism that can grow and shrink as needed. And just so you've seen the code, here might be an implementation of a function for binary search tree that, given the roots of the tree, finds for you true or false, whether or not something is in it. So I want to search for a number n in this tree. So this here is, again, a pointer to the root just like Olivia was a pointer to the first node in the linked list. So if the tree is null, return false because it's kind of a stupid question to ask. If there's no tree being passed in, it's clearly not there. So return false. That's our special case to ensure that we don't dive too deeply. But here's a very cool application of a past idea. If n is less than the n at the current node in the tree, and remember the arrow just says, go there and look at n, we know we want to look at the left hand side of the tree. So do we have an algorithm to search a tree for a specific value? Just so happens that tree is now smaller because it's this half of the tree on the left. We do. We have a function called search that takes a number as input and takes a tree as a pointer. That doesn't have to be the whole tree. It can be a sub-tree because again, a tree is kind of recursively defined, because every left child and right child itself might have children. So it's a smaller tree but it's still a tree. So I can answer this question. If n is less than the current node's own n value, I can just return the answer to calling search on the same number, but passing in just the left half of the tree. So this is like the tree version of tearing the phone book in half and searching only the left half. And you can perhaps guess, if you're following along at this point, if n is greater than the current node, we're just going to search to the right. And that's three cases. What's the fourth possible case? Yeah, if n equals the current node. And so in that case I'm just going to trivially return true. And this is kind of beautiful. It's not from one perspective. It's not obvious at first glance, how this works. And it's not comfortable necessarily if you're not used recursion that much. But what's beautiful about this, especially if we get rid of the stupid curly braces and a lot of stuff it's not really intellectually interesting, you are reducing this problem to really just these lines of code. Check for null, return false. Check if it's less than, just recurse on the left. Check if it's greater than, recurse on the right. Otherwise you found it. It's literally the same idea or spirit as our divide and conquer approach for the phone book, just implemented now using trees or nodes linked together in a tree. Yeah? AUDIENCE: [INAUDIBLE] DAVID MALAN: Tree arrow n, so tree, recall, is a pointer to a node. So that, just like Olivia was a pointer to a node in a linked list, this would be like Olivia standing here and instead of pointing at a line of students, sort of pointing at a tree of students that fans out this way. So tree, we could call it anything we want. I just called it tree, represents that. And meanwhile, tree left would be like if Olivia was pointing at a node here. Actually, if Olivia is pointing at the root of the tree here, tree left would be go look at the left half of the tree or the right half of the tree. If again, our volunteers were laid out on stage like a fan, like a tree instead of a list. So we've seen a whole bunch of algorithms that might have any number of these running times. And up until now kind of the best running time really has been this for the fanciest of algorithms. But we have seen constant time here and there. And even today if we want to insert into a linked list and we don't really care about the order, we can just plug the new value right there after Olivia and before Achmed and get our constant time. But wouldn't it be nice if more operations were constant timed? One step, two step, three stepped or some finite number? And it turns out we can achieve this with a bit of thought. And we can leverage another sort of familiar idea as follows. So like here, for instance, is some unusually large playing cards, which actually do exist if you just Google jumbo playing cards and look for them on Amazon. Suppose I wanted to sort this deck of cards. I could go through the deck one at a time and order them both by their suite, like clubs and hearts and so forth, and also by their numbers. But odds are if you're like me, you're going to probably try to make the problem a little simpler. And if you see the king of spades here, I put him over here. Nine of spades, I'm going to put that there. 10 of spades coincidentally. I'm going to put that over here. Then I'm going to do what with the hearts, probably? You know, probably I'm not going to go through one at a time. I'm going to kind of bucketize each of the cards. So here's the ace of clubs, so I'm going to make a third pile. Here's a couple of diamonds. So that's my fourth pile. And then I'm just going to repeat this, because it's a nice simple algorithm. It's going to make my life a little easier in just a moment once everything is in the right pile. But this is a general notion of what we'll call hashing. And I'm not going to finish it because surprise, surprise, we're going to get 13 cards in each pile. But this is a more fundamental notion of hashing. You take as input something from your list of inputs. You look at it and you make a decision based on it. And in this case, my hash value is going to be zero, one, two, three, two because it's going to go into the hearts pile. And what is a hash function? It's just going to be a function in code or in my brain that just makes a decision based on output and outputs a hash value, which in this case is going to be that pile, that pile, that pile, that pile, or if we want to be more precise, zero, one, two, or three. If those four piles are implemented it's like four arrays or some kind of stacks, really. I seem to be making a stack of cards. Now it's not done. If I want to sort these things later, I'm still going to have to sort each of the piles of 13. But I've kind of made the problem a little easier for myself in that I've spread it out over four equivalent problems. But the key ingredient here is that notion of hashing. And honestly, if you've ever watched a TA or professor deal with these things at the end of some class that has blue books, if a whole bunch of students at the end of the hour come down and start handing in their blue books it's a complete mess. And if the TFs or professor wants to organize these, you might make a whole bunch of piles. All the L last names will go there. E will go there. F will go there. And maybe in this case you'll alphabetize as you go, thereby making this problem easier, too. That is a hash function. You take as input a student's name, you look at the first letter of his or her last name, and you decide whether it goes in bucket zero, one, or maybe 25 if you're indeed hashing based on the English alphabet. So hashing is something we've all done, even if we've never slapped that name on it before. So how might we leverage this kind of ingredient and get ourselves closer to the holy grail of data structures, which would be constant time for everything? Like none of this linear, none of this logarithmic time. So suppose we have an array or a table, we'll call it, like this. I'm going to call this a hash table because I want to leverage the idea of this hash function. And suppose that what I want to store in here are just things like names. And I want to go ahead and store the name Alice, because she turned in her exam first. So here I might have [0] through [25] or in general, n-1. So there's 26 buckets total. Where might I be inclined to put Alice? I might just hash her to zero because Alice, we'll use her first name, not last, because she never seems to have a last name. So first name, Alice, brackets zero, she goes there. Then Bob comes up, turns in his exam. Where does he go? [1] and then maybe Brendan comes over. Damn it. No room for Brendan's exam. Why? Because he hashes to the same value. And this can happen. Like, you might hash to the same value. And here it was not a big deal. I kept getting diamond, diamond, diamond. That's fine because this data structure grows. But this is an array, it would seem. And I could write Alice here, I could write Bob here, but Brendan should be written there too. But I don't want to give Bob his exam back just to accept Brendan's. So where could I put Brendan? Maybe I'll kind of cheat and just put him here because there's room, right? This is all free in this story so far. But then Charlie comes forward. What do we do with Charlie? Now Brendan is where Charlie should be. So now I've just kind of made a mess but I have so much free space and odds are I'm not going to have a student, no offense, whose name starts with a Z or an X or and some of the statistically less likely ones. So why don't we use those spaces? And we could, but this is an example algorithmically of linear probing, where you linearly top to bottom just kind of probe the data structure looking for space and just drop the values in the first available. And initially it's nice and clean and nice and efficient because if I want to look for Alice's exam later, boom, she's on the top of the pile. Bob, boom, second in the pile. But then Charlie, not quite where he should be. So eventually with this approach of linear probing it's space efficient in that you pack everyone into your data structure. But it eventually devolves into something linear. If Alice came and given her exam last, by nature of space, she might end up at the bottom of the pile and that does not make her easy to find later. So what if we instead change the data structure and use elements from today and past? Let's use an array here of pointers drawn vertically just because. And then why don't we string students' names off the right of this? So this is an excerpt from a text that explores exactly this data structure. It's called a hash table, not with linear probing, but with separate chaining, whereby your data structure, your hash table, is technically an array. This time it's upsized 31, because the book's example was about day of the month for birthdays. And so the data structure has not just an array, though, but what other data structure combined with it? It's a kind of linked list. So what's nice here is that S Adams, so Adams, starting with A in our story. Now they're using birthdays if you read this in the context. But suppose that Adams is the only one with birthday on the second of some month. Well, he or she might end up here. And that's no big deal if someone else has the same birthday in this example, because we can either walk the list as we did with Jess and just string him or her at the end of this data structure. Or we can just kind of insert them at the very beginning and just use some constant time changes to peoples' left hand to fit them in. The point is though, the data structure no longer is an array only. It's an array of 31 buckets, four piles, 26 piles, 31 piles. But each of those piles can grow vertically, so to speak, or in this case laterally because we're implementing the idea of these data structures now by using an actual linked list. So why is this actually better or worse? Well one, is there any limit now on how many students can turn in their exams or have birthdays? No because we just keep growing it wider and wider and wider. Why is this then a good thing? Well now if I want to look someone up, if I know their name starts with A or in the birthday example, I know their birthday is on the second of the month, I know deterministically, no matter what, what bucket they will be in in the array. Now, they might be in a long string of people with similar names or birthdays. But they're going to be there, deterministically, predictably, again and again. And the beautiful thing is if my hash function is well-implemented, uniform so to speak statistically, then it would be nice if almost all of these chains are roughly the same length. It would be pretty lame if this chain were really huge and then every other chain were shorter because that's just an opportunity for better design. So in real terms, a hash table, when implemented like this, should decrease in this concrete case, by a factor of like 31, how long it takes to find someone. So the time is one divided by 31 because if all the chains are roughly the same length, you have chopped up your data set into four piles, 26 piles, 31 piles, each of which is one fourth or 126th or 131st the size of the whole data set. Now asymptotically, per couple of weeks ago, that is algorithmically irrelevant. That's big O of the same thing, so to speak. But in real terms, having it taking a quarter of as much time, 126th the amount of time, 131st the real time is literally going to save us times on our watches. Like that in real human times will save time. And in fact, what you'll see in problem set five, in which you implement your first spell checker, you'll see that that's what we're trying to optimize for. In fact, as a quick teaser before we look at our final data structure here, you'll be challenged as part of this problem set optionally, if you'd like to opt in, to compete on the big board. Once your code is working per check 50, you can actually run a separate command with check 50 to post it to the leader board here. And right now, damn it, Brian is beaning both Doug and me because his implementation of the spell checker takes only 4.81 seconds and only 7.4 kilobytes versus my 82 megabytes of memory implementing a spell check over the whole lot of words. But how do you decide how to minimize space or minimize time and how do you mitigate some of the trade-offs? Well, let's look at one final data structure to consider. This is perhaps the most sophisticated and it takes up more space and so it's hard to paint on the screen. But suppose we did this. Suppose we were trying to store in our data structure people's names. We could do this with an array of a lot of strings. And we could do linear search and Brian or Doug or I could just use linear search big O of n and find any one you want. That's not so great. We could somehow use binary search if we used a tree or an array but kept the names sorted. We know we can do better. Just as we found Mike Smith pretty quickly in week zero. But what if we could find names in constant time? Whereby no matter how many words are in the tree, no matter how many words are in the dictionary more generally, still takes me the same amount of time to find anyone? And it doesn't get longer and longer the more names we add? So here is a type of tree goofily called a trie, T-R-I-E, which is an excerpt from retrieval, which is weird because it's retrieval and retrival but this is a trie, T-R-I-E. Each of the nodes in a trie, essentially, are an array themselves. Technically they're a structure with a little more inside of them. And you'll see this in the walk through that Zamyla put together. But each of the nodes in the trie are an array. Each of those arrays elements is a pointer to another such array. And the way you store words in a trie is not with characters, but implicitly with pointers. So if we want to put someone's name like Maxwell in here, we hash into this trie using the first letter of Maxwell's name, which is of course m. And that's going to be the 13th element in the array in my 26-element array here. I'm going to change that originally null pointer to be a pointer to another node. And then I'm going to hash on the second letter of Maxwell's name, which is A, and I'm going to allocate a pointer to another array. And then repeat that process for every letter in his name. So if I hash on his first letter, second letter, third letter, every time I do that it leads me to a new array. What's not shown here is that each of these arrays is size 26. It would just be atrocious to see on the screen. So it does use a bunch of memory. But the end of this, there's a special symbol drawn here is a delta symbol, but it can be anything, that just means Maxwell stops here. There's a word here. So how many steps does it take to find any name in the tree? Well, to find Maxwell it's M-A-X-W-E-L-L. So that's seven steps. For Maria it'd be M-A-R-I-A. That would be five steps. So it's still dependent on the number of letters in the name. But if there is a billion names in this dictionary, per this definition, how many more steps does it take to find Maxwell? M-A-X-W-E-L-L. How about if there's four billion names in the dictionary? How long does it take to find Maxwell? M-A-X-W-E-L-L. it's invariant. And if we assume that no human name is going to be super long, it's effectively constant whether it's 10 characters, maybe 30 characters or whatnot. That's effectively constant, which means a trie gives you constant time look up or big O of one, which means it's in theory the fastest of data structures. But of course you pay a price with more memory. And I know we're one minute over but let me tease you with this final look. And you'll see this data structure's implementation with Zamyla. But we begin to do transitionally now, especially if you're a little worried, especially as we're coming on the midpoint of the semester, like oh my god. Things are getting more and more sophisticated. We're kind of at the peak of a hill here because after problem set five do we transition to HTML and CSS and Python and JavaScript and web programming more generally. And next week, how the internet works. [VIDEO PLAYBACK] [MUSIC PLAYING] - He came with a message, with a protocol all his own. [MUSIC PLAYING] He came to a world of cruel firewalls, uncuring routers, and dangers from worse than death. He's fast. He's strong. He's TCPIP. And he's got your address. Warriors of the net. [END PLAYBACK] DAVID MALAN: All right. All that and more next week. We'll see you then.
B1 US memory array pointer malan david malan stack CS50 2017 - Lecture 5 - Data Structures 50 7 小克 posted on 2017/11/14 More Share Save Report Video vocabulary