
  • SPEAKER 1: All right, this is CS50 and this is week five.

  • And let's take a look at where we left off last time.

  • You may recall this guy here, Binky from our friends at Stanford.

  • And we used Binky to start talking about pointers.

  • What is a pointer?

  • So, a pointer is just an address, the location

  • of some piece of data in memory, because recall at the end of the day

  • your computer just has a few pieces of hardware inside of it, one of which

  • is RAM or Random Access Memory.

  • And in RAM you have the ability to store bunches and bunches of bytes,

  • or kilobytes, or megabytes, or gigabytes,

  • depending on how much memory you have.

  • And if you assume that no matter how much RAM you have you

  • can enumerate the bytes-- this is byte 0, this is byte 1, this is byte 2,

  • and so forth-- you can give each of the bytes of your computer's memory

  • an address and those addresses are simply called pointers.

  • And now in C we have the ability to use pointers

  • both to go to any location in memory that we want

  • and even to dynamically allocate memory in case we don't necessarily

  • know a priori how much memory we might need for a program.

  • Now, in terms of your computer's RAM, recall

  • that we divided the world into this picture

  • here whereby if this rectangular region, arbitrarily, represents your computer's

  • memory, here is how the computer divvies it up

  • when you're actually using a program.

  • At the bottom of your computer's area of memory, you have the so-called stack.

  • And recall that the stack is where any time you call a function,

  • it gets a slice of memory-- a frame of memory,

  • if you will-- for all of its local variables, all of its arguments

  • and anything else that it might need.

  • On top of that might go another slice or frame of memory

  • if that first function calls another.

  • And if that second function in turn calls another function,

  • you might have a third frame on the stack.

  • Of course, this doesn't end well if you keep calling function after function

  • after function after function.

  • And so, hopefully you don't accidentally induce some kind of infinite loop

  • such that these frames pile on top of each other infinitely

  • many times, because eventually they'll run the risk of hitting the heap.

  • Now, the heap is the same type of physical memory.

  • You're just using it in a slightly different way.

  • The heap is used any time you want to dynamically allocate memory,

  • when you don't know in advance how many bytes

  • you need but you do know once the program is running how many you now

  • want.

  • You can ask the operating system, via functions like malloc,

  • for some number of bytes, and those bytes

  • are allocated from the heap.

  • So, those bytes too have addresses, or numbers.

  • And so, the operating system, by way of malloc,

  • just figures out which of those bytes are not yet

  • being used so that you can now put whatever piece of data

  • you have in that particular place.
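That request can be sketched in a few lines of C; the function name allocate_ints is hypothetical, just to illustrate the pattern.

```c
#include <stdlib.h>

// Ask the operating system, via malloc, for enough heap bytes to
// hold count ints; malloc returns the address of that chunk, or
// NULL if no unused chunk of that size is available.
int *allocate_ints(size_t count)
{
    return malloc(count * sizeof(int));
}
```

A caller would check the return value against NULL before using the memory, and later hand it back with free.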

  • Now, beyond that appear things like initialized data,

  • uninitialized data.

  • That's where things like global variables that are initialized or not

  • end up that might be outside of your main function.

  • And then above that is the so-called text segment,

  • which are these zeros and ones that actually compose your program.

  • So when you double click an icon on Windows or Mac OS

  • to run a program or you type dot slash something in the Linux command line

  • environment in order to run a program, the bits that compose your program

  • are loaded also into memory up into this region here.

  • So, at the end of the day, you have access to just pretty generic memory,

  • but we use it in these different ways.

  • And it allows us to ultimately solve problems

  • that we might not have been able to in the past.

  • Recall for instance this example here, deliberately shown in red because it

  • was buggy. This does not work.

  • Now, logically, it does do the swap that we intend whereby a goes into b and b

  • goes into a.

  • And we achieve that result by way of this temporary variable

  • so that we have a temporary placeholder into which to store one of those values

  • while doing the swap.

  • But it had no permanent impact on the two variables that were passed into it.

  • And that was because by default in C any time you pass arguments to a function,

  • those arguments are passed so to speak, by value.

  • You get copies of those values being passed into a function.

  • And so, if main, for instance, has two variables, x and y--

  • as they did last time-- and you pass x and y

  • into a function like this one here swap, x and y

  • are going to get copied as a and b respectively.

  • So you might perfectly, logically, correctly swap a and b,

  • but you're having no permanent impact on x and y themselves.

  • But what if, per this green version here,

  • we reimplement swap to be a little more complicated

  • looking, but at the end of the day actually correct?

  • Notice now we've declared a and b not to be

  • integers but to be pointers to integers, the addresses of integers.

  • And that's what's implied by the star that we're putting

  • right there before the variable's name.

  • Meanwhile, inside of the body of this function,

  • we still have three lines of code.

  • And we're still using a temporary variable, and that in itself

  • is not a pointer.

  • It's just an integer as before, but notice

  • we're using this star notation again, albeit for a different purpose

  • to actually dereference these pointers.

  • Recall that int star a and int star b means give me a variable that

  • can store the address of an integer.

  • That's declaring a pointer.

  • Meanwhile, if you just say star a without declaring something

  • to the left of it with a data type like int,

  • you're saying go to the address that is in a.

  • So if a is an address, star a is at that address,

  • which of course per its declaration is going to be an integer.

  • Similarly, star b means go to the address in b.

  • Star a means go to the address in a and put the former into the latter,

  • ultimately putting the value of temp at the address in b-- so absolutely more

  • complicated at first glance, but if you consider again

  • the first principles of what's going on here, all we are doing

  • are moving things around in memory.

  • And we can do that now because we have the ability

  • to express the locations, the numeric locations of where

  • things are in memory.

  • But nicely enough, we, the programmer, don't have

  • to care where things are in memory.

  • We can access things symbolically as we're doing here with a and b.

  • So even though we might have seen on the screen

  • or you might see while debugging actual addresses of memory,

  • rarely does that actually matter in practice.

  • We can deal with everything we've learned thus far symbolically.
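That corrected, pointer-based swap can be written as follows; the variables x and y in the usage note are hypothetical caller variables, matching the walkthrough above.

```c
// Swap the integers at the addresses a and b. Because the caller
// passes addresses rather than copies, the change is permanent.
void swap(int *a, int *b)
{
    int tmp = *a;  // go to the address in a; save the integer there
    *a = *b;       // copy the integer at b's address to a's address
    *b = tmp;      // copy the saved integer to b's address
}
```

A caller with int x = 1, y = 2 would invoke swap(&x, &y), passing the addresses of x and y rather than their values.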

  • Now, last time we also took a look at the world of forensics,

  • and we took a look at how images are implemented and specifically file

  • formats like BMP, and JPEG, and GIF, and yet others.

  • And we glanced into her eyes here as we tried to enhance this image,

  • but of course, there was only finite amount of information.

  • So, what you see is what you get in terms of any kind of glint

  • or suspect in her eyes.

  • But we did this in part so that we could also

  • introduce another feature of C that allows us to declare our own data

  • types, indeed our own data structures.

  • For instance, we proposed that if you wanted

  • to write a program that stores a student,

  • you could actually declare your own student data type

  • inside of which is a name and inside of which is a dorm,

  • and anything else that you might actually want.

  • Meanwhile, this syntax here gives us a new data type called student

  • so that if we want to write a program that implements students,

  • we can actually wrap related information together like name and dorm

  • without having to maintain a whole bunch of strings

  • for just names and a whole bunch of strings for just dorms.

  • We can actually encapsulate things all inside of one structure.

  • And indeed encapsulation is another principle of computer science

  • that you'll see throughout programming and throughout the field itself.
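The student structure described above might look like this sketch; the field type char * stands in for CS50's string, and the example values below are invented.

```c
// A student wraps related pieces of data, a name and a dorm,
// inside one structure, rather than maintaining parallel arrays.
typedef struct
{
    char *name;  // the student's name (CS50's string is a char *)
    char *dorm;  // the student's dorm
}
student;
```

One could then declare student s; and access the members as s.name and s.dorm.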

  • So, what do we now do this time?

  • So, today we introduce more sophisticated ingredients

  • with which we can solve problems and we revisit a problem from the past

  • that we thought we had rather knocked off and had solved.

  • So, this might represent a whole bunch of names,

  • a whole bunch of numbers, a whole bunch of telephone numbers in a phone book

  • back to back to back to back stored in this case

  • in the form of an array, the simplest of data structure, so to speak,

  • that we've discussed thus far.

  • And an array, again, is a contiguous block of memory, each of whose elements--

  • typically of the same data type, integers, or strings, or the like--

  • and they are by definition back to back to back to back,

  • which allows you random access.

  • Which means you can jump to any of these locations

  • instantly just by using in C that square bracket notation

  • or as we saw last time using pointer arithmetic,

  • actually using the star operator and maybe adding some number to

  • an address to get at some subsequent address.
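That equivalence between square brackets and pointer arithmetic can be sketched like this; the function name and the example values are illustrative, not from the lecture.

```c
// array[i] is, by definition, *(array + i): start at the array's
// address, advance i elements, and dereference what's there.
int third(int *array)
{
    return *(array + 2);  // identical to array[2]
}
```

Because the elements are back to back in memory, the computer can compute any element's address in one step, which is what makes random access constant time.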

  • But it turns out there's a few problems with this fundamental approach.

  • Nice and simple as it is, it would seem that we've rather painted ourselves

  • into a corner with this approach.

  • This array has 1, 2, 3, 4, 5, 6 total elements, at least as depicted here.

  • So that's fine if you want to insert a number, and then

  • another number, and then four more numbers.

  • But what if you want to then insert a seventh number,

  • not to mention an eighth number or a ninth number or the like?

  • Well, where do you put them?

  • Well, you might think, well, that's fine.

  • I'm just going to go put the seventh number over here,

  • or the eighth number over here, or the ninth number over there.

  • But you can't just blindly do that.

  • If this memory is being managed not by you

  • per se but by malloc and by the computer itself,

  • this memory over here, while it might physically exist,

  • might be used by some other part of your program altogether.

  • It doesn't necessarily belong to you unless you've asked for it.

  • And the problem with an array is that as we've seen it typically

  • you declare their size in advance, as with the square bracket notation,

  • and say give me six integers or give me six something or others, but that's it.

  • You have to decide in advance.

  • You can't just grow it as you can in some programming languages thereafter.

  • You've rather painted yourself into a corner.

  • But with malloc and other functions like we saw last time,

  • you can actually allocate more memory using malloc.

  • Unfortunately, it might end up in another location in your computer's

  • memory, so you might have to do some copying

  • to take the original six elements and move them

  • elsewhere just to make room for more.

  • And there is a dedicated function for that, something called realloc

  • or reallocate.

  • And indeed it can do exactly that.

  • It can give you a bigger chunk of memory and reallocate

  • what was previously there to be a larger size.

  • But you have to do a little bit of work.

  • You have to invoke it in order to achieve that.

  • You can't just blindly keep adding things at the end of this array.
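A sketch of that realloc pattern, assuming an array previously allocated with malloc; the function name resize is hypothetical.

```c
#include <stdlib.h>

// Grow (or shrink) a malloc'd array of ints to new_count elements.
// realloc may move the data to a new chunk elsewhere in memory,
// copying the old values and freeing the old chunk itself; on
// failure it returns NULL and leaves the original chunk untouched.
int *resize(int *array, size_t new_count)
{
    int *tmp = realloc(array, new_count * sizeof(int));
    if (tmp == NULL)
    {
        free(array);  // avoid leaking the original chunk
        return NULL;
    }
    return tmp;
}
```

Note the temporary pointer: assigning realloc's result directly to array would leak the original chunk if the call failed.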

  • Now, unfortunately, that solution might not be very efficient.

  • Even if you can allocate a bigger chunk of memory that's bigger than six

  • because you have more numbers, for instance, to store,

  • what if that takes a bit of time?

  • And indeed it's going to.

  • If you allocate more integers somewhere else in memory

  • you still have to copy those original values,

  • and now it just feels like you're wasting time.

  • Now, instead of just inserting things into the list,

  • you might have to copy it into a bigger space, reallocate things, grow.

  • It's a lot more work.

  • And all of that discussion of running time and performance

  • comes back into play, because if that whole copying process and reallocating

  • is costing you time, your algorithm or your program

  • ultimately might not really be as fast as you might want it.

  • So, what could we do instead?

  • What could we do instead in order to solve

  • this problem dynamically, so to speak, that being the operative word.

  • And luckily enough, last week we learned that there is dynamic memory allocation

  • in C by way of that function malloc.

  • And we also learned that there are ways of representing structures

  • in C that you don't necessarily get with the language itself,

  • because they're not primitives.

  • They're not built in.

  • In other words, let me propose this as a solution to our problem.

  • This is a list of, let's see, five numbers it would seem,

  • 9, 17, 22, 26, and 34.

  • Pretty arbitrary right now, but you might

  • imagine very simply drawing those same numbers-- 9, 17, 22, 26,

  • 34-- in the form of an array and they're clearly deliberately sorted.

  • But again, what if you wanted to grow that array or even shrink that array

  • dynamically over time?

  • Well, let me propose that we not draw those numbers back to back to back

  • to back literally next to each other but allow ourselves potentially

  • a little bit of space?

  • But if that's the case and nine is here in my computer's memory and 17 is here

  • and 22 is here, or over here, or over here-- in other words,

  • what if I relax the constraint that my numbers or my data types more

  • generally have to be stored contiguously back to back to back to back in memory

  • and instead allow them to be anywhere, indeed anywhere a function like malloc

  • wants to give me more memory, that's fine.

  • If it wants to give me memory up here in my computer, I'll deal with that.

  • If it wants to give me extra memory over here, that's fine.

  • I'll deal with it, because I'll use these conceptual arrows to stitch

  • together my data structure this time.

  • And now, where have we seen these kinds of arrows before?

  • What feature of C allows us to connect one thing

  • to another, to get, a la Chutes and Ladders, from one place to another?

  • Well, that's exactly what we saw last time which was pointers.

  • While we've drawn these here per the snippet from a textbook using arrows,

  • those are really just pointers.

  • And what does each of these rectangles represent?

  • Well, clearly a number in the top half of the rectangle,

  • but I claim that at the bottom half of these rectangles

  • let's consider that bottom rectangle to just be another piece of data,

  • specifically an int star, a pointer.

  • Or rather, not just an int star, because it seems to be pointing not just to the number

  • but to this whole rectangle, so I need some new terminology.

  • I need some kind of structure to contain an integer and this pointer.

  • And for that, I think I'm going to need a struct.

  • And indeed let me propose that to solve this problem we give ourselves

  • this building block as a new C data type called a node.

  • You can call it anything you want, but the convention

  • would be to call something like this--

  • a puzzle piece or a building block in a data structure--

  • a node.

  • Let me propose that we define it as follows.

  • I'm using that same syntax from last time with which we declared a student

  • data type, but here I'm saying inside of this data structure,

  • this node shall be an int.

  • And that's pretty straightforward.

  • Just like a student might have a name and a dorm,

  • this node will have an int called n arbitrarily.

  • And then the only piece of detail that's a little bit new now is

  • the second line, struct node star next.

  • Now, what does that mean?

  • It's pretty verbose, but struct node is just recursively, if you will,

  • referring to this same type of data structure.

  • Star means this is going to be a pointer, the address of one such thing,

  • and next is just an arbitrary but pretty reasonable

  • name to give to such a pointer.

  • So this line here, struct node star next,

  • is the incantation in C with which you declare

  • one of those arrows that will point from one node, one rectangle

  • to another node, another rectangle.

  • And the fact that we have a little bit of additional verbiage

  • up here, typedef struct node, is because again C

  • is a language that is read top to bottom, left to right,

  • so words have to exist before you actually use them.

  • So, whereas last time when we declared a student,

  • we didn't actually mention struct student or anything like that.

  • We just said typedef open curly brace.

  • Today, when declaring a node, we actually

  • have to have some additional syntax here just called struct node.

  • And technically this word could be anything,

  • but I'll leave it as node for consistency.

  • And that allows me inside of this definition

  • to specify that the second data member is

  • going to be a pointer to exactly that kind of data structure.

  • But typedef, just to be clear, allows me to type a smaller name for this data

  • structure here, which I will simply call node at this point.
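Put together, the declaration just described looks like this; the comments are added for clarity.

```c
#include <stddef.h>  // for NULL, the sentinel at the list's end

// A node wraps an int and a pointer to the next such node.
// "struct node" must be named up front so that the next field can
// refer to the type before the typedef itself has completed.
typedef struct node
{
    int n;              // this node's value
    struct node *next;  // address of the next node, or NULL
}
node;
```

Thereafter the shorter name node suffices anywhere the verbose struct node would be needed.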

  • So, what can we actually do with this kind of data structure now?

  • And, indeed, let's give this data structure a name.

  • Let's start calling it a linked list.

  • Previously, we had arrays, but now we have linked lists, both of which

  • at the end of the day are types of lists, but linked lists,

  • as the name suggests, are linked or threaded together using pointers.

  • Now, when you have a linked list, what might be some operations,

  • some algorithms that you might want to run on them?

  • Well, if you've got a linked list of say numbers, for the sake of discussion,

  • you might want to insert a new number into that list.

  • You might want to delete a number from that list

  • and you might want to search that list.

  • And that allows us to then consider how we might implement

  • each of these kinds of things.

  • But it turns out, while fairly simple intuitively,

  • we're going to have to be a little careful now by way of our pointers.

  • So, let's more formally declare a linked list to look something like this.

  • It's a collection of nodes that are linked together

  • with pointers as represented by these arrows here,

  • but we're going to need some special pointer, at least

  • at the beginning of the list.

  • Let's just call it first.

  • It doesn't necessarily store an actual integer.

  • First itself is just a pointer to the start of the list.

  • And by way of that pointer can we access the first actual node in the list.

  • From there can we get at the second, from there can we get at the third,

  • and the fourth, and the fifth, and any number of others.

  • And this syntax over here might just represent null.

  • Because you don't want to have that pointer just

  • pointing off into no man's land, that will have to be a null pointer so

  • that if we check for that with a condition we know,

  • OK, we're at the end of the list.

  • So, let's pause for just a moment and consider these three algorithms--

  • insert, delete, and search, and consider what's going to be involved.

  • Well, how would you go about searching for an element of this list?

  • Suppose I wanted to find the number 22?

  • What do you do?

  • Well, me, I, the human can just look at this and be like all right,

  • 22 is right there.

  • But a computer can't do that.

  • A computer every time we've had this discussion

  • can only look at one thing at a time.

  • But moreover the computer this time is even more constrained

  • because it can't just use our old friend binary search

  • or divide and conquer, because how do you get to the middle of a linked list?

  • Well, you have to find your way there.

  • The only thing you have in a linked list from the outset

  • is one pointer called first or whatever it is, but one pointer that leads you

  • to the beginning of the list, the first node in the list.

  • So, if you want to get to the second node in the list,

  • you can't just go to bracket one, or bracket two, or bracket three

  • to get any number of other elements in the list.

  • You have to follow these bread crumbs, if you will.

  • You have to follow these arrows or these addresses to go from one node's address

  • to the other to the other.

  • And so, we've paid a price already.

  • And we'll see that there is still an advantage here,

  • but what's the running time?

  • What's an upper bound on the running time of search for a linked list,

  • even if it is sorted?

  • Any thoughts?

  • Is it constant time like big O of 1?

  • Is it log of n?

  • Is it n, n squared?

  • What's the running time going to be?

  • Well, they're sorted, and that was this magical ingredient, this assumption

  • we've been allowed to make in the past which was helpful,

  • but that assumed that we had random access.

  • In C, we had square bracket notation, so that using some simple arithmetic

  • we could jump roughly to the middle, and then the next middle,

  • and the next middle looking for Mike Smith or whatever element it is we're

  • looking for.

  • Unfortunately here, one price we have already

  • paid by taking this step toward linked lists is linear time.

  • Big O of n would seem to be the running time of searching a linked list,

  • because the only way you can start is at the beginning,

  • and the only way you can get through the list is by following these arrows.

  • And if there's n nodes in the list, you're

  • going to need as many as n steps to find, something like 22, or 26, or 34,

  • or any elements all together.
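That linear search might be written as follows, reusing the node type from earlier; the function name search is a sketch, not the lecture's own code.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Follow next pointers from the first node, one at a time; with
// no random access, this is O(n) even if the list is sorted.
bool search(node *list, int value)
{
    for (node *ptr = list; ptr != NULL; ptr = ptr->next)
    {
        if (ptr->n == value)
        {
            return true;
        }
    }
    return false;
}
```

There is no way to jump to the middle, so binary search is off the table; the loop simply walks the chain of addresses until it finds the value or falls off the end at NULL.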

  • Well, that's not all that great.

  • What about insert?

  • What's an upper bound on the running time of insert?

  • Well, here too it depends.

  • Suppose that we don't care about keeping the list sorted.

  • That's kind of a nice advantage, so I can be a little lazy here.

  • So, what's the running time going to be if I

  • want to insert a new number like the number 50 into this list,

  • but I don't care about keeping it sorted?

  • Well, instinctively, where would you put this element?

  • Where would you put it?

  • You might be inclined-- you kind of want to put it over here,

  • because it's the biggest element.

  • But again, if you don't care about keeping it sorted,

  • where is the fastest, the quickest and dirtiest place to put it?

  • I would propose let's just put it at the front of the list.

  • Let's take this first pointer, point it at the new number

  • 50 that we've somehow added to the picture

  • as by calling malloc, asking malloc for a new node.

  • And then have 50, in turn, point to the number 9, and then 9 can point to 17,

  • and 22, and so forth.

  • What if we want to insert another number, 42,

  • and we don't care about where it goes?

  • Well, why don't we just put it at the beginning of the list?

  • Then we have the first pointer pointing at 42,

  • which in turn should point at 50, which in turn can point at 9, then 17, then

  • 22, and so forth.

  • So, if we're just lazy about this, we can actually

  • achieve a great running time for insert: constant time.
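A sketch of that lazy, constant-time insertion at the front, again assuming the node type from before; passing first by address (a node **) is one common way to let the function update the caller's pointer.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Prepend value in O(1): malloc a node, point it at the old first
// node, and only then repoint first at the new node.
bool insert(node **first, int value)
{
    node *new = malloc(sizeof(node));
    if (new == NULL)
    {
        return false;
    }
    new->n = value;
    new->next = *first;  // new node points at the old head (or NULL)
    *first = new;        // the list now starts at the new node
    return true;
}
```

No matter how long the list grows, this takes the same small number of steps, which is exactly why the unsorted insert is so cheap.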

  • Unfortunately, if we want to keep things sorted then

  • we're going to have to incur a linear time cost again, right?

  • Because if we have to insert 42 or 50, worst case

  • they might belong all the way at the end of the list and that's

  • Big O of n steps.

  • And delete, too, unfortunately, whether it's sorted or unsorted

  • is also like search going to be Big O of n

  • because you don't necessarily know when you're searching for a number

  • to delete if it's going to be at the beginning, the middle, or the end.

  • So, in the worst case, it might indeed be at the end.

  • You know what?

  • Why don't we instead of walking through this verbally,

  • let's see if we can't get some volunteers?

  • Can we get seven volunteers to play-- wow, to play the role of numbers here.

  • 1, 2, 3, 4, 5, 6, and yes, 7, come on up.

  • All right, so I have here some printouts for all seven of you

  • that represent exactly the nodes that we have here on the screen.

  • Let's meet one of our first contestants.

  • What is your name?

  • AUDIENCE: Scully.

  • SPEAKER 1: Scully, nice to see you.

  • So, you shall be literally first and represent our first pointer.

  • So, if you want to come and stand roughly over here.

  • And then what is your name?

  • AUDIENCE: Maria.

  • SPEAKER 1: Maria, nice to see you.

  • And you can be the number 9 right next to our first contestant.

  • And your name?

  • AUDIENCE: Sarah.

  • SPEAKER 1: Sarah, nice to see you.

  • You shall be the number 17.

  • And your name?

  • [? AUDIENCE: Satoshi. ?]

  • [? SPEAKER 1: Satoshi, ?] nice to see you.

  • You shall be 20.

  • And your name?

  • [? AUDIENCE: Mosof. ?]

  • [? SPEAKER 1: Mosof, ?] nice to see you.

  • And you shall be 22.

  • AUDIENCE: Jed.

  • SPEAKER 1: Jed, nice to see you-- 29, formerly 26.

  • And your name?

  • AUDIENCE: Erin.

  • SPEAKER 1: Erin, nice to see you.

  • You shall be 34.

  • All right, so what we have here is seven elements, six of which

  • are very similar to one another, one of which is fundamentally different.

  • So, Scully here represents first, and indeed her sheet of paper

  • is horizontal to suggest that she is not a node.

  • She is just going to be the pointer to a node in the list.

  • Everyone else's nodes are vertical, as have

  • been the rectangles we've been drawing on the screen,

  • because each of these guys represents a number as well as a next pointer.

  • Now of course, you're only seeing in front of you the number.

  • So we're going to go ahead and if you wouldn't mind,

  • use your left hand to represent the arrow that we've long had on the screen

  • to point to the person next to you.

  • And Erin, you're a bit of an anomaly, but also

  • because you need to have a null pointer at the end of the list,

  • so that you're not just pointing aimlessly.

  • And pointing to the ground seems fine, so literally

  • pointing to the ground we will interpret as null.

  • So, Scully, you are the only thing keeping this list together,

  • so to speak.

  • So, you too need to point with your one pointer to Maria there.

  • So, here we have a linked list.

  • And just to make this clear, could everyone separate from each other

  • by a step or two in any direction?

  • Notice that the list is still-- it's almost identical to before.

  • Can some of you take a step forward, a step back,

  • but still point at the other person?

  • So, now we're capturing a little more accurately the fact

  • that these nodes and these pointers can be anywhere in memory,

  • so long as they're linking themselves together by way of these pointers.

  • All right, so suppose now that I want to insert

  • an element like 55, which happens to belong at the end of this list.

  • Let me go ahead and malloc Christian if I could.

  • So we have asked malloc for a chunk of memory equivalent

  • to the size of one integer and one pointer.

  • That is going to be represented with this rectangle here.

  • Nice to see you.

  • AUDIENCE: Nice to see you.

  • SPEAKER 1: You shall be 55.

  • And I'll play the role of the temporary pointer, or predecessor pointer,

  • just using my left or right hand

  • to try to figure out where Christian belongs.

  • So, just like we might want to search the list,

  • inserting is fundamentally rather the same.

  • The only thing I have access to at the outset of this algorithm

  • is my first pointer, and it's only by way of Scully

  • that I can even access the rest of the list.

  • I cannot jump to the middle, jump to the end.

  • I can only start at the beginning and literally follow the pointer.

  • So, let me go ahead and do that.

  • Let me go ahead and point at whatever Scully

  • is pointing at, which happens to be Maria, which is the number 9.

  • 55, of course, is bigger than that, and I

  • do want to keep the list sorted for today's purposes.

  • So I'm going to very carefully follow the next pointer, 17.

  • Follow the next pointer, 20.

  • Follow the next pointer, 22.

  • Follow the next pointer, 29.

  • Follow the next pointer, 34.

  • Follow the next pointer, ah dammit, null.

  • And so this is why it is important with some of these algorithms

  • to have a predecessor pointer, a second pointer or really my left hand

  • so that maybe my left hand can still point at Erin.

  • My right hand can realize, ah, null, so that I still

  • have access to the last node in the list so that Christian--

  • if you could come over here.

  • I'm going to go ahead and tell Erin quite simply to point at Christian.

  • Good, and let's just say for students' sake come on over here,

  • but technically we could have left Christian there and just had

  • Erin pointing at him.

  • It's just going to get a little confusing before long,

  • so we'll just cheat and move you right over here.

  • But now we have a linked list that has one additional member.

  • Suppose now that we want to make another insertion-- pardon.

  • Let me go ahead and propose that we insert say the number 5.

  • Well, the number 5, of course, belongs at the beginning of the list.

  • So you know what?

  • I need to malloc.

  • Can I malloc Jordan off camera-- five, perhaps?

  • So malloc, a very slow return value.

  • OK, we're going to store your n value, 5, here.

  • His pointer is undefined right now, because he's not actually

  • pointing at anything.

  • And so where does he ultimately belong?

  • Well, he belongs at the start of the list.

  • So, let me deliberately make a mistake.

  • Let me go ahead and update Scully to point at Jordan,

  • thereby putting Jordan effectively at the front of the list.

  • Unfortunately, whom should Jordan now point at technically?

  • It should be Maria, but this is code.

  • The only thing we can do is copy pointers in memory,

  • and if Scully's left hand is no longer pointing at Maria,

  • I have literally orphaned the entirety of this list.

  • I have leaked 1, 2, 3, 4, 5, 6, 7 chunks of memory,

  • seven nodes, because I got my order of operations out of order.

  • Indeed, I should have done what-- let's undo, control Z.

  • And now let me go ahead and do what?

  • Jordan should point at the exact same thing Scully is pointing at,

  • which has no downside.

  • Even though it feels redundant, we've not lost any information.

  • And now that Jordan is pointing at Maria,

  • Scully's pointer can be pointed at Jordan.

  • And now, even though the list looks a little weird,

  • this is a key feature of the linked list.

  • These nodes could have been malloc'd from anywhere.

  • So indeed, even though we initially kept everyone physically sorted

  • left to right-- and you've all cleaned the list up even since-- that's OK.

  • The point is that all of these nodes are linked together.

  • So, thank you so much to our volunteers.

  • You can keep these pieces of paper and later on we'll

  • have some stress balls for you.

  • But that's the key idea here behind a linked list.

  • Thank you.

  • So, of course, there are some more complicated operations

  • that we might have to deal with.

  • For instance, if we want to insert into the middle of the list,

  • that's going to be a little more of a burden on me, the program,

  • keeping track of where things have to go.

  • But nicely enough, there's only these three cases--

  • the beginning of the list, the end of the list, and the middle of the list,

  • because middle of the list doesn't have to mean literally the middle,

  • just anywhere that's not the beginning or the end.

  • Of course, we should be careful to make sure

  • that we handle the empty list scenario, which

  • is equivalent to putting something at both the beginning of the list

  • and the end of the list.

  • But that would be perhaps a special case we could deal with separately.

  • Of course, there are other operations like inserting-- or rather removing

  • from the tail of the list, removing from the head of the list,

  • and removing in the middle.

  • And that would be the opposite of malloc, if you will.

  • And in those cases, we have to take care to call

  • our friend free to free those bytes of memory,

  • give them back to the operating system so that we don't leak memory.

  • But there, too, I'm probably going to have to be careful as to what order

  • I change my pointers and free nodes.

  • Because what you don't want to do, and what unfortunately you

  • might very well accidentally do at some point,

  • is free a pointer and then try to access that pointer or change the pointer,

  • even after you've told the operating system I'm done with this address.

  • That can give you what's called a segmentation

  • fault, which is just one of the ways in which you

  • can deduce that kind of mistake.

  • So, let's actually implement one of these methods.

  • And we'll pluck off one that allows us to actually take

  • a look at the syntax with which we can manipulate pointers.

  • And let's go ahead and implement a function

  • called search, for instance, where search

  • I propose, just returns a bool-- true or false-- whether this number n

  • is in the given list.

  • And now, why have I said node star list?

  • Well, at the end of the day, a linked list is just a whole bunch of nodes.

  • But the first of those nodes that we keep calling first is of what

  • data type?

  • If you have a pointer, a variable, that's

  • pointing to a linked list, that means it's storing the address of a node,

  • otherwise known as a node star.

  • So, this would be the syntax with which you can pass to a function something

  • like a linked list.

  • You simply have to pass it a pointer to the first element in that list.

  • And if I want to go ahead now and implement this,

  • let me go ahead and propose the following.

  • Let me go ahead here and give myself a temporary value, so node star pointer

  • we'll call it, PTR.

  • And that's going to equal the start of the list.

  • So, I'm just creating another box of memory

  • and I'm storing inside of it the same address

  • that I was passed in, just so that I have a temporary variable that I

  • can use to update.

  • After this, let me go ahead and say while that pointer is not

  • equal to null-- because recall that null is this special sentinel value that

  • means end of the list.

  • So inside of this loop, what do I want to do?

  • I'm going to go ahead and say if pointer--

  • and now I have to get at the number inside of it.

  • So, if I recall from the last time, we only spent a little bit of time

  • on the student example, but we said something like student

  • dot name or student dot dorm.

  • And in this case I'm inclined to say pointer dot n, where n is

  • the number, the integer that's inside.

  • But pointer this time is not a struct, per se.

  • It's the address of a node.

  • It's the address of a struct.

  • And so, perhaps the most intuitive piece of syntax in C,

  • at least retrospectively now, is that if you

  • want to access a piece of data that's inside of a node

  • and you have a pointer to that node much like our arrows in the pictures imply,

  • you literally draw an arrow using a hyphen

  • and then using a right angle bracket.

  • So, now if we do see-- whoops, let me finish my thought.

  • If pointer arrow n equals equals the n we're looking for, let me go ahead in here

  • and say return true.

  • Or else, let me go ahead and not return false,

  • because I don't want to just check one element and then blindly say false.

  • I instead want to say pointer should get pointer arrow next.

  • And then only after that loop is all complete

  • should I say something like nope, return false.

  • So, what's actually going on here?

  • The function declaration, again, took in two arguments--

  • one, an int n that we're looking for, two a pointer to a node,

  • otherwise known as a node in a linked list.

  • And per the pictures we've been drawing, you

  • can access any other element in that linked list by way of the first element

  • in that list, as suggested here.

  • So, now I'm just giving myself a temporary variable called pointer,

  • but I can call it anything I want.

  • And I'm declaring it as node star, so that it can store the address of a node

  • as well, and then I'm just initializing it to be

  • the exact value that was passed in.

  • So, I don't want to accidentally break the list that I was passed.

  • I don't want to change the value of my parameter

  • unnecessarily and complicate things.

  • I just really want a temporary variable, much

  • like i in the world of loops, that allows me to constantly iterate

  • through something and update it as I go, while all along the way

  • I want to be checking this.

  • While pointer is not null.

  • If pointer is null, that means I'm at the end of the list,

  • or maybe more curiously, I was passed null, in which case there is no list.

  • And that's a valid scenario that could happen, even though it's a bit strange.

  • But if pointer is null, I don't want to proceed further inside of this loop.

  • But so long as pointer is not null, let me go ahead and do this.

  • Let me follow that pointer and go inside that node and say

  • is your n value equal equal to the n

  • value that I've been asked to search for?

  • And if so, return true.

  • I don't want to just return false now because otherwise I'd

  • only ever be checking the first element in a linked list.

  • So, I now want to do the equivalent in spirit of i plus plus.

  • But I'm not using i's.

  • I don't need to use pointer arithmetic here,

  • and indeed it won't work because I have this thing stitched together.

  • It's not an array of contiguous memory.

  • I want to say that my current temporary value

  • pointer should get whatever pointer arrow next

  • is, and then let this loop continue.

  • If it's not null, check again.

  • If its n value equals what I'm looking for, and repeat, repeat,

  • repeat until pointer equals null.

  • So, let's make this more concrete.

  • Let me go ahead and just draw a temporary picture here.

  • And let me suppose here that what I have been passed

  • is something like the following.

  • Let's do a very simple linked list that has maybe the number one,

  • and has the number two, and has the number three.

  • And, again, I've drawn gaps between these nodes,

  • because they could be anywhere in memory.

  • So, technically they don't need to be left to right like this,

  • but that'll keep us sane.

  • And if this is indeed a correct linked list,

  • there are pointers in each of those fields

  • that point to the next node in the list, and that slash

  • I drew in the last one just means null.

  • You can draw it however you want.

  • But for a linked list to work, we need to know the beginning of this thing.

  • So, we'll call this first, and that of course

  • has to point to the first element in my linked list.

  • So, here's the state of our world.

  • It's a linked list quite similar in spirit to the other one.

  • It's a little shorter, just for the sake of discussion.

  • And now, let's consider my function search,

  • which again takes two arguments.

  • So, the first argument is of type int, called n.

  • And suppose I'm searching for the number three.

  • The second argument is a node star, so the address of a node called list.

  • So, what does that mean?

  • When this function search is called, let's

  • suppose that we currently have the value n, which

  • is going to be 3, because that's arbitrarily

  • the number I've decided to search for.

  • And then this other value pointer is going to be initialized-- sorry,

  • not that.

  • List is going to be whatever pointer is passed

  • in as the second argument to this function.

  • So, let's suppose that this linked list, this sample linked list at the top,

  • is indeed what is passed in to this function.

  • So, I've passed in 3, because I want to search for 3.

  • So what goes here?

  • If I pass this sample linked list into my search function,

  • what is the value of list?

  • Well, list, if I'm passed this first pointer,

  • is really going to be the pointer to the first element in the list.

  • That's all we're saying.

  • Node star list just means give me the address

  • of a linked list, which means give me the address of the first node

  • in the linked list, which means that initially when

  • I call search my picture-- my stack frame, if you will,

  • in terms of my local arguments-- is going to look like this.

  • All right, so with that said, how does this code work?

  • Recall that in this code we have this while loop that

  • just sits there checking whether the current node's

  • n equals equals the one we're looking for, and if not, it updates it.

  • So, we need one more local variable called pointer that's

  • initialized to the start of this list.

  • So, this will be a pointer and it's initialized to the same thing

  • that my second argument is initialized to.

  • So, this now is the state of our world once one line of code has executed,

  • that very first one in there.

  • So, now let's implement the loop.

  • While pointer does not equal null, so here's pointer.

  • Does it equal null?

  • No, because if it did, we would just draw

  • a slash or some other piece of syntax.

  • But it's pointing at clearly something that exists.

  • So, this node here has some valid address.

  • Pointer is pointing at it, so it's not null.

  • So, what do I do inside of my code?

  • I check if the n value inside of pointer, PTR,

  • equals equals the number I'm looking for,

  • and if so, return true, otherwise, if not I update pointer.

  • So let's check.

  • Let's follow the arrow, PTR, and look at the value n.

  • Recall that the top of these boxes is n.

  • The bottom of them is called next-- n, next, n next.

  • So, I followed this pointer.

  • I'm looking at the box called n.

  • Does 3 equal 1?

  • No, obviously not.

  • So I update-- pointer gets pointer next.

  • So, to be clear, pointer gets pointer next,

  • that second to last line of actual code.

  • So, what does that mean I need to do?

  • That means I need to update pointer to be equal to pointer next.

  • What is pointer next?

  • Well, here's pointer and here's next.

  • We were looking a moment ago at n.

  • Now I'm looking at next.

  • So, pointer next means that I should update whatever is inside this box--

  • and a lot more on the screen-- to be equal to pointer next, which

  • is this field.

  • This field is pointing at that, so that line of code

  • has the effect of updating PTR to simply point at the second node.

  • So, what happens next?

  • I seem to still be in that loop and I say, well, pointer does not equal null,

  • and it doesn't.

  • It's pointing at that second node.

  • If pointer arrow n equals equals n, but no that's not

  • the case, because I'm looking for three.

  • I'm pointing at two, so that is again false.

  • So, again, I don't return true.

  • I instead update pointer to equal pointer next.

  • So, what has to happen here, at the risk of deleting my handiwork again,

  • now pointer gets pointer next, which is this element, which

  • is equivalent to pointing at this node here.

  • And so, now I'm still inside that loop while pointer-- not equal to null.

  • It's not null.

  • If pointer arrow n equals equals n, well, let's follow that logic.

  • If pointer, follow the arrow, n equals equals n,

  • three-- which is the one I'm looking for-- returned true.

  • And so, how then does this function ultimately behave?

  • It would seem in this case to return true, because I have eventually

  • found that number three.

  • What would happen by contrast if I were looking not for three, but for four

  • with this code?

  • In other words, what if I'm not looking for three and I want to go one

  • step further?

  • Well, one step further is going to update PTR

  • to equal null, that slash in my last node.

  • And that means code wise, I'm going to break out

  • of that loop, because pointer now does equal null.

  • And so, by default that very last line of code return false, not found.

  • So, complicated at first glance, and it certainly

  • looks more complicated than things we've written before,

  • but again, if you go back to basics, what does each of these lines mean?

  • Consider that there's no magic here.

  • This first line means give me a variable that's a pointer to a node.

  • It'd be in other words the address of a node

  • and assign it whatever I was passed in.

  • While pointer not equals null-- we've seen null before.

  • It's this special zero value, and I'm just

  • making sure that the pointer I'm using, PTR, does not equal that special value.

  • And then inside of this loop I'm using one piece of new syntax, this arrow

  • notation, which just like the picture suggests means go there, and then

  • look at the field called n and check if it equals the n you're looking for,

  • and if so return true.

  • Otherwise, update yourself much like i plus plus

  • but specifically update pointer to be whatever the value is when you

  • follow the arrow in that next field.

  • So, this of course is just search.

  • We've not actually changed the list, but imagine, if you will,

  • that you could now implement insert and delete,

  • not simply by following these pointers but actually changing

  • the value of next in a node to the left, a node to the right, or a new node all

  • together.

  • So, who cares?

  • Why did we add all of this complexity?

  • We had arrays, which were working really well for a whole bunch of weeks,

  • and now we've claimed that arrays are not so good.

  • We want to use linked lists instead.

  • But why might we want to use linked lists?

  • Well, linked lists give us dynamism.

  • We can call malloc and give ourselves more, and more, and more nodes

  • and grow our list of numbers, even if we don't know in advance how

  • many such numbers we need.

  • And we can shrink them, similarly, so we don't

  • have to allocate a massive array unnecessarily.

  • We can shrink our data structure based on how many numbers we actually need.

  • But we're paying a price.

  • Search is a little bit slower, delete is a little bit slower.

  • Insert would be slower if we insist on keeping things sorted,

  • so we've paid this price.

  • And indeed, this is thematic.

  • In CS50, in computer science more generally,

  • there's often going to be these trade-offs of time,

  • or space, or just complexity, or your human time.

  • Any number of resources can be in scarce supply.

  • And, indeed, we've seen by way of linked lists

  • that we're solving one problem while introducing another.

  • It's like that silly situation you might see

  • memes of where you cover your hands-- put your hand around a hose that

  • has a leak and all of a sudden another leak springs up over there.

  • We're just moving the problem elsewhere, but maybe that leak

  • is less problematic to us than this one here.

  • So, again, it's this theme of trade-offs.

  • Now, this here is Mather Dining Hall.

  • And this, of course, is a whole bunch of trays

  • where you might go and get some food, lunch, breakfast,

  • or dinner, or the like, and you pick up this tray and you put food on it.

  • But what's interesting about trays, as Mather and a lot of cafeterias do,

  • is trays are stacked one on top of the other.

  • And it turns out now that we have this second building

  • block with which to create data structures-- we're not just

  • using arrays anymore.

  • We now have pointers and this general idea of linking nodes together

  • in our toolkit, we can now start to imagine more interesting data

  • structures that solve problems in slightly different ways.

  • For instance, suppose that I wanted to implement

  • this paradigm of stacking things on one on top of the other like this.

  • Indeed, this is a data structure called a stack,

  • and it generally has two operations associated with it,

  • push to push something on the stack and pop to take something off of the stack.

  • And this is perhaps a useful data structure

  • if you just want to store numbers or something else in really just

  • the most efficient way for you without regard really to fairness.

  • So, for instance, if this table here is my initial data structure

  • and it's empty and I have a piece of information that I want to store,

  • I'm just going to going ahead and put it right there.

  • And now suppose I want to push another number onto the stack.

  • I'm just going to go ahead and simply put it

  • on top, third number on top, fourth number on top.

  • But I've now committed to a certain property, if you will,

  • a certain property whereby the last tray in has to be the first tray out,

  • otherwise known in computer science as LIFO-- last in,

  • first out-- because if I want to get that first number, I mean,

  • I've created a mess for myself.

  • I have to lift these all up or move them just to get at it.

  • That just seems stupid.

  • Intuitively, the easiest thing to grab is probably going to be the top,

  • but that's not necessarily the first element I put in.

  • But that's OK.

  • This might still be a valid data structure--

  • and indeed later in the term when we introduced web programming

  • and we look at languages like HTML, there's

  • actually a number of applications where it's actually super useful

  • to have a data structure where you can just stack stuff on top of each other

  • in order to tuck some data away for subsequent use.

  • And, indeed, when we talked about memory in our own computer,

  • stacks clearly have some value.

  • We talked about main and then swap, and then maybe other functions.

  • There are many contexts, one of which we've seen already, where in life you

  • actually want to stack things on top of each other

  • so as to keep track of really what did I do most recently, because that's

  • the next problem I'm going to deal with or the next frame--

  • in the case of memory-- that I'm going to pop off of the stack.

  • So, how might we implement this data structure?

  • Let me propose that if we want to define our own data type

  • called the stack that implements that idea of cafeteria trays or frames

  • in memory, let me go ahead and typedef a structure called stack inside

  • of which are two data members.

  • Suppose that, for the sake of discussion,

  • I'm not going to try to store trays or something that doesn't really

  • exist computationally but rather numbers, just integers.

  • Inside of this structure called a stack is

  • going to be an array called numbers of type int,

  • and it's going to be of size CAPACITY.

  • So capacity, let's assume, is hash defined elsewhere to be some constant.

  • So, maybe it's 10, maybe it's 1,000, but it's some constant integer elsewhere

  • that limits, ultimately, the capacity of the stack to some integral value.

  • And then size-- this seems weird.

  • I have capacity here and then size here.

  • Well, there's the semantic distinction here.

  • Just because you have a stack ready to go,

  • as I did a moment ago-- just because I had this stack ready to go,

  • empty initially, it's going to have some capacity.

  • Realistically, I can only go as high as the ceiling

  • or until the things fall over.

  • But there's also a size here, and the size is currently zero,

  • but the capacity might be like 1,000 or whatever.

  • So, that's the difference there and the size now is 4

  • and the capacity is like 1,000 minus 4 at this point--

  • or rather, capacity is still 1,000, because that's the total possible size,

  • not the actual size.

  • But what if that's a limit?

  • What if I don't want to restrict myself to some fixed, finite number of trays

  • or a fixed number of numbers?

  • Well, I could instead declare my stack as being a pointer.

  • Now, this pointer initially has no value,

  • so let's assume that it's probably going to be initialized to null in my code,

  • but that too is not going to be useful.

  • Why would I declare numbers now not to be in an array, which

  • felt very straightforward.

  • Normally we draw arrays left to right, but with the trays,

  • just imagine turning an array vertically and thinking of the number stacked

  • on top of each other.

  • But this is just a pointer to a number.

  • Why might I want to implement a stack as just a pointer to an int?

  • That seems wrong.

  • I want lots of numbers, not one number.

  • So, what could I do?

  • Well, what if in this world to implement a stack I invoke our friend malloc

  • and I say to malloc, malloc, give me enough memory

  • for 2,000 numbers or 5,000 numbers.

  • What is malloc going to return?

  • Well, by definition, we know malloc is going

  • to return the address of a chunk of memory, and that chunk of memory

  • is going to be of whatever size I ask malloc for,

  • and the address of the first is really just equivalent to the address

  • of one integer.

  • And so long as I, the programmer, remember that I asked malloc

  • for 2,000 integers or for 5,000 integers,

  • I know implicitly the end of that chunk of memory; malloc

  • just needs to tell me the beginning.

  • So, it's perfectly fine to implement the stack by way of a single pointer,

  • because all I need to know is, hey, malloc,

  • where should I put my first integer?

  • Because I know via pointer arithmetic, per last week,

  • that I can put my next integer four bytes later,

  • four bytes later, four bytes later.

  • And I'm deliberately going up this time, but it really

  • is just an array where you can think of the array as left and right.

  • So, this would be a way of giving ourselves a data structure called

  • a stack that is not fixed from the outset like this previous version

  • to some specific capacity.

  • Now, we are limited only by how much physical memory or virtual memory

  • my computer actually has.

  • So, suppose Apple or someone similar implemented the lines

  • outside their stores for the release of the iPhone as a stack.

  • So, it's weird maybe to think of people stacking on top of each other,

  • but maybe you could imagine Apple funneling everyone into the glass store

  • here in Manhattan, and then whoever is the last one in gets their phone first.

  • Because why?

  • They're closest to the exit.

  • So, you have all these people show up super early in the morning or days

  • before, you pile them all into the store saying everyone, hey,

  • please go into the corner there.

  • Please get into the store.

  • And then as soon as 9:00 AM rolls around and it's

  • time to give out the iPhones, just for logistical convenience you realize,

  • all right, why don't we just give the person who came in

  • last their phone first because they're closest to the exit

  • and get them out, last in, first out?

  • Good design, bad design?

  • It's correct in so far as everyone's going

  • to get an iPhone if supply is there, and that's never going to be the case.

  • So, it's not necessarily very equitable or fair,

  • and indeed the humans are not going to be very pleased with Apple

  • if they used a LIFO data structure or a stack.

  • What would these fans of Apple hardware prefer that Apple use?

  • We call it a line.

  • If you go to the UK, they call it a queue,

  • which is actually a perfect answer, because there's this other data

  • structure in the world called a queue, which

  • is exactly what you would hope the Apple store line would be,

  • a line whereby it's first in, first out.

  • So, the first person there three days before,

  • at 5:00 AM gets his or her phone first, and the one person

  • who comes in at 9:01 AM doesn't get their phone

  • because they're at the last position in the queue or the list.

  • And a queue, nicely enough, might just have at least two operations--

  • enqueue and dequeue, whereby enqueue means get

  • into line and dequeue means get out of the line,

  • but these happen at different places.

  • For instance, if there's a whole bunch of people lined up here on the stage,

  • closest over there's the front of the list.

  • I get here last.

  • I enqueue myself at the end of this data structure,

  • but you dequeue someone from the beginning of the list.

  • By contrast, when we had a stack, when you push someone onto the stack,

  • you pop it off, or him or her off first by nature

  • of it being a LIFO data structure.

  • So, how might we implement a queue?

  • It's actually slightly more complicated-- 50% more

  • pieces of information to keep track of, namely the front of the list.

  • But you can still do it in an array.

  • So, suppose that we do use an array, and let me go ahead and draw this

  • as follows.

  • Suppose that like hopscotch we draw the queue for an Apple Store

  • like an array like this.

  • And here is the door of the Apple store, so you want to be at location zero,

  • ideally.

  • 1, 2, 3, 4, 5, 6-- so this is how many people

  • can fit into our queue in this case.

  • So, suppose that Alice wants to buy an iPhone and she gets to the store first.

  • Where should she go to keep things fair?

  • This is the queue, so we don't want to put her into the corner,

  • so to speak, in our first example.

  • We want to put her at the front of the list.

  • So, Alice belongs right there, pretty straightforward.

  • Now, Bob arrives and he comes in slightly after Alice,

  • so he gets to get behind Alice in line.

  • And so Bob is there, and maybe Charlie arrives thereafter, and then so forth.

  • David maybe comes in fourth and beyond.

  • So, that's how people would queue up, so to speak.

  • And now, when it's time for Apple to open this door

  • and start selling iPhones, what happens?

  • We want to take Alice out of the list first.

  • We want to de-queue Alice.

  • So, we need to start remembering some information,

  • because it's not sufficient now to just remove-- whoops,

  • it's not sufficient just to remove Alice like this,

  • because suppose that we do keep adding other people, person

  • d, e-- whoops, that's not an f.

  • OK, don't know what happened there.

  • Person f is here, g, h.

  • Suppose that Alice has bought her phone and left the store,

  • and then Bob does the same.

  • He goes ahead and leaves the store, and then Charlie leaves the store.

  • Where do I put person i who maybe shows up a little late?

  • It would seem that I want to put them at the end of the queue, which

  • makes good sense, but right now d, e, f, g, and h are still in the queue,

  • and this is an array I proposed.

  • My data structure's an array, so I can't just move d to the front of the line

  • easily.

  • I have to actually shift him or move him,

  • and this might conjure up some memory of our searching and sorting examples

  • where when we had our humans on stage, we actually

  • had to physically move people like an insertion sort

  • to make room for those elements.

  • And that's fine.

  • We can absolutely say, hey, David, come over here please and person e,

  • come over here please, and that's obviously what the Apple store does.

  • But when you're implementing this idea in memory,

  • you can't just ask the numbers themselves or the elements themselves

  • to do that moving.

  • You need to do it for them, and that's going to cost you time,

  • and that's going to be a price that you have to pay.

  • But I propose that we can be clever here.

  • We do not need to incur the cost of moving d, e, f, g h where Alice, Bob,

  • and Charlie previously were.

  • Where can we put person i instead?

  • I mean, there's obviously room in the line,

  • so maybe why don't we just put person i here?

  • But, again, we don't want to piss off everyone who's already in the line.

  • So, if this now is an array, we have to be mindful of the fact

  • that the front of this list has to be remembered separately.

  • This data member here, front, really should store not 0 in perpetuity

  • but, over time, 0, then 1, then 2, then 3.

  • It should store the current front of the list.

  • And I need another variable, presumably, called

  • size that keeps track of how many elements are in the list, which

  • in this case is going to be six.

  • So with a queue, if I'm implementing it using an array,

  • there's some added complexity if I want to avoid

  • the inefficiency of moving all of these elements

  • and incurring the kind of running times we saw when we talked about searching

  • and sorting the other day.

  • There's no reason mechanically to move d, e, f, g, h anywhere in the array.

  • We can in constant time, and maybe our old friend the modulo operator

  • that you might have used in [INAUDIBLE], we

  • can just figure out where i, and j, and everyone else

  • should go so long as we keep track separately

  • with a queue of what the front of the list would be.

  • And why is this 3?

  • Well, if I continue numbering the array like this, as we've often done,

  • you can now see that d is the head of the list, or the front of the list.

  • And so, we should remember his location there as 3.

  • But, again, what happens if j, k, and then l shows up?

  • There is no room for l in this world, not to mention m, n, o.

  • So what if we solved this problem as before

  • by changing the array from having some fixed capacity to having

  • no pre-determined capacity, just use a pointer so that we

  • can use malloc to dynamically allocate a big chunk of memory,

  • remember its capacity ultimately, but also

  • remember the front and the size of this data structure?

  • So, the same idea there might apply.

  • So, at the end of the day, what have we done?

  • We've taken these new building blocks, pointers,

  • and this notion of linking things together using pointers

  • much like a linked list, and we've looked back

  • at our data structure called an array.

  • And using these now, we can implement what are generally called abstract data

  • types. A queue and a stack don't have as low-level a

  • meaning as an array does-- an array is a very technical concept.

  • And a linked list, likewise, is a very technical concept.

  • It's a node with pointers linking things together.

  • A stack is like a stack of cafeteria trays,

  • or a queue is something like people lining up outside of an Apple Store.

  • These are abstract data types that can be implemented clearly

  • underneath the hood in at least a couple of different ways.

  • You can use an array and keep things simple, a la

  • weeks two and beyond in the class, but you're

  • going to paint yourself into a corner by fixing their size,

  • as I did a moment ago, by declaring this queue

  • and before it a stack to be of a fixed capacity.

  • But now that we have pointers, and malloc, and dynamic memory allocation,

  • and this spirit of linked lists, we can change

  • that capacity to not be fixed and just

  • remember where things are underneath the hood.

  • And nicely enough, a stack and a queue don't even

  • need me to stitch things together like a linked list does.

  • I just need malloc in order to allocate those.
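  • For instance, a stack needs little more than a malloc'd (and occasionally

  • realloc'd) array-- a minimal sketch, with illustrative names:

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct
{
    int *values;   // dynamically allocated storage
    int capacity;  // slots currently allocated
    int size;      // number of elements on the stack
}
stack;

// Push a value on top of the stack, growing the array if necessary
bool push(stack *s, int value)
{
    if (s->size == s->capacity)
    {
        int new_capacity = (s->capacity == 0) ? 1 : 2 * s->capacity;
        int *tmp = realloc(s->values, new_capacity * sizeof(int));
        if (tmp == NULL)
        {
            return false;  // out of memory; stack left unchanged
        }
        s->values = tmp;
        s->capacity = new_capacity;
    }
    s->values[s->size++] = value;
    return true;
}

// Pop the most recently pushed value (assumes size > 0)
int pop(stack *s)
{
    return s->values[--s->size];
}
```

  • Because realloc grows the array on demand, this stack has

  • no pre-determined capacity, unlike the fixed-size version from before.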

  • So, let's tie these two topics together, if you will,

  • and compare and contrast them by way of a wonderful animation

  • that a fellow educator made and posted online

  • that we've abbreviated here to give us a sense of the difference between stacks

  • and queues.

  • [AUDIO PLAYBACK]

  • [MUSIC PLAYING]

  • -Once upon a time, there was a guy named Jack.

  • When it came to making friends, Jack did not have the knack.

  • So, Jack went to talk to the most popular guy he knew.

  • He went to Lou and asked what do I do?

  • Lou saw that his friend was really distressed.

  • Well, Lou began, just look how you're dressed.

  • Don't you have any clothes with a different look?

  • Yes, said Jack.

  • I sure do.

  • Come to my house and I'll show them to you.

  • So, they went off to Jack's, and Jack showed Lou the box

  • where he kept all his shirts, and his pants, and his socks.

  • Lou said, I see you have all your clothes in a pile.

  • Why don't you wear some others once in a while?

  • Jack said, well, when I remove clothes and socks,

  • I wash them and put them away in the box, then comes the next morning

  • and up I hop.

  • I go to the box and get my clothes off the top.

  • Lou quickly realized the problem with Jack.

  • He kept clothes, CDs, and books in a stack.

  • When he reached something to read or to wear

  • he chose the top book or underwear.

  • Then when he was done he would put it right back,

  • back it would go on top of the stack.

  • I know the solution, said a triumphant Lou.

  • You need to learn to start using a queue.

  • Lou took Jack's clothes and hung them in a closet,

  • and when he had emptied the box he just tossed it.

  • Then he said, now, Jack, at the end of the day put your clothes

  • on the left when you put them away.

  • Then tomorrow morning when you see the sunshine,

  • get your clothes from right from the end of the line.

  • Don't you see, said Lou.

  • It will be so nice.

  • You'll wear everything once before you wear something twice.

  • And with everything in queues in his closet and shelf,

  • Jack started to feel quite sure of himself all thanks

  • to Lou and his wonderful queue.

  • [END PLAYBACK]

  • SPEAKER 1: All right, so let's take a look at another data type,

  • this one known as a tree.

  • Because now that we have the ability to stitch data structures together much

  • like a linked list, we now have the ability

  • to stitch things together not just left to right or top to bottom conceptually,

  • but in any number of directions.

  • And indeed, there's nothing stopping us from having

  • one node linked to by way of multiple pointers, multiple nodes.

  • So, for instance, this picture here from a textbook is a tree structure.

  • And it's very much like the family trees that you might

  • have drawn in grade school or the like.

  • But in this case, you have just one root node,

  • the node at the top of the data structure, so to speak,

  • from which everything else descends.

  • And that node is said to have children.

  • For instance, 2 and 3 are children of the node number 1 here.

  • And then there's other semantics in this world of trees in computer science.

  • Much like family trees, anything that does not

  • have children-- like 5, 6, and 7, or 8 and 9--

  • would be called leaves of the tree, because like the leaves

  • at the end of the branches, there is nothing beyond them.

  • So, nicely enough we borrow a lot of the language

  • from family trees and actual trees in order

  • to discuss this data structure known as a tree.

  • But why in the world would we want to lay out data in a tree structure?

  • At first glance, we might just seem

  • to be doing things because we can.

  • Because, for instance, suppose we had these numbers-- 22, 33, 44, 55, 66, 77,

  • and 88.

  • They're clearly sorted.

  • And suppose that I wanted to lay these out in a data structure

  • and be able to search them efficiently, assuming the whole time

  • that they are indeed sorted.

  • Well, if we wanted to do that, we have our old friend arrays from weeks ago.

  • And we also have our old algorithm from Mike Smith, our binary search

  • algorithm, divide and conquer.

  • And we can find nodes in this data structure super,

  • super fast in logarithmic time, big O of log n.

  • So, we've solved that problem.

  • But it turns out we don't necessarily have to use an array laying out data

  • from left to right because, again, one of the prices we pay for using arrays,

  • as we've realized today, is this finiteness.

  • At the end of the day, the size of an array is fixed.

  • You have to decide in advance how big your array is going to be.

  • So, what if you want to add more numbers to it?

  • What if you want to remove numbers for efficiency

  • and not waste so much memory?

  • You can't really do that with an array.

  • You can, but you have to jump through some hoops.

  • You have to reallocate the array, as with a function like realloc,

  • if you indeed used malloc in the first place to allocate it.

  • But then you have to copy the old array into the new array,

  • so it's all possible.

  • Nothing's impossible once you have a keyboard at your disposal,

  • but it's a lot of work, and it's more time,

  • and it's therefore expensive both in terms of your time

  • and the computer's time.

  • But could we achieve the beauty of divide

  • and conquer and binary search from week zero without the constraints

  • that arrays impose?

  • And today, the solution to all of our array problems

  • seems to be linked lists or more generally pointers so

  • that, one, we can dynamically allocate more memory with malloc when we need it

  • and then use pointers to thread or stitch together

  • that node with any existing nodes.

  • So, indeed let me propose this variant on a tree structure

  • that the world calls binary search trees or BSTs.

  • Binary in this case means two, and this just means that every node in this tree

  • is going to have 0, 1, or at most 2 children.

  • And now, in this case binary search tree means that for every node in the tree

  • its left child is less than it and its right child is greater than it.

  • And that's a recursive definition.

  • You can look at the root of this tree and ask that same question.

  • 55?

  • Is it greater than its left child?

  • Yep.

  • Is it less than its right child?

  • Yep.

  • That is the beginning, it would seem, of a binary search tree.

  • But it's recursive insofar as this is indeed a binary search

  • tree only if that statement is true and those answers are

  • the same for every other node in the tree.

  • 33, is its left child smaller?

  • Yep.

  • Is its right child bigger?

  • Yep.

  • How about over here, 77?

  • Left child smaller?

  • Yep.

  • Right child bigger?

  • Yep, indeed.

  • How about the leaves of the tree?

  • Is 22 greater than its left child?

  • I mean, yeah, there is no child, so yes, that's a fair statement.

  • It certainly doesn't violate our guiding principle.

  • Is it less than its right child, if any?

  • Yes, there just isn't any.

  • And so this is a binary search tree.

  • And indeed, if you took a scissors and snipped off any branch of this tree,

  • you would have another binary search tree, albeit smaller.

  • But it's recursive and that definition applies to every one of the nodes.

  • But what's beautiful here now is that if we implement this binary search

  • tree, similar in spirit to how we implemented linked lists

  • using not arrays but using pointers and not one pointer but two pointers

  • whereby every node in this tree apparently has up to two pointers--

  • let's call them not next but how about left and right just to be intuitive.

  • Well, if every node has a left and a right pointer,

  • now you can conceptually attach yourself to another node over there

  • and another node over there, and they too can do the same.

  • So, we have the syntax already with our pointers with which to implement this.

  • But why would we?

  • Well, one, if we're using pointers now and not an array,

  • I can very, very easily allocate more nodes for this tree.

  • I can insert 99 or 11 really easily, because I just

  • called malloc like I did before.

  • I put the number 99 or 11 inside of that node,

  • and then I start from the root of the tree,

  • much like I start from the first element in the linked list,

  • and I just search for its destined location going left or right

  • based on the size of that value.

  • And what's nice, too, here is notice how short the tree is.

  • This is not a linked list.

  • It's not a long list, whether vertically or horizontally.

  • This tree is very shallow.

  • And indeed, I claim that if we've got n elements in this tree,

  • its height, it turns out, is log of n.

  • So, the height of this tree is log of n, give or take one or so.

  • But that's compelling, because how do I search this tree?

  • Suppose I am asked-- I'm trying to answer the question is 44 on my list?

  • How do I answer that?

  • Well, we humans can obviously just look back and it's like, yes, 44 is in it.

  • It's not how a computer works.

  • We have to start from what we're given, which in this case

  • is going to be, as the arrow suggests, a pointer to the tree

  • itself, a pointer to its first node.

  • And I look is this the number 44?

  • Obviously not.

  • 55 is greater than 44, so I'm going to go down to the left child

  • and ask that same question.

  • 33, is this 44?

  • Obviously not, but it's less than it so I'm

  • going to go down to the right child.

  • Is this 44?

  • Yes, and simply by looking at three nodes

  • have I whittled this problem down to my yes-no answer.

  • And indeed, you can think of it again with scissors.

  • I'm looking at 55 at the beginning of this story.

  • Is 44 55?

  • No, 44 is less.

  • Well, you know what?

  • I can effectively snip off the right half of that tree,

  • much like I tore that phone book in week zero, throwing half of the problem

  • away.

  • Here I can throw essentially half of the tree away and search only what remains

  • and then repeat that process again, and again, and again, whittling

  • the tree down by half every time.

  • So therein lies our logarithmic running time.

  • Therein lies the height of the tree, so long

  • as I am good about keeping the tree balanced.

  • There's a danger.

  • Suppose that I go ahead and start building this tree myself in code

  • and I'm a little sloppy about doing that.

  • And I go ahead and I insert, for instance, let's say the number 33.

  • And it's the first node in my tree, so I'm

  • going to put it right up here at the top.

  • And now suppose that the next number that just happens

  • to get inserted into this tree is 44.

  • Well, where does it go?

  • Well, it has no children yet, but it is bigger,

  • so it should probably go over here.

  • So, yeah, I'll draw 44 there.

  • Now, suppose that the inputs to this problem

  • are such that 55 is inserted next.

  • Where does it go?

  • All right, 55, it's bigger, so it should go over here.

  • And then 66 is inserted next.

  • All right, it goes over here-- never mind that.

  • So, what's happening to my binary search tree?

  • Well, first of all, is it a binary search tree?

  • It is because this node is bigger than its left child,

  • if any-- there just isn't any-- and it's less than its right child.

  • How about here, 44?

  • It's bigger than its left child, if any-- because there is none--

  • and it's smaller than its right child.

  • The same thing is true for 55, the same thing is true for 66.

  • So, this is a binary search tree and yet somehow what does it look like?

  • It looks like a linked list, right?

  • It's at a weird angle.

  • I've been drawing everything horizontally,

  • but that's a meaningless artistic detail.

  • It devolves potentially into a linked list.

  • And so, binary search trees if they are balanced, so to speak,

  • if they are built in the right order or built with the right insertion

  • algorithm such that they do have this balanced height,

  • this logarithmic height, do afford us the same logarithmic running time

  • that the phone book example did and our binary search of an array did.

  • But we have to do a little bit more work in order

  • to make sure that these trees are balanced.

  • And we won't go into detail as to the algorithmics

  • of keeping the tree balanced.

  • But realize, again, there's going to be this trade-off.

  • Yes, you can use a binary search tree or trees more generally to store numbers.

  • Yes, they can allow you to achieve that same binary or logarithmic running time

  • that we've gotten so used to with arrays,

  • but they also give us dynamism such that we

  • can keep adding or even removing nodes.

  • But, but, but, but it turns out we're going

  • to have to think a lot harder about how to keep these things balanced.

  • And indeed, it's in higher-level CS courses, courses

  • on data structures and algorithms, that you'll

  • explore concepts along exactly those lines.

  • How would you go about implementing insert and delete into a tree

  • so that you do maintain this balance?

  • And there are yet more variants of these kinds of trees

  • that you'll encounter accordingly.

  • But for our purposes, let's consider how you would implement the tree itself

  • independent of how you might implement those actual algorithms.

  • Let me propose this type of node.

  • Again, notice just the very generic term in programming

  • where it's usually like a container for one or more other things, and this time

  • those things are an integer-- we'll call it n

  • but it could be called anything-- and two pointers.

  • And instead of next, I'm going to just by convention call them left and right.

  • And as before, notice that I do need to declare struct node up

  • here or some word up here.

  • But by convention I'm just going to do typedef struct node, because C reads

  • things top to bottom, left to right.

  • So if I want to refer to a node inside of a node,

  • I need to have that vocabulary, per this first line, even though later on I

  • just want to call this whole darn thing a node.

  • And so, that's the distinction.

  • This actually has the side effect of creating a data

  • type with two different names.

  • One is called struct node, and you can literally in your code

  • write struct node something, struct node something.

  • It just feels unnecessarily verbose, so typedef

  • allows you to simplify this as just node, which

  • refers to the same structure.

  • But this is necessary for this innermost implementation detail.
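  • In code, that declaration looks like this:

```c
// A node containing an integer and pointers to up to two children.
// "struct node" must be named up top so the struct can refer to
// itself; typedef then lets the rest of the code say just "node".
typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
}
node;
```

  • The names n, left, and right are just conventions; only the shape

  • of the structure-- one value, two pointers-- actually matters.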

  • So, now that we have the ability with a structure

  • to represent this thing, what can we actually do with it?

  • Well, here is where recursion from a few weeks ago

  • actually gets really compelling.

  • When we introduced that sigma example a while ago and talked in the abstract

  • about recursion, frankly, it's kind of hard to justify it early on, unless you

  • actually have a problem that lends itself to recursion in a way that

  • makes sense to use recursion and not just iteration,

  • loops-- for loops, while loops, do while, and the like.

  • And here we actually have a perfect incarnation of that.

  • What does it mean to search a binary search tree?

  • Well, suppose I'm searching for a number n

  • and I'm being given a pointer to the root of the tree,

  • and I'll call it tree.

  • So, just like when I was searching a linked list,

  • I'm given two things, the number I'm looking for

  • and a pointer to the first thing in the data structure-- the first thing

  • in a linked list or the first thing in a tree.

  • And in this case, we would call that first thing

  • in a tree a root, generally speaking.

  • So, the first thing I had better do in my search function

  • is check, wait a minute.

  • If tree equals equals null, don't do anything.

  • Do not risk touching any pointers, because as you may have gleaned already

  • or soon will with some of CS50's problems,

  • you will cause quite probably a segmentation fault,

  • a memory-related problem in your program such that it just crashes.

  • It literally says segmentation fault on this screen

  • if you touch memory that you should not.

  • And you should not touch null values.

  • You should not go to null values.

  • You should not do star of any value that itself might be null.

  • And so, if tree equals equals null is super, super important here,

  • because I want to make sure to immediately

  • say, well, if you hand me null, that's like handing me no tree whatsoever.

  • So, my answer is obviously false.

  • N can't be in a non-existent tree.

  • But we need that condition up top, because the next case

  • is the following.

  • Else if n-- the value we're looking for-- is less

  • than the value of n in this node-- tree, recall,

  • doesn't refer to the whole thing, per se, in this context.

  • It refers to the current node that we've been

  • passed, which at the beginning of the story is the root of the tree.

  • So, if the number n in the root of the tree is greater than the number

  • we're looking for, we want to go to the left.

  • Else we want to go to the right and search the right subtree.

  • So, what's the syntax here?

  • If the n we're looking for, like 44, is less

  • than the value at the current node, 55, then what do we want to do?

  • We want to call search, still searching for the same number n

  • but searching on the left subtree.

  • And how do you pass in a pointer to the left tree?

  • Well, you have in tree a pointer to the root node.

  • Tree arrow left just means go to my left child and pass that value in

  • instead, pass its address in instead.

  • Meanwhile, if the number you're looking for

  • is actually greater than the value in the current node, search

  • the right subtree, else return true.

  • Because if the tree is not null-- if there is actually a tree and the number

  • you're looking for is not less than the current node

  • and it's not greater than the current node, it must be the current node,

  • so you can return true.

  • But there's one important detail here.

  • I didn't just call search.

  • I called return search in each of these two middle cases.

  • Why is that?

  • Well, this is where recursion gets potentially a little mind bending.

  • Recursion is the act of a function calling itself.

  • Now, in and of itself, that sounds bad, because if a function calls itself,

  • why wouldn't it call itself again, and again, and again, and again, and again,

  • and just do this infinitely many times such

  • that now you get a stack overflow where all of those frames on the stack

  • hit the heap and bad things happen.

  • But no, recursion works beautifully so long as every time you recurse,

  • every time a function calls itself, it takes a smaller bite of the problem.

  • Or rather, put another way, it throws away

  • half of the problem, as in this case, and looks only at a remaining half.

  • Because if you keep shrinking, shrinking, shrinking, shrinking

  • the problem, you will eventually hit this base case

  • where either there is no more tree or you're looking right at the node

  • that you want to find.

  • And so, by returning search on tree arrow left,

  • or returning search on tree arrow right, you're deferring the answer.

  • When you, the search function, are called and asked

  • is the number 44 in this tree, you might not

  • know because the node you're looking at at the beginning of the story

  • was again 55.

  • But you know who does know?

  • I bet my left child will know the answer to that if I just

  • ask it by passing it-- passing to search a pointer to it, my left child,

  • and passing in that same number 44.

  • So, saying return search is like saying I don't know.

  • Ask my left child.

  • Or I don't know, ask my right child and let me return as my answer

  • whatever my child's answer is instead.

  • So, you could do this same function using iteration.

  • But you could solve it arguably much more elegantly here using recursion,

  • because a data structure like this-- like a binary search tree,

  • which again is recursively defined, where each node is conceptually

  • identical, if numerically different from the others--

  • allows us to apply this algorithm, this recursive algorithm,

  • to that particular data structure.
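  • Putting it together, the recursive search just described might look

  • like this in C, reusing the node type from before:

```c
#include <stdbool.h>
#include <stdlib.h>

// The node type from before: one value, two child pointers
typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
}
node;

// Returns true if n is somewhere in the tree rooted at tree
bool search(node *tree, int n)
{
    if (tree == NULL)
    {
        return false;                   // base case: no tree at all
    }
    else if (n < tree->n)
    {
        return search(tree->left, n);   // defer to the left subtree
    }
    else if (n > tree->n)
    {
        return search(tree->right, n);  // defer to the right subtree
    }
    else
    {
        return true;                    // it must be the current node
    }
}
```

  • Each recursive call throws away half of the remaining tree,

  • which is where the logarithmic running time comes from.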

  • Now, let's look at a more concrete incarnation

  • of trees that allows us to do something pretty neat and pretty real world.

  • Indeed, this is another problem borne of a real world domain of compression.

  • We talked a couple weeks ago about encryption,

  • the art of concealing or scrambling information.

  • Compression, meanwhile, is the art of taking something that's this big

  • and compressing it to make it smaller, ideally without losing any information.

  • It's pretty easy to take a 10 page essay that's

  • maybe-- that was supposed to be a five page essay

  • and just remove paragraphs from it or remove sentences from it.

  • But that changes the meaning of the paper, makes it a worse paper,

  • even though you're compressing it by making it smaller.

  • No, what most students would typically do, if you've written 10 pages

  • and it needs to fit into five, you really, really, really

  • shrink the font size or decrease the margins.

  • Or maybe more realistically you write a five page paper that's

  • supposed to be a 10 page paper, and so you increase the font size

  • or increase the margins so as to expand or decompress the essay.

  • So, similarly here, what if we wanted to compress text,

  • but we want to do it losslessly in a way that we

  • don't lose any information by just throwing away

  • characters, or paragraphs, or pages, but we

  • want to use the system with which we're familiar from week zero.

  • So ASCII, again, is just this code, this mapping of letters to numbers.

  • And so, A is-- capital A is 65 and that's some pattern of bits,

  • but it's some pattern of 8 bits-- 7 historically,

  • but really 8 bits in practice So every one

  • of the characters in the English alphabet, at least here,

  • takes up 8 bits.

  • Now, that sounds fine.

  • That allows us to express as many as 256 possible characters, which

  • is more than enough for English characters, plus some punctuation

  • and so forth.

  • But it seems wasteful.

  • I type A, E, and I, maybe O and U pretty often.

  • I use the vowels often.

  • B and D, I feel like I use those a lot.

  • I don't really type Q all that much, Z all that much.

  • So, there are certain letters that I just

  • feel like I don't type them that often, and indeed,

  • probably if we analyzed a dictionary, we wouldn't see them as frequently

  • as other letters.

  • Indeed, if you've ever played or watched Wheel of Fortune,

  • certainly all the contestants on that show

  • know which are the most popular letters in English words.

  • And it seems silly and perhaps inefficient--

  • certainly for a computer scientist-- that we are not somehow

  • embracing the fact that some letters are more commonly used than others,

  • and yet we are just blindly using 8 bits, the same amount of memory,

  • for every darn letter in our alphabet.

  • Why?

  • If you keep writing a certain letter again and again,

  • why not use fewer bits for the more popular letters,

  • and more bits for the less popular letters

  • so that at least you're optimizing for the common case, so to speak?

  • Well, it turns out that someone named Huffman years ago did

  • figure this out and introduced what's generally known as Huffman coding.

  • And, at first glance, it's a little similar in spirit to something

  • some of you might have grown up learning a little something about called Morse

  • code, but it's better in a couple of ways.

  • Morse code typically transmitted with electrical signals or audible signals.

  • It has dots and dashes where a dot is a quick beep

  • and a dash is a slightly longer beep, and you

  • can use those series of dots and dashes, as per this chart here,

  • to represent letters of the alphabet and some numbers.

  • The one problem, though, as efficient as this seems-- and by efficient,

  • I mean look at E. Mr. Morse realized that E is super popular, so he

  • used literally the shortest symbol for it, just a dot, a simple blip,

  • to represent an E. And, meanwhile, as I kind of imagined,

  • Z is not that common, so dash, dash, dot,

  • dot is longer than just a single dot.

  • So Z is probably less popular, and that's why Morse did this.

  • And Y may be even less popular-- dash, dot, dash, dash--

  • I don't know why I'm using this voice.

  • But it's longer than E, so we optimized for the shorter characters.

  • Unfortunately, suppose that you receive the message dot, dot, dot, dot, dot,

  • dot, so six dots in a row, and I technically paused in between them.

  • Six dots, what message did I just send you?

  • Six dots.

  • So, I wanted to say hi, so I said dot, dot, dot, dot, which is H,

  • and then dot, dot which is I. I should not have paused between them,

  • because the whole point of Morse code is to do this as quickly as possible,

  • even though you probably do want to pause to resolve ambiguity,

  • and indeed, that's the problem.

  • I wanted to send you hi, H-I, but maybe I

  • just sent you E, E, E, E, E, E, six Es in a row,

  • because those two were just dots.

  • So, in other words, Morse code is not immediately decodable

  • when you're reading, or hearing, or seeing the dots and dashes come

  • over the wire, so to speak, because there's these ambiguities,

  • unless this transmitter does indeed pause,

  • as I accidentally did there, to give you a moment to take your breath

  • and realize, oh, that was an H. That's an I. As opposed to E, E, E, E, E, E.

  • So, it's not necessarily the best system in so far as some letters

  • share prefixes with other letters.

  • In other words, I, dot dot, has a common prefix with E. Both of them

  • start with a single dot.

  • It just so happens that I is a little longer,

  • and that can lead potentially to ambiguity,

  • and it certainly means that the transmitter should probably slow down.

  • So, the whole system is meant to be super fast, super efficient,

  • but you probably should pause between certain letters

  • so that the recipient doesn't get confused as

  • to the message you're actually sending.

  • Well, thankfully Huffman coding-- which as we'll see in a moment

  • is based on trees-- does not have that ambiguity.

  • It is immediately decodable.

  • And suppose for the sake of discussion, as per this example here,

  • you just have a whole bunch of text that you want to transmit.

  • This is meaningless.

  • There's no pattern in these As, Bs, Cs, Ds, and Es,

  • but if you go through and count them up, each these letters-- A, B, C, D, E--

  • occur with some frequency in this text.

  • So, it's meant to be representative of an essay, or a message,

  • or whatever that you want to send to someone.

  • Indeed, if you count up all of the As, Bs, Cs, Ds, and Es, and divide

  • by the total number of letters, it turns out

  • that 20% of the characters in that random string are As, 10% are Bs,

  • 10% are Cs, 15% are Ds, and 45% are Es, so

  • it's roughly consistent with what I'm claiming,

  • which is that E is pretty popular.

  • So, intuitively, it would be really nice

  • if I had an algorithm that came up with some representation of bits

  • that's not just 8 bits for every darn letter

  • but that is a few bits for the popular letters

  • and more bits for the less popular letters,

  • so I optimize, again, so to speak, for the common case.

  • So, by this logic E, hopefully, should have a pretty short encoding

  • in binary, and A, and B, and C, and D should have slightly longer encoding,

  • so that again if I'm using E a lot I want to send as few bits as possible.

  • But I need this algorithm to be repeatable.

  • I don't want to just arbitrarily come up with something

  • and then have to tell you in advance that, hey, we're using this David Malan

  • system for binary.

  • We want an algorithmic process here.

  • And what's nice about trees is that it's one way of seeing and solving

  • exactly that.

  • So, Huffman proposed this.

  • If you have a forest of nodes, so to speak, a whole bunch of trees-- each

  • of size one, no children-- think of them as each having a weight or a frequency.

  • So, I've drawn five circles here, per this snippet

  • from a popular textbook that has 10%, 10%, 15%, 20%, 45% equivalently in each

  • of those nodes.

  • And I've just labeled the leaves as B, C, D, A, E,

  • deliberately from left to right because it will make my tree look prettier,

  • but technically the lines could cross and it's not a big deal in reality.

  • We just need to be consistent.

  • So, Huffman proposed this.

  • In order to figure out the so-called Huffman tree for this particular text,

  • in order to figure out what to encode its letters as with zeros and ones,

  • go ahead and take the two smallest nodes and combine them with a new root node.

  • So in other words, B and C were both 10%.

  • Those are the smallest nodes.

  • Let's go ahead and combine them with a new root node

  • and add together their weights, so 10% plus 10% is 20%.

  • And then arbitrarily, but consistently, label

  • the left child's edge or arrow as a 0 and the right child's edge as a 1.

  • Meanwhile, repeat.

  • So, now look for the two smallest nodes.

  • And I see a 20%-- so ignore the children now.

  • Only look at the roots of these trees.

  • And there's now four trees, one of which has children, three of which don't.

  • So, look at the smallest roots now and you can go left to right here.

  • There's a 20%, there's a 15%, there's a 20%, and a 45%.

  • So, I'm not sure which one to go with, so you just

  • have to come up with some rule to be consistent.

  • I'm going to go with the ones on the left,

  • and so I'm going to combine the 20% with the 15%, the 20% on the left.

  • Combine their weights.

  • That gives me 35% in a new root, and again label the left branch 0

  • and the right branch a 1.

  • Now, it's not ambiguous.

  • Let's combine 35% and 20% with a new root that's 55%.

  • Call its left branch 0, its right branch 1.

  • And now 55% and 45%, combine those, and that gives us 100%.

  • So why did I just do this?

  • Well now I have built up the so-called Huffman tree for this input text

  • and Huffman proposed the following.

  • To figure out what patterns of zeros and ones

  • to use to represent A, B, C, D, E, simply

  • follow the paths from the root to each of those leaves.

  • So what is the encoding for A?

  • Start at the root and then look for A-- 0, 1, so 0,

  • 1 shall be my binary encoding for A in this world.

  • How about B?

  • 0, 0, 0, 0 shall be my binary encoding for B. How about C?

  • 0, 0, 0, 1 shall be my encoding for C. How about D?

  • 0, 0, 1.

  • And beautifully, how about E?

  • 1.

  • So, to summarize, what has just happened?

  • E was the most popular letter, B and C, were the least popular letters.

  • And if we summarize these, you'll see that, indeed,

  • B and C got pretty long encodings, but E got the shortest encoding.

  • And it turns out mathematically you will now

  • have a system for encoding letters of the alphabet that is optimal,

  • that is you will use as few bits as possible

  • because you are biasing things toward short representations

  • for popular letters, longer representations for less

  • popular letters.

  • And mathematically this gives you the most efficient encoding

  • for the original text without losing any information.
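To make that concrete, here is a minimal sketch in C of encoding text with the table just derived. The function name and the array of code strings are my own illustration, not something from the lecture; the codes themselves are the paths read off the tree above.

```c
#include <string.h>

// Codes derived from the Huffman tree above (letters A-E only).
// Each string of '0'/'1' is the path from the root to that leaf.
static const char *codes[] = {
    ['A' - 'A'] = "01",
    ['B' - 'A'] = "0000",
    ['C' - 'A'] = "0001",
    ['D' - 'A'] = "001",
    ['E' - 'A'] = "1",
};

// Append the code for each letter of `text` (A-E only) into `out`,
// returning the total number of bits used.
size_t huffman_encode(const char *text, char *out)
{
    out[0] = '\0';
    for (size_t i = 0; text[i] != '\0'; i++)
    {
        strcat(out, codes[text[i] - 'A']);
    }
    return strlen(out);
}
```

Encoding "BEE" this way costs 4 + 1 + 1 = 6 bits, versus 24 bits in 8-bit ASCII, precisely because the popular E got the shortest code.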

  • In other words, now if Huffman wanted to send a secret message

  • to someone in class or over the internet, he and that recipient

  • simply have to agree on this scheme in advance

  • and then use these encodings to transmit those messages.

  • Because when someone receives 0, 1 or 0, 0, 0, 0 or 0, 0, 0, 1 from Mr. Huffman,

  • they can use that same look-up table, if you will,

  • and say, oh, he just sent me an A or, oh, he just sent me a B or C. So,

  • you have to know what tree Huffman built up.

  • And, indeed, what typically happens in actual computers is

  • when you use Huffman coding to compress some body of text

  • like we just have here, you store the compressed text

  • by storing your As, Bs, Cs, Ds, and Es and other letters

  • using these new encodings, but you somehow

  • have to embed in that file in the compressed file the tree itself

  • or this cheat sheet of encodings.

  • So, with compression-- maybe you're compressing a Microsoft Word

  • file, or a dot TXT file, or any other type of file,

  • you have to store not just the compressed text using these shorter

  • representations-- not 8-bit ASCII, but these shorter representations-- but you

  • also somewhere, maybe at the beginning of the file or at the end of the file,

  • somewhere where someone else can find it, you need to store this mapping

  • or you need to store the tree itself in some digital form.

  • And so, it's possible by this logic that you

  • might try to compress a really small file,

  • and that file could actually become bigger

  • because you're storing a tree inside the file

  • with which to recover the original information.

  • Or better yet, most algorithms or most actual

  • compression programs will realize, wait a minute, if compressing this file

  • is actually going to make it bigger, let's just not compress it at all

  • and leave it alone untouched.

  • So, what if you take a compressed file and compress it again, and compress it

  • again, and compress it again?

  • A dangerous assumption to get into is, well, I could just maybe keep

  • compressing that video file again, and again, and again, and again,

  • and I can maybe compress my big essay, or my big video file,

  • or big music file to just maybe one bit.

  • Right?

  • That's the logical extreme, just keep compressing,

  • compressing, compressing, compressing.

  • But, of course, that can't possibly make sense,

  • because if you compress some file down to just a single bit, 0 or 1,

  • you've clearly thrown away information and can't possibly recover it all.

  • So, at some point, you've hit this lower bound on the size of the file,

  • beyond which you'd need to start throwing actual information away.

  • At some point, the file just has so much entropy, appears to be so random,

  • there really is no pattern to start to leverage to compress it further.

  • And so, there generally is some maximum amount of compression

  • you can apply to something.

  • So, how would we represent this?

  • Let's whip out a C struct here.

  • So, this time each of the nodes in a Huffman tree

  • needs a little something different.

  • They need, at least in the leaves, some kind of character

  • to remember the symbol.

  • Now, technically only the leaves need to know what symbols they are,

  • so it's a little redundant to have this in every node,

  • but we can keep things simple and use the same type of node for everything.

  • Float frequency, I could use an integer and treat it exactly as a percentage,

  • or I can use a float as the nodes were with 0.1 and 0.45 and so forth,

  • and I'll call that frequency.

  • And then each of those nodes needs a left child potentially

  • and a right child potentially.

  • And, again, I'll call these things a node.

  • So, again, this node is getting a little more involved, but it still allows

  • me to represent it ultimately in C.
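As a rough sketch in C, here is that node type along with one round of the merge-the-two-smallest step described earlier. The helper names are my own, and the tie-breaking (and which child becomes the 0 branch) is arbitrary, so the exact shape can differ slightly from the walkthrough above while still being a valid Huffman tree.

```c
#include <stdlib.h>

typedef struct node
{
    char symbol;          // meaningful only in the leaves
    float frequency;
    struct node *left;
    struct node *right;
}
node;

// Make a leaf with no children.
node *make_leaf(char symbol, float frequency)
{
    node *n = malloc(sizeof(node));
    n->symbol = symbol;
    n->frequency = frequency;
    n->left = n->right = NULL;
    return n;
}

// One round of Huffman's algorithm: find the two roots with the smallest
// weights in forest[0 .. *count), and replace them with a new parent whose
// weight is their sum. Repeating until one root remains builds the tree.
void merge_smallest(node *forest[], int *count)
{
    int a = 0, b = 1; // indices of the smallest and second-smallest roots
    if (forest[b]->frequency < forest[a]->frequency)
    {
        int t = a; a = b; b = t;
    }
    for (int i = 2; i < *count; i++)
    {
        if (forest[i]->frequency < forest[a]->frequency)
        {
            b = a;
            a = i;
        }
        else if (forest[i]->frequency < forest[b]->frequency)
        {
            b = i;
        }
    }
    node *parent = malloc(sizeof(node));
    parent->symbol = '\0';
    parent->frequency = forest[a]->frequency + forest[b]->frequency;
    parent->left = forest[a];   // the 0 branch
    parent->right = forest[b];  // the 1 branch
    // Shrink the forest: parent replaces a; the last root fills b's slot.
    forest[a] = parent;
    forest[b] = forest[--(*count)];
}
```

Running this to completion on the five weights above yields a single root whose frequency is 100%, with E (45%) sitting just one edge below it, matching its one-bit code.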

  • And now, it's time to pursue lastly the holy grail of data structures,

  • if you will.

  • Thus far, we've been solving problems, creating new problems,

  • trying to solve those again.

  • And the problems we've been exploring this week are things like dynamism,

  • if we want to be able to grow or shrink our data structure.

  • Malloc and pointers give us that flexibility

  • but might cost us a bit more time, because we

  • have to keep things sorted differently or we

  • have to follow all of those pointers.

  • And so, a lot of the algorithms we've been discussing today

  • run in linear time-- searching, or inserting, or deleting,

  • as in a linked list.

  • Better still would be something logarithmic

  • like a balanced binary search tree, so still preserving that nice binary

  • aspect from week zero.

  • But the holy grail of a data structure for its operations

  • is Big O of 1 so to speak, constant time.

  • If you are searching, or inserting, or deleting, and somehow changing

  • a data structure, wouldn't it be amazing if every darn operation

  • takes just one step, or maybe two steps, or three

  • steps but a constant number of steps?

  • Now, it might be a little naive for us to expect

  • that we can store an arbitrary amount of data in some fancy way

  • that we get constant time, but maybe just maybe if we're

  • clever we can get close to that.

  • So, let's introduce a step toward that.

  • It turns out there exists in this world things called hash tables.

  • And a hash table can be implemented in any number of ways,

  • but you can think of it really as just an array.

  • So, for instance, this might be a way of representing a hash table called table,

  • whose first location is bracket zero and whose last location is bracket

  • n minus 1 for however long this is.

  • And I just left it as blanks.

  • I don't even know what this hash table might want to store.

  • It could be numbers, it could be names, it could be letters,

  • it could be anything we want.

  • But hash table has this nice theoretical property

  • that if well-designed and thought through,

  • you can maybe, just maybe, get constant lookup time in it.

  • And let's do a simple example of a hash table.

  • Hash tables are often nicely thought of as buckets,

  • so we borrowed these from the loading dock outside just a little moment ago,

  • and we've attached thanks to Arturo some of these signs to them.

  • This is going to be Z, so I'll just put this over here.

  • This is going to be C, so I'll put this over here, and B here, and A.

  • And we thought we might get chased away by the folks on the loading dock,

  • so we didn't bother getting D through Y, So we'll just pretend

  • that we have 26 such buckets here.

  • And suppose that the goal at hand is-- I don't know,

  • it's like at the end of an exam, so we've

  • got our old blue books that a class might use for students

  • writing essays in some class.

  • And it's time for the students to come submit their blue books.

  • Now, we could just collect them all and make a big mess as would generally

  • be the case, or we can be a little more methodical to at least make

  • our jobs easier.

  • Now, at the end of the day, what's going to be interesting about hash tables

  • is that there's going to be this distinction

  • between actual benefits and theoretical benefit, or lack thereof.

  • So, we'll come to that in just a moment, but here's A, B, C, D, and Z.

  • And you know what?

  • I just am going to ask the students in this class-- there are so

  • many people in the room after an exam, I just

  • want them to at least make my life 1/26 as difficult

  • by putting all the As over there, all the Bs here, all the Cs here,

  • all the Zs here, so that I don't have a massive mountain of As through Zs

  • that I have to sift through individually.

  • It would just be nice if they do the first pass

  • of bucketizing the values based on the first letter in their last name.

  • In other words, my hash function, my algorithm,

  • is going to be for each student to consider his or her last name,

  • look at the first letter therein, and put his or her exam

  • in the appropriate bucket.
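That "first letter of the last name" rule is itself a hash function, and in C it can be a one-liner. A sketch, with the function name my own:

```c
#include <ctype.h>

// Map a name to one of 26 buckets (0 through 25) based on its first
// letter, exactly the "first letter of the last name" rule above.
unsigned int hash(const char *name)
{
    return toupper((unsigned char) name[0]) - 'A';
}
```

So "Alice" hashes to bucket 0, "Zoe" to bucket 25, deterministically: the same input always lands in the same bucket.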

  • So, here is, for instance, someone with the letter

  • C. I'm going to put that blue book in here.

  • Here's someone with the letter A. That one's going to go here.

  • Letter Z?

  • This one's going to go over here.

  • Letter B?

  • This is going to go over here.

  • C, and B, and F-- Z, I mean, and all of the letters of the alphabet

  • in between.

  • So, hashing really has this visual and conceptual equivalence

  • of putting something in this bucket, putting something in that bucket,

  • putting something in this other bucket, ultimately

  • bucketizing all of your elements.

  • And you can think of this, frankly, as just an array,

  • but it's not just an array with one spot.

  • It looks like I can stack multiple numbers or multiple blue books inside

  • of that array.

  • So, we're going to have to come back to that, because this clearly

  • can't be an array.

  • Normally, a given spot in the array would be filled the moment you put one value in it.

  • But this hashing is the interesting part.

  • The juicy ingredient today is if I take into account as input what it is I'm

  • trying to store, use some piece of that information to decide where to put it,

  • that's an algorithm, because I can repeat that process,

  • so long as it's not random.

  • You go over here, you go over here.

  • That's amazing.

  • Wow, OK, pushing my luck.

  • OK, so I'm not just randomly putting things here.

  • I'm actually giving some thought as to where I'm putting things,

  • and that makes the algorithm deterministic, repeatable, predictable

  • so that if you insert something now, you can absolutely

  • find it if present later.

  • Unfortunately, if our hash table does look

  • like this, just a simple array from bracket 0 to bracket n minus 1 dot,

  • dot, dot in between, and it's just an array

  • for integers or an array for strings or whatever, once you put something here,

  • or here, or here, that's it.

  • There is no more room to put another element there

  • however wide I might have drawn this table.

  • If there's an int there, that's it.

  • So, what could you do?

  • Suppose that you do have an array structure like this,

  • and giving up on an insertion is unacceptable.

  • You have a whole bunch of elements to store in a table like this,

  • and maybe it's just where you're supposed to take attendance or put

  • people's names.

  • So, if you say, oh, Alice is here today.

  • Let me go ahead and hash on Alice's name and put her where the As should go.

  • Oh, Zoe is here, Z-O-E, so we'll put her down there.

  • And then who else?

  • Alex is here.

  • Dammit, Alex, no room for you in our hash table,

  • because Alice is already there.

  • This is stupid.

  • If we have data we want to insert into this data structure,

  • it would seem that I have 24 available spots into which I could put Alex

  • and yet I'm just stubbornly trying to put him where only the As belong.

  • So, in this kind of scenario, I need to put Alex in here somewhere.

  • I clearly have space.

  • You know what?

  • Let me just probe the array looking for the first available spot.

  • OK, Alex, you're just going to go here, and if someone else like Erin appears,

  • fine.

  • You just are going to go over here.

  • So, you try to put the letter As where you want them to go,

  • but if there's already someone there, just

  • probe deeper into the data structure looking for the first available slot.

  • So, this is a general technique in programming called linear probing

  • whereby you have a data structure.

  • If you hash to some location like the letter A there's a collision,

  • something is there, you probe further in the data structure just

  • looking for some place you can put it.
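A sketch of linear probing in C over a 26-slot table of names; the names, sizes, and helper are illustrative, not from the lecture:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOTS 26

// NULL marks an empty slot.
static const char *table[SLOTS];

// Try the hashed slot first; on a collision, probe forward
// (wrapping around the end) until an empty slot turns up.
bool probe_insert(const char *name)
{
    unsigned int start = toupper((unsigned char) name[0]) - 'A';
    for (int offset = 0; offset < SLOTS; offset++)
    {
        unsigned int i = (start + offset) % SLOTS;
        if (table[i] == NULL)
        {
            table[i] = name;
            return true;
        }
    }
    return false; // every slot is full
}
```

With "Alice" already in slot 0, inserting "Alex" probes forward and lands in slot 1: constant time in the happy case, but devolving toward linear if many names collide.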

  • So, you get close to constant time decision-making.

  • Put A here, put Z here.

  • And because this is an array, you have random access with your square bracket

  • notation, but if you have lots of As and not too many Zs, or Bs, or Ds,

  • it's possible this approach could devolve back into linear time.

  • So, in the ideal case we have one A, one B, one Z, and everything

  • in between, that's constant time.

  • We have our holy grail, constant time operations for a data structure,

  • but not if we want to support insertion of other elements,

  • even those that hash to the same location.

  • So, what's the fix?

  • Well, if the problem is that we've already

  • made room-- we already have used this space for Alice, you know what?

  • If we need to put someone else here, why don't we just create

  • dynamically some more space?

  • We have malloc now.

  • We have dynamic memory allocation.

  • Why don't we just extend our data structure laterally, horizontally--

  • artistically here-- so that, yes, you try to go to that first location.

  • But if there are multiple people that are meant to go there,

  • multiple values, go ahead and just link them together,

  • thereby merging the idea of a hash table and a linked list with a data structure

  • that might look like this.

  • So, this is an example, somewhat arbitrary, of 31 days out of a month.

  • And if you actually hash on people's birth dates,

  • as I think this author did, you can think of your hash table

  • still as an array.

  • But that array does not store strings, it does not store integers.

  • It only stores pointers, 31 total in this case-- some of which

  • might be null, per the vertical diagonal slash--

  • but those pointers in turn point to the beginning of linked lists.

  • So, if multiple people were born on the fourth of some month,

  • you would put J. Adams and W. Floyd in a linked list at that location.

  • If both Aaron, and Alex, and Alice, and other students with the names A

  • all belong at that first location in my previous table, that's fine.

  • Just string them together with a linked list.
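Merging a hash table with linked lists, so-called separate chaining, might be sketched in C as follows; again, the type and function names are my own:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 26

typedef struct entry
{
    const char *name;
    struct entry *next;
}
entry;

// Each bucket is a pointer to the head of a linked list (NULL if empty).
static entry *buckets[BUCKETS];

// Prepend the name to its bucket's list; collisions just grow the chain.
void chain_insert(const char *name)
{
    unsigned int b = toupper((unsigned char) name[0]) - 'A';
    entry *e = malloc(sizeof(entry));
    e->name = name;
    e->next = buckets[b];
    buckets[b] = e;
}

// Linear scan of one (ideally short) chain.
bool chain_contains(const char *name)
{
    unsigned int b = toupper((unsigned char) name[0]) - 'A';
    for (entry *e = buckets[b]; e != NULL; e = e->next)
    {
        if (strcmp(e->name, name) == 0)
        {
            return true;
        }
    }
    return false;
}
```

Inserting "Alice" and "Alex" chains them both in bucket 0: no insertion is ever refused, at the cost of a chain you may have to walk.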

  • Much like with these buckets, at the end of the day, I'm still creating piles.

  • And at the end of the day, I still have to go through them all, ultimately.

  • But each of these piles is 1/26 the size of it

  • would have been if everyone just came up at the end of the exam

  • and just piled all their books in the same pile.

  • So, whereas, these algorithms at the end of the day

  • are still devolving, if you will-- or these data structures

  • are devolving, if you will, into linear time operations,

  • in the worst case if these things just get really long and stringy,

  • at least in actuality they might be as good as 1/31 as long or 1/26 as tall.

  • And so, now there's this dichotomy in this week five

  • of asymptotic running time, the theoretical running

  • time that we've really been belaboring and the actual running time.

  • Just because something is n squared does not mean it's bad.

  • If there's only a few elements, n squared is great.

  • It's going to happen super fast if your computer is 1 gigahertz,

  • or 2 gigahertz, or faster these days.

  • N squared in and of itself isn't bad.

  • It just gets really bad when your data gets large.

  • But in practice, even n squared divided by 2 is actually better than n squared.

  • So, a couple weeks ago when I was saying don't worry about the lower order

  • terms, the constant terms, focus only on n squared

  • and not n or anything you're dividing by, that's fine theoretically,

  • but in actuality you're going to feel that kind of difference.

  • So, here's one last data structure that we'll call a trie-- so trie,

  • short for retrieval somehow, T-R-I-E, but pronounced try.

  • And this one is cool because this now is really

  • like a weird offspring of these data structures from today.

  • But it's a tree each of whose nodes is an array.

  • And a trie is really good for storing words like words in a dictionary.

  • Indeed, one of the problems ahead for you in CS50

  • is going to be to implement a spell checker, which effectively means build

  • a dictionary in memory, and you'll be challenged

  • to spell check words as fast as you can, storing

  • as many as 100,000 English words somehow in your computer's memory

  • and answering questions of the form is this a word, is this a word,

  • is this a word.

  • That's, after all, what spell checking is.

  • So, a trie is kind of interesting in that-- and this is an excerpt of an

  • artist's rendition thereof-- the root node here is this rectangle here,

  • and that of course looks like an array.

  • And notice what's implicit in this.

  • If this is location A and this is location Z,

  • the author here has just decided to only show you

  • those letters that matter for the sake of discussion.

  • But the fact that the M location here is not blank

  • means there's a pointer there.

  • Indeed, what are these arrays?

  • They are arrays of pointers to other nodes.

  • So, the fact that M is not null and it leads to this node, and notice that A

  • is not null and it leads to this node, and then this node, and then this node.

  • And this is where the artist is just taking some liberties.

  • This tree would be monstrously wide, because all of these arrays

  • are so darn wide, so he or she is just showing you the width-- or the element

  • that we care about, M, A, X, W, E, L, L, and then some special sentinel symbol

  • delta, but it could be anything.

  • This is null, really.

  • This is how using a trie a programmer could store the name Maxwell,

  • M-A-X-W-E-L-L, by simply leaving little bread crumbs, if you will,

  • from one node to another such that each of those elements in the array is

  • a pointer to another array.

  • And if you keep following these pointers, following the bread crumbs

  • and you eventually find yourself at this special sentinel value--

  • and actually, it wouldn't be null, it would be like a Boolean saying true,

  • this is a word-- you can, just by storing a single yes or no at this

  • location way down here, implicitly reveal that M-A-X-W-E-L-L was in fact a word.

  • Let's follow another.

  • So, let's say Turing, T-U-R-I-N-G, check, Boolean true.

  • Turing is in this dictionary as well.

  • So, if there are bread crumbs that lead to null, that word is not in here.

  • So, apparently there are no names starting with A through L,

  • and there are no names from U through Z or some of the letters in between,

  • because those pointers are implicitly and pictorially null.

  • But let's consider, then, what is the running time of inserting or looking up

  • a name in a trie? Thus far, pretty much all of the data

  • structures we've talked about have pretty slow running times,

  • linear in the worst case.

  • So, if we used an array to store people's names

  • or we used to linked list to store people's names, in the worst case

  • we had linear running time, unless maybe we sort things, but even

  • then that costs us some time.

  • So, linear or maybe logarithmic was the best we could do.

  • And even with a hash table, whereby, maybe we

  • store Maxwell at the M location in our table,

  • he might still have a linked list of a whole bunch of other M people.

  • That, again, can devolve into something linear, a linear linked list.

  • But what about a trie?

  • To answer the question is Maxwell in a trie, what do we do?

  • We start at the root and we follow the pointer that represents m,

  • and then we follow the pointer there that represents A, then X, W, E, L, L,

  • and we look for at the end of that series of steps a true false value.

  • And if it's true, yes, Maxwell is here.

  • What about Turing?

  • Well, we start at the pointer that represents

  • T, then U, R, I, N, G, then check.

  • Oh, true.

  • Turing is in there.

  • Let's look for David.

  • No, false.

  • There's not even a pointer there.

  • David is not in this dictionary.

  • So, how many steps did that each take?

  • To tell whether Maxwell was in the dictionary,

  • we followed M-A-X-W-E-L-L and then looked at the Boolean, so that was eight steps.

  • And to look up Turing was T-U-R-I-N-G. And then that Boolean,

  • that was seven steps.

  • Those numbers have nothing to do with how many words are already in the trie.

  • There might be-- and there are only a dozen or so here--

  • there might be thousands of actual words

  • in this dictionary, but we're still going to find Alan Turing by way

  • of T-U-R-I-N-G Boolean seven steps, and M-A-X-W-E-L-L Boolean, eight steps.

  • It doesn't matter how many other data elements are in this trie.

  • And that's what's powerful, because if there

  • is an upper bound on the number of letters in an English

  • word-- which is kind of true.

  • I've rarely typed words that are longer than I don't

  • know 10 characters, 15 characters.

  • At some point there might exist these words,

  • but no one actually says or types these words.

  • Those are effectively constants.

  • The maximum length of a word in English is surely some constant,

  • because there is one word that's the longest.

  • That's a constant value, which means inserting a name,

  • or searching for a name, or removing a name from a trie

  • does depend on the length of the name, but it does not

  • depend on how many pieces of data are already in the data structure.

  • And as such, it is constant time.
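A sketch of a trie in C, with insertion and lookup; the node shape mirrors the picture above (26 child pointers plus a Boolean), though the function names are my own:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct trie_node
{
    bool is_word;                    // the "Boolean true" at the end of a path
    struct trie_node *children[26];  // one pointer per letter, NULL if absent
}
trie_node;

// calloc zeroes the node: all children NULL, is_word false.
trie_node *make_node(void)
{
    return calloc(1, sizeof(trie_node));
}

// Insert: one step per letter, no matter how many words are stored.
void trie_insert(trie_node *root, const char *word)
{
    for (int i = 0; word[i] != '\0'; i++)
    {
        int c = toupper((unsigned char) word[i]) - 'A';
        if (root->children[c] == NULL)
        {
            root->children[c] = make_node();
        }
        root = root->children[c];
    }
    root->is_word = true;
}

// Lookup: follow the bread crumbs; a missing pointer means "not a word".
bool trie_check(const trie_node *root, const char *word)
{
    for (int i = 0; word[i] != '\0'; i++)
    {
        int c = toupper((unsigned char) word[i]) - 'A';
        if (root->children[c] == NULL)
        {
            return false;
        }
        root = root->children[c];
    }
    return root->is_word;
}
```

Note that "MAX" would follow valid pointers but find is_word false, so a prefix of a stored word is correctly rejected; only the sentinel Boolean at the end of a full path says "this is a word."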

  • So, now in C, we have a whole bunch of new syntax

  • with which to represent data structures, namely actual structs in C,

  • and we have pointers, and we have malloc with which

  • we can build more interesting shapes, if you will, in memory.

  • And we now have a number of abstract data types and actual data structures

  • we can build using these ingredients with which we can now solve problems

  • that are going to demand all the more resources, all the more time, all

  • the more space, in which case efficiency and good design

  • is going to be ever more important.

  • All this and more next time.

  • AUDIENCE: She thought she was doing the right thing.

  • [AUDIO PLAYBACK]

  • -Tell me more.

  • -David was sure it had to be the muppet, something called muppet mode,

  • but the pressure was too much.

  • -This is Mario in muppet mode.

  • Take 23.

  • [HONKING]

  • -What's happening?

  • I thought this is what you always wanted,

  • to star in the walkthrough videos, to have all of YouTube's eyes

  • watching you.

  • [HONKING]

  • -Yes, you are.

  • You have to be.

  • Now stand up straight, tuck in your shirt, look into the camera!

  • Take it again from the top.
