
  • The following content is provided under a Creative

  • Commons license.

  • Your support will help MIT OpenCourseWare

  • continue to offer high quality educational resources for free.

  • To make a donation or view additional materials

  • from hundreds of MIT courses, visit MIT OpenCourseWare

  • at ocw.mit.edu.

  • PROFESSOR: So today, we're going to talk about the probability

  • that a random variable deviates by a certain amount

  • from its expectation.

  • Now, we've seen examples where a random variable

  • is very unlikely to deviate much from its expectation.

  • For example, if you flip 100 mutually

  • independent fair coins, you're very

  • likely to wind up with close to 50

  • heads, very unlikely to wind up with 25 or fewer

  • heads, for example.

  • We've also seen examples of distributions

  • where you are very likely to be far from your expectation,

  • for example, that problem when we had the communications

  • channel, and we were measuring the latency of a packet

  • crossing the channel.

  • There, most of the time, your latency

  • would be 10 milliseconds.

  • But the expected latency was infinite.

  • So you're very likely to deviate a lot from your expectation

  • in that case.

  • Last time, we looked at the variance.

  • And we saw how that gave us some feel for the likelihood

  • of being far from the expectation--

  • high variance meaning you're more likely to deviate

  • from the expectation.

  • Today, we're going to develop specific tools

  • for bounding or limiting the probability you

  • deviate by a specified amount from the expectation.

  • And the first tool is known as Markov's theorem.

  • Markov's theorem says that if the random variable is always

  • non-negative, then it is unlikely to greatly exceed

  • its expectation.

  • In particular, if R is a non-negative random variable,

  • then for all x bigger than 0, the probability

  • that R is at least x is at most the expected value of R,

  • the mean, divided by x.

  • So in other words, if R is never negative-- for example,

  • say the expected value is small compared to x.

  • Then the probability R is large will be a small number.

  • Because I'll have a small number over a big number.

  • So it says that you are unlikely to greatly exceed

  • the expected value.

  • So let's prove that.

  • Now, from the theorem of total expectation

  • that you did in recitation last week,

  • we can compute the expected value of R

  • by looking at two cases-- the case when R is at least x,

  • and the case when R is less than x.

  • That's from the theorem of total expectation.

  • I look at two cases.

  • R is bigger than x.

  • Take the expected value there times the probability

  • of this case happening plus the case when R is less than x.

  • OK, now since R is non-negative, this is at least 0.

  • R can't ever be negative.

  • So the expectation can't be negative.

  • A probability can't be negative.

  • So this is at least 0.

  • And this is trivially at least x.

  • Because I'm taking the expected value of R

  • in the case when R is at least x.

  • So R is always at least x in this case.

  • So its expected value is at least x.

  • So that means that the expected value of R

  • is at least x times the probability

  • R is greater than or equal to x.

  • And now I can get the theorem by just dividing by x.

  • I'm less than or equal to the expected value

  • of R divided by x.

  • So it's a very easy theorem to prove.

  • But it's going to have amazing consequences

  • that we're going to build up through a series of results

  • today.
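
As a quick sanity check of Markov's theorem, here is a short Python sketch; the uniform distribution on 0 through 10 is an assumed toy example, not something from the lecture:

    import random

    # Markov: for non-negative R, P(R >= x) <= E[R] / x.
    # Assumed toy distribution: R uniform on {0, 1, ..., 10}, so E[R] = 5.
    samples = [random.randint(0, 10) for _ in range(100_000)]
    mean = sum(samples) / len(samples)

    for x in (2, 5, 8, 10):
        empirical = sum(1 for r in samples if r >= x) / len(samples)
        print(f"x={x:2d}  P(R>=x) ~ {empirical:.3f}  Markov bound = {mean / x:.3f}")
    # The empirical tail probability never exceeds E[R]/x.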

  • Any questions about Markov's theorem and the proof?

  • All right, there's a simple corollary, which is useful.

  • Again, if R is a non-negative random variable,

  • then for all c bigger than 0, the probability

  • that R is at least c times its expected value is at most 1/c.

  • So the probability you're twice your expected value

  • is at most 1/2.

  • And the proof is very easy.

  • We just set x to be equal to c times the expected

  • value of R in the theorem.

  • So I just plug in x is c times the expected value of R.

  • And I get expected value of R over c times the expected value

  • of R, which is 1/c.

  • So you just plug in that value in Markov's theorem,

  • and it comes out.

  • All right, let's do some examples.

  • Let's let R be the weight of a random person

  • uniformly selected.

  • And I don't know what the distribution of weights

  • is in the country.

  • But suppose that the expected value of R,

  • which is the average weight, is 100 pounds.

  • So if I average over all people, their weight is 100 pounds.

  • And suppose I want to know the probability

  • that the random person weighs at least 200 pounds.

  • What can I say about that probability?

  • Do I know it exactly?

  • I don't think so.

  • Because I don't know what the distribution of weights is.

  • But I can still get an upper bound on this probability.

  • What bound can I get on the probability

  • that a random person has a weight

  • of at least 200 given the facts here?

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Yes, well, it's 100 over 200, right.

  • It's at most the expected value, which is 100,

  • over the x, which is 200.

  • And that's equal to 1/2.

  • So the probability that a random person weighs 200 pounds

  • or more is at most 1/2.

  • Or I could plug it in here.

  • The expected value is 100.

  • 200 is twice that.

  • So c would be 2 here.

  • So the probability of being twice the expectation

  • is at most 1/2.

  • Now of course, I'm using the fact

  • that weight is never negative.

  • That's obviously true.

  • But it is implicitly being used here.

  • So what fraction of the population

  • now can weigh at least 200 pounds?

  • Slightly different question.

  • Before I asked you, if I take a random person,

  • what's the probability they weigh at least 200 pounds?

  • Now I'm asking, what fraction of the population

  • can weigh at least 200 pounds if the average is 100?

  • What is it?

  • Yeah?

  • AUDIENCE: At most 1/2.

  • PROFESSOR: At most 1/2.

  • In fact, it's the same answer.

  • And why?

  • Why can't everybody weigh 200 pounds,

  • so it would be all the population

  • weighs 200 pounds at least?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Probability would be 1, and that can't happen.

  • And in fact, intuitively, if everybody

  • weighs at least 200 pounds, the average

  • is going to be at least 200 pounds.

  • And we said the average was 100.

  • And this is illustrating this interesting thing

  • that probability implies things about averages and fractions.

  • Because it's really the same thing in disguise.

  • The connection is, if I've got a bunch of people, say,

  • in the country, I can convert a fraction

  • that have some property into a probability

  • by just selecting a random person.

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: No, the variance could be very big.

  • Because I might have a person that weighs a million pounds,

  • say.

  • So you have to get into that.

  • But it gets a little bit more complicated.

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: No, there's nothing being

  • assumed about the distribution, nothing at all, OK?

  • So that's the beauty of Markov's theorem.

  • Well, I've assumed one thing.

  • I assume that there is no negative values.

  • That's it.

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: That's correct.

  • They can distribute it any way with positive values.

  • But we have a fact here we've used, that the average was 100.

  • So that does limit your distribution.

  • In other words, you couldn't have a distribution where

  • everybody weighs 200 pounds.

  • Because then the average would be 200, not 100.

  • But anything else where they're all positive

  • and they average 100, you know that at most half can be 200.

  • Because if you pick a random one,

  • the probability of getting one that's 200 is at most 1/2,

  • which follows from Markov's theorem.

  • And that's partly why it's so powerful.

  • You didn't know anything about the distribution, really,

  • except its expectation and that it was non-negative.

  • Any other questions about this?

  • I'll give you some more examples.

  • All right, here's another example.

  • Is it possible on the final exam for everybody in the class

  • to do better than the mean score?

  • No, of course not.

  • Because if they did, the mean would be higher.

  • Because the mean is the average.

  • OK, let's do another example.

  • Remember the Chinese appetizer problem?

  • You're at the restaurant, big circular table.

  • There's n people at the table.

  • Everybody has one appetizer in front of them.

  • And then the joker spins the thing

  • in the middle of the table.

  • So it goes around and around.

  • And it stops in a random uniform position.

  • And we wanted to know, what's the expected number of people

  • to get the right appetizer back?

  • What was the answer?

  • Does anybody remember?

  • One.

  • So you expect one person to get the right appetizer back.

  • Well, say I want to know the probability that all n people

  • got the right appetizer back.

  • What does Markov tell you about the probability

  • that all n people get the right appetizer back?

  • 1/n.

  • The expected value is 1.

  • And now you're asking the probability

  • that you get R is at least n.

  • So x is n.

  • So it's 1 in n.

  • And what was the probability, or what is the actual probability?

  • In this case, you know the distribution,

  • that everybody gets the right appetizer back, all n.

  • 1 in n.

  • So in the case of the Chinese appetizer problem,

  • Markov's bound is actually the right answer, right on target,

  • which gives you an example where you can't improve it.

  • By itself, if you just know the expected value,

  • there's no stronger theorem that way.

  • Because Chinese appetizer is an example where the bound

  • you get, 1/n, of n people getting the right appetizer

  • is in fact the true probability.

  • OK, what about the hat check problem?

  • Remember that?

  • So there's n men put the hats in the coat closet.

  • They get uniformly randomly scrambled.

  • So it's a random permutation applied to the hats.

  • Now each man gets a hat back.

  • What's the expected number of men to get the right hat back?

  • One, same as the other one.

  • Because you've got n men each with a 1 in n chance,

  • so it's 1.

  • Markov says the probability that n men get the right hat back is

  • at most 1 in n, same as before.

  • What's the actual probability that all n

  • men get the right hat back?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: 1 in n factorial.

  • So in this case, Markov is way off the mark.

  • It says 1 in n.

  • But in fact the real bound is much smaller.

  • So Markov is not always tight.

  • It's always an upper bound.

  • But it sometimes is not the right answer.

  • And to get the right answer, often you

  • need to know more about the distribution.
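
Since the lecture contrasts these two examples, a small Python sketch may help; the choice n = 5 and the simulation details are our own assumptions:

    import math
    import random

    n = 5
    print("Markov bound on P(all n correct):", 1 / n)

    # Chinese appetizer: one uniform rotation; all n match only for the
    # identity rotation, so the exact answer is 1/n -- Markov is tight.
    print("appetizer exact:", 1 / n)

    # Hat check: a uniform permutation; all n match only for the identity
    # permutation, so the exact answer is 1/n! -- Markov is far from tight.
    print("hat check exact:", 1 / math.factorial(n))

    # Sanity-check the hat check case by simulation.
    trials, hits = 200_000, 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        hits += all(perm[i] == i for i in range(n))
    print("hat check simulated:", hits / trials)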

  • OK, what if R can be negative?

  • Is it possible that Markov's theorem holds there?

  • Because I use the assumption in the theorem.

  • Can anybody give me an example where it doesn't

  • work if R can be negative?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Yeah, good, so for example,

  • say probability R equals 1,000 is 1/2,

  • and the probability R equals minus 1,000 is 1/2.

  • Then the expected value of R is 0.

  • And say we asked the probability that R is at least 1,000.

  • Well, that's going to be 1/2.

  • But Markov would bound it by the expected value of R over 1,000,

  • which would be 0.

  • So Markov's theorem really does need that R to be non-negative.

  • In fact, let's see if we saw where we used it in the proof.

  • Anybody see where we use that fact in the proof,

  • that R can't be negative?

  • What is it?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Well, no, because x is positive.

  • We said x is positive.

  • So it's not used there.

  • But that's a good one to look at.

  • Yeah?

  • AUDIENCE: [INAUDIBLE] is greater than or equal to 0.

  • PROFESSOR: Yeah, if R can be negative,

  • then this is not necessarily a positive number.

  • It could be a negative number.

  • And then this inequality doesn't hold.

  • OK, good.

  • All right, now it turns out there

  • is a variation of Markov's theorem

  • you can use when R is negative.

  • Yeah.

  • AUDIENCE: [INAUDIBLE] but would it be OK just

  • to shift everything up?

  • PROFESSOR: Yeah, yeah, that's great.

  • If R has a limit on how negative it can be,

  • then you make an R prime, which just adds that limit to R,

  • makes it positive or non-negative.

  • And now use Markov's theorem there.

  • And that is now an analogous form of Markov's theorem

  • when R can be negative, but there's a lower limit to it.

  • And I won't stop to prove that here.

  • But that's in the text and something

  • you want to be familiar with.

  • What I do want to do in class is another case

  • where you can use Markov's theorem

  • to analyze the probability or upper bound the probability

  • that R is very small, less than its expectation.

  • And it's the same idea as you just suggested.

  • So let's state that.

  • If R is upper bounded, has a hard limit on the upper bound,

  • by u for some u in the real numbers,

  • then for all x less than u, the probability that R

  • is less than or equal to x is at most u

  • minus the expected value of R over u minus x.

  • So in this case, we're getting a probability

  • that R is less than something instead of R

  • is bigger than something.

  • And we're going to do it using a simple trick that we'll be

  • sort of using all day, really.

  • The probability that R is less than x, this event,

  • R is less than x, is the same as the event u minus R

  • is at least u minus x.

  • So what have I done?

  • I put negative R over here, subtract x from each side,

  • add u to each side.

  • I've got to put a less than or equal to here.

  • So R is less than or equal to x if and only

  • if u minus R is at least u minus x.

  • It's simple math there.

  • And now I'm going to apply Markov to this.

  • I'm going to apply Markov to this random variable.

  • And this will be the value I would have had

  • for x up in Markov's theorem.

  • Why is it OK to apply Markov to u minus R?

  • AUDIENCE: You could just define the new random variable

  • to be u minus R.

  • PROFESSOR: Yeah, so I got a new random variable.

  • But what do I need to know about that new random variable

  • to apply Markov?

  • AUDIENCE: u is always greater than R.

  • PROFESSOR: u is always greater than R, or at least as big

  • as R. So u minus R is always non-negative.

  • So I can apply Markov now.

  • And when I apply Markov, I'll get this is at most-- maybe

  • I'll go over here.

  • The probability that-- ooh, not R here.

  • This is probability.

  • The probability that u minus R is at least u minus x

  • is at most the expected value of that random variable

  • over this value.

  • And well now I just use the linearity of expectation.

  • I've got a scalar here.

  • So this is u minus the expected value of R over u minus x.

  • So I've used Markov's theorem to get a different version of it.

  • All right, let's do an example.

  • Say I'm looking at test scores.

  • And I'll let R be the score of a random student

  • uniformly selected.

  • And say that the max score is 100.

  • So that's u.

  • All scores are at most 100.

  • And say that I tell you the class average,

  • or the expected value of R, is 75.

  • And now I want to know, what's the probability

  • that a random student scores 50 or below?

  • Can we figure that out?

  • I don't know anything about the distribution,

  • just that the max score is 100 and the average score is 75.

  • What's the probability that a random student

  • scores 50 or less?

  • I want to upper bound that.

  • So we just plug it into the formula.

  • u is 100.

  • The expected value is 75.

  • u is 100.

  • And x is 50.

  • And that's 25 over 50, is 1/2.

  • So at most half the class can score 50 or below.

  • And I can state it either as a probability question or as a deterministic fact

  • if I know the average is 75 and the max is 100.

  • Of course, another way of thinking about that

  • is if more than half the class scored 50

  • or below, your average would have

  • had to be lower, even if everybody else was

  • right at 100.

  • It wouldn't average out to 75.
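
Here is that test-score example in a few lines of Python; the extreme half-50, half-100 class at the end is an assumed distribution chosen to show the bound is achievable:

    # Bounded-above variant of Markov: if R <= u always, then for x < u,
    #   P(R <= x) <= (u - E[R]) / (u - x).
    u, mean, x = 100, 75, 50
    print("bound on P(score <= 50):", (u - mean) / (u - x))   # 0.5

    # Extreme class achieving the bound: half score 50, half score 100.
    scores = [50] * 50 + [100] * 50
    assert sum(scores) / len(scores) == 75
    print("fraction at or below 50:", sum(s <= 50 for s in scores) / len(scores))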

  • All right, any questions about that?

  • OK, so sometimes Markov is dead on right,

  • gives the right answer.

  • For example, half the class could have scored 50,

  • and half could have gotten 100 to make it be 75.

  • And sometimes it's way off, like in the hat check problem.

  • Now, if you know more about the distribution,

  • then you can get better bounds, especially the cases

  • when you're far off.

  • For example, if you know the variance in addition

  • to the expectation, or aside from the expectation,

  • then you can get better bounds on the probability

  • that the random variable is large.

  • And in this case, the result is known as Chebyshev's theorem.

  • I'll do that over here.

  • And it's the analog of Markov's theorem based on variance.

  • It says, for all x bigger than 0,

  • and any random variable R-- could even be negative--

  • the probability that R deviates from its expected value

  • in either direction by at least x

  • is at most the variance of R divided by x squared.

  • So this is like Markov's theorem,

  • except that we're now bounding the deviation

  • in either direction.

  • Instead of expected value, you have variance.

  • Instead of x, you've got x squared, but the same idea.

  • In fact, the proof uses Markov's theorem.

  • Well, the probability that R deviates

  • from its expected value by at least x,

  • this is the same event, or happens if and only

  • if R minus expected value squared is at least x squared.

  • I'm just going to square both sides here.

  • OK, I square both sides.

  • And since this is positive and this is positive,

  • I can square both sides and maintain the inequality.

  • Now I'm going to apply Markov's theorem

  • to that random variable.

  • It's a random variable.

  • It's R minus expected value squared.

  • So it's a random variable.

  • And what's nice about this random variable that

  • lets me apply Markov's theorem?

  • It's a square.

  • So it's always non-negative.

  • So I can apply Markov's theorem.

  • And by Markov's theorem, this probability

  • is at most the expected value of that divided by this.

  • That's what Markov's theorem says as long as this is always

  • non-negative.

  • All right, what's a simpler expression

  • for this, the expected value of the square of the deviation

  • of a random variable?

  • That's the variance.

  • That's the definition of variance.

  • So that is just the variance of R over x squared.

  • And we're done.

  • So Chebyshev's theorem is really just another version

  • of Markov's theorem.

  • But now it's based on the variance.

  • OK, any questions?

  • OK, so there's a nice corollary for this, just as with

  • Markov's theorem.

  • It says the probability that the absolute value, the deviation,

  • is at least c times the standard deviation of R.

  • So I'm looking at the probability

  • that R differs from its expectation

  • by at least some scalar c times the standard deviation.

  • Well, what's that?

  • Well, that's the variance of R over the square of this thing--

  • c squared times the standard deviation squared.

  • What's the square of the standard deviation?

  • That's the variance.

  • They cancel, so it's just 1 over c squared.

  • So the probability of more than twice the standard deviation

  • off the expectation is at most 1/4, for example.
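
As with Markov, a short Python check of Chebyshev may be useful; the sum-of-two-dice distribution is our own choice of example:

    import random
    import statistics

    # Chebyshev: P(|R - E[R]| >= c * sigma) <= 1 / c^2, any distribution.
    samples = [random.randint(1, 6) + random.randint(1, 6)
               for _ in range(100_000)]
    mu = statistics.mean(samples)
    sigma = statistics.pstdev(samples)

    for c in (1, 2, 3):
        tail = sum(abs(r - mu) >= c * sigma for r in samples) / len(samples)
        print(f"c={c}  P(|R-mu| >= c*sigma) ~ {tail:.4f}  bound = {1/c**2:.4f}")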

  • All right, let's do some examples of that.

  • Maybe we'll leave Markov up there.

  • OK, say we're looking at IQs.

  • In this case, we're going to let R be the IQ of a random person.

  • All right, now we're going to assume--

  • and this actually is the case-- that R is always at least 0,

  • despite the fact that probably most of you

  • have somebody you know who you think has a negative IQ.

  • They can't be negative.

  • They have to be non-negative.

  • In fact, IQs are adjusted.

  • So the expected IQ is supposed to be 100,

  • although actually the averages may be in the 90's.

  • And it's set up so that the standard deviation of IQ

  • is supposed to be 15.

  • So we're just going to assume those are facts on IQ.

  • And that's what it's meant to be.

  • And now we want to know, what's the probability a random person

  • has an IQ of at least 250?

  • Now Marilyn, from "Ask Marilyn," has an IQ pretty close to 250.

  • And she thinks that's pretty special, pretty rare.

  • So what can we say about that?

  • In particular, say we used Markov.

  • What could you say about the probability of having

  • an IQ of at least 250?

  • What does Markov tell us?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: What is it?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Not quite 1 in 25, but you're on the right track.

  • It's not quite 2/3.

  • It's the expected value, which is 100,

  • over the x value, which is 250.

  • So it's 1 in 2.5, or 0.4.

  • So the probability is at most 0.4,

  • so 40% chance it could happen, potentially, but no

  • bigger than that.

  • What about Chebyshev?

  • See if you can figure out what Chebyshev

  • says about the probability of having an IQ of at least 250.

  • It's a little tricky.

  • You've got to sort of plug it into the equation

  • there and get it to fit in the right form.

  • Chebyshev says that-- let's get in the right form.

  • I've got probability that R is at least 250.

  • I've got to get it into that form up there.

  • So that's the probability that-- well, first R minus 100

  • is at least 150.

  • So I've got R minus the expected value.

  • I'm sort of getting it ready to apply Chebyshev here.

  • And then 150-- how many standard deviations is 150?

  • 10, all right?

  • So this is the probability that R minus the expected value of R

  • is at least 10 standard deviations.

  • That's what I'm asking.

  • I'm not quite there.

  • I'm going to use the corollary there.

  • But I've got to get that absolute value thing in.

  • But it's upper bounded by the probability

  • of the absolute value of R minus expected value

  • bigger than or equal to 10 standard deviations.

  • Because this allows for two cases.

  • R is 10 standard deviations high,

  • and R is 10 standard deviations low or more.

  • So this is upper bounded by that.

  • And now I can plug in Chebyshev in the corollary form.

  • And what's the answer when I do that?

  • 1 in 100-- the probability of being off

  • by 10 standard deviations or more is at most 1 in 100,

  • 1 in 10 squared.

  • So it's a lot better bound.

  • It's 1% instead of 40%.

  • So knowing the variance of the standard deviation

  • gives you a lot more information and generally gives you

  • much better bounds on the probability

  • of deviating from the mean.

  • And the reason it gives you better bounds

  • is because the variance is squaring deviations.

  • So they count a lot more.
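
The two IQ bounds just computed, in Python for reference:

    mean, sigma, x = 100, 15, 250          # the lecture's IQ numbers

    print("Markov bound:   ", mean / x)    # 100/250 = 0.4
    c = (x - mean) / sigma                 # 150/15 = 10 standard deviations
    print("Chebyshev bound:", 1 / c**2)    # 1/100 = 0.01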

  • All right, now let's look at this step a little bit more.

  • All right, let's say here is a line,

  • and here's the expected value of R.

  • And say here's 10 standard deviations high here.

  • So this will be more than 10 standard deviations.

  • And this will be 10 standard deviations on the low side.

  • So here, I'm low.

  • Now, this line here with the absolute value

  • is figuring out the probability of being low or high.

  • This is the probability that the absolute value

  • of R minus its expected value is at least

  • 10 standard deviations.

  • What we really wanted to know for bound

  • was just the high side.

  • Now, is it true that then, since the probability of high or low

  • is 1 in 100, the probability of being high is at most 1

  • in 200, half?

  • Is that true?

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Yeah, it is not necessarily true

  • that the high and the low are equal,

  • and therefore the high is half the total.

  • It might be true, but not necessarily true.

  • And that's a mistake that often gets

  • made where you'll take this fact as being less than 1 in 100

  • to conclude that's less than 1 in 200.

  • And that you can't do, unless the distribution is

  • symmetric around the expected value.

  • Then you could do it, if it's a symmetric distribution

  • around the expected value.

  • But usually it's not.

  • Now, there is something better you can say.

  • So let me tell you what it is.

  • But we won't prove it.

  • I think we might prove it in the text.

  • I'm not sure.

  • If you just want the high side or just want the low side,

  • you can do slightly better than 1 in c squared.

  • That's the following theorem.

  • For any random variable R, the probability

  • that R is on the high side by c standard deviations

  • is at most 1 over c squared plus 1.

  • So it's not 1 over 2c squared.

  • It's 1 over c squared plus 1, and the same thing

  • for the probability of being on the low side.

  • Let's see, have I written this right?

  • Hmm, I want to get this as less than or equal to negative

  • c times the standard deviation.

  • So here I'm high by c or more standard deviations.

  • Here I'm low.

  • So R is less than the expected value by at least

  • c standard deviations.

  • And that is also 1 over c squared plus 1.

  • And it is possible to find distributions

  • that hit these targets-- not both at the same time,

  • but one or the other, hit those targets.

  • So that's the best you can say in general.

  • All right, so using this bound, what's

  • the probability that a random person has

  • an IQ of at least 250?

  • It's a little better than 1 in 100.

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Yeah, 1/101.

  • So in fact, the best we can say without knowing any more

  • information about IQs is that it's at most 1/101,

  • slightly better.
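
A sketch of that one-sided bound (this refinement is sometimes called Cantelli's inequality), with the two values used in this lecture:

    # One-sided version: P(R - E[R] >= c * sigma) <= 1 / (c^2 + 1),
    # and the same bound on the low side.
    def one_sided_bound(c: float) -> float:
        return 1 / (c**2 + 1)

    print(one_sided_bound(10))   # IQ >= 250 case: at most 1/101
    print(one_sided_bound(2))    # two sigma or more high: at most 1/5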

  • Now in fact, with IQs, they know more about the distribution.

  • And the probability is a lot less.

  • Because you know more about the distribution than

  • we've assumed here.

  • In fact, I don't think anybody has an IQ over 250

  • as far as I know.

  • Any questions about this?

  • OK, all right, say we give the exam.

  • What fraction of the class can score more than two

  • standard deviations, get two standard deviations or more,

  • away from the average, above or below?

  • Could half the class be two standard deviations

  • off the mean?

  • No?

  • What's the biggest fraction that that could happen?

  • What do I do?

  • What fraction of the class can be two standard deviations

  • or more from the mean?

  • What is it?

  • AUDIENCE: 1/4.

  • PROFESSOR: 1/4, because c is 2.

  • You don't even know what the mean is.

  • You don't know what the standard deviation is.

  • You don't need to.

  • I just asked, you're two standard deviations

  • off or more.

  • At most, 1/4.

  • How many could be two standard deviations

  • high or better at most?

  • 1/5-- 1 over 4 plus 1, good.

  • OK, this holds true no matter what

  • the distribution of test scores is.

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Which one?

  • This one?

  • AUDIENCE: Yeah.

  • PROFESSOR: Oh, that's more complicated.

  • That'll take us several boards to do, to prove that.

  • And I forget if we put it in the text or not.

  • It might be in the text, to prove that.

  • Any other questions?

  • OK so Markov and Chebyshev are sometimes close, sometimes not.

  • Now, for the rest of today, we're

  • going to talk about a much more powerful technique.

  • But it only works in a special case.

  • Now, the good news is this special case

  • happens all the time in practice.

  • And it's the case when you're analyzing a random variable

  • that itself is the sum of a bunch

  • of other random variables.

  • And we've seen already examples like that.

  • And the other random variables have

  • to be mutually independent.

  • And in this case, you get a bound

  • that's called a Chernoff bound.

  • And this is the same Chernoff who figured out

  • how to beat the lottery.

  • And it's interesting.

  • Originally this stuff, Chernoff bounds,

  • was only taught to graduate students.

  • And now we teach it here.

  • Because it's so important.

  • And it really is accessible.

  • It'll be probably the most complicated proof

  • we've done to establish a Chernoff bound.

  • But Chernoff himself, when he discovered this,

  • thought it was no big deal.

  • In fact, he couldn't figure out why

  • everybody in computer science was always writing papers

  • with Chernoff bounds in them.

  • And that's because he didn't put any emphasis

  • on the bounds in his work.

  • But computer scientists who came later

  • found all sorts of important applications.

  • And we'll see some of those today.

  • So let me tell you what the bound is.

  • And the nice thing is it really is

  • Markov's theorem again in disguise,

  • just a little more complicated.

  • Theorem-- it's called a Chernoff bound.

  • Let T1, T2, up to Tn be any mutually independent--

  • that's really important-- random variables

  • such that each of them takes values only between 0 and 1.

  • And if they don't, just normalize them so they do.

  • So we're going to take a bunch of random variables

  • that are mutually independent.

  • And they are all between 0 and 1.

  • Then we're going to look at the sum of those random variables,

  • call that T. Then for any c at least 1,

  • the probability that the sum random variable is at least c

  • times its expected value.

  • So it's going to be the high side here-- is at most e

  • to the minus z, and I'll tell you what that is in a minute,

  • times the expected value of T where z is c natural log of c

  • plus 1 minus c.

  • And it turns out if c is bigger than 1, this is positive.

  • So that's a lot, one of the longest

  • theorems we wrote down here.

  • But what it says is that the probability of being

  • high is exponentially small.

  • As the expected value is big, the chance of being high

  • gets really, really tiny.

  • Now, I'm going to prove it in a minute.

  • But let's just plug in some examples

  • to see what's going on here.

  • So for example, suppose the expected value of T is 100.

  • And suppose c is 2.

  • So we expect to have 100 come out of the sum.

  • The probability we get at least 200-- well,

  • let's figure out what that is.

  • c being 2 we can evaluate z now.

  • It's 2 natural log of 2 plus 1 minus 2.

  • And that's close to but a little larger than 0.38.

  • So we can plug z in, the exponent up there,

  • and find that the probability that T is at least twice

  • its expected value, namely at least 200,

  • is at most e to the minus 0.38 times 100,

  • which is e to the minus 38, which is just really small.

  • So that's just way better than any results you

  • get with Markov or Chebyshev.

  • So if you have a bunch of random variables between 0 and 1,

  • and they're mutually independent, you add them up.

  • If you expect 100 as the answer, the chance

  • of getting 200 or more-- forget about it, not going to happen.
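
The bound is easy to evaluate numerically; here is a minimal Python sketch of the statement and of the c = 2, E[T] = 100 example just worked:

    import math

    # Chernoff (as stated): for mutually independent T_1..T_n in [0, 1]
    # and T their sum, for c >= 1,
    #   P(T >= c * E[T]) <= exp(-z * E[T]),  where z = c*ln(c) + 1 - c.
    def chernoff_bound(c: float, mean: float) -> float:
        z = c * math.log(c) + 1 - c
        return math.exp(-z * mean)

    print(2 * math.log(2) + 1 - 2)    # z ~ 0.386, a little larger than 0.38
    print(chernoff_bound(2, 100))     # ~ e^-38.6, astronomically small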

  • Now, of course Chernoff doesn't apply to all distributions.

  • It has to be this type.

  • This is a pretty broad class.

  • In fact, it contains the class of all Bernoulli distributions.

  • So it includes binomial distributions.

  • Because remember a binomial distribution-- well,

  • remember binomial distributions?

  • That's where T is the sum of Ti's.

  • In binomial, you have Tj is 0 or 1.

  • It can't be in between.

  • And with binomial, all Tj's have the same distribution.

  • With Chernoff, they can all be different.

  • So Chernoff is much broader than binomial.

  • The individual guys here can have different distributions

  • and attain values anywhere between 0 and 1,

  • as opposed to just one or the other.

  • Any questions about this theorem and what it says in the terms

  • there?

  • One nice thing about it is the number of random variables

  • doesn't even show up in the answer here.

  • n doesn't even appear.

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Does not apply to what?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Yeah, when c equals 1, what happens is z is 0.

  • Because I have a log of 1 is 0, and 1 minus 1 is 0.

  • And if z is 0, it says your probability

  • is upper bounded by 1.

  • Well, not too interesting, because any probability

  • is upper bounded by 1.

  • So it doesn't give you any information

  • when c is 1, none at all.

  • As soon as c starts being bigger than 1, which is sort of a case

  • you're interested in, you're bigger than your expectation,

  • then it gives very powerful results.

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Yeah, you can.

  • It's true for n equals 1 as well.

  • Now, it doesn't give you a lot of information.

  • Because if c is bigger than 1 and n was 1, so

  • it's using one variable, what's the probability

  • that a random variable exceeds its expectation, c

  • times its expectation?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Yeah, let's see now.

  • Maybe it does give you information.

  • Because the random variable has a distribution on 0, 1.

  • That's right, so it does give you some information.

  • But I don't think it gives you a lot.

  • I have to think about that.

  • What happens when there's just one guy?

  • Because the same thing is true.

  • It's just now for a single random variable on 0,

  • 1, the chance that you're twice the expected value.

  • I have to think about that.

  • That's a good question.

  • Does it do anything interesting there?

  • OK, all right, so let's do an example

  • of how you might apply this.

  • Say that you're playing Pick 4, and 10 million people

  • are playing.

  • And say in this version of Pick 4,

  • you're picking a four digit number, four single digits.

  • And you win if you get an exact match.

  • So the probability of winning, a person winning,

  • well, they've got to get all four digits right.

  • That's 1 in 10,000, 10 to the fourth.

  • What's the expected number of winners?

  • If I got 10 million people, what's

  • the expected number of winners?

  • What is it?

  • We've got 10 million over 10,000, right?

  • Because what I'm doing here is the number of winners,

  • T, is going to be the sum of 10 million indicator variables.

  • And the probability that any one of these guys wins

  • is 1 in 10,000.

  • So the expected number of winners

  • is 1 in 10,000 added 10 million times, which is this.

  • Is that OK?

  • Everybody should be really familiar with how

  • to whip these things out.

  • The final for sure will have at least a couple

  • of questions where you're going to need

  • to be able to do that kind of stuff.

  • All right, say I want to know the probability of getting

  • at least 2,000 winners, and I want to upper bound

  • that just with the information I've given you.

  • Well, any thoughts about an upper bound?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: What's that?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Yeah, that's a good upper bound.

  • What did you have to assume to get there?

  • e to the minus 380 is a great bound.

  • Because you're going to plug in expected value is 1,000.

  • And we're asking for more than twice the expected value.

  • So it's e to the minus 0.38 times 1,000.

  • And that for sure is-- so you computed this.

  • And that equals e to the minus 380.

  • So that's really small.

  • But what did you have to assume to apply Chernoff?

  • Mutual independence.

  • Mutual independence of what?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: The numbers people picked.

  • And we already know, if people are picking numbers,

  • they don't tend to be mutually independent.

  • They tend to gang up.

  • But if you had a computer picking the numbers randomly

  • and mutually independently, then you would get e to the minus 380

  • by Chernoff, with mutually independent picks.

  • Everybody see why we did that?

  • Because it's a probability of twice your expectation.

  • The total number of winners is the sum of 10 million indicator

  • variables.

  • And indicator variables are 0 or 1.

  • So they fit that definition up there.

  • And so we already figured out z is at least 0.38.

  • And you're multiplying by the expected value of 1,000.

  • That's e to the minus 380, so very, very unlikely.

  • What if they weren't mutually independent?

  • Can you say anything about this, anything at all better than 1,

  • which we know for any probability?

  • Yeah?

  • AUDIENCE: It's possible that everyone

  • chose the same numbers.

  • PROFESSOR: Yes, everyone could have chosen the same number.

  • But that number only comes up with a 1 in 10,000 chance.

  • So you can say something.

  • AUDIENCE: You can use Markov.

  • PROFESSOR: Use Markov.

  • What does Markov give you?

  • What does Markov give you?

  • 1/2, yeah.

  • Because you've got the expected value,

  • 1,000, divided by the bound threshold, 2,000,

  • which is 1/2 by Markov.

  • And that holds true without any independence assumption.

  • Now, there is an enormous difference between 1/2 and e

  • to the minus 380.

  • Independence really makes a huge difference

  • in the bound you can compute.

  • OK, now there's another way we could've gone about this.

  • What kind of distribution does T have in this case?

  • It's binomial.

  • Because it's the sum of indicator random variables, 0,

  • 1's.

  • Each of these is 0, 1.

  • And they're all the same distribution.

  • There's a 1 in 10,000 chance of winning for each one of them.

  • So it's a binomial.

  • So we could have gone back and used

  • the formulas we had for the binomial distribution,

  • plugged it all in, and we'd have gotten something pretty similar

  • here.

  • But Chernoff is so much easier.

  • Remember that pain we would go through

  • with a binomial distribution, the approximation, Stirling's

  • formula, [INAUDIBLE] whatever, the factorials and stuff?

  • And that's a nightmare.

  • This was easy.

  • e to the minus 380 was very easy to compute.

  • And really at that point it doesn't

  • matter if it's minus 381 or minus 382 or whatever.

  • Because it's really small.

  • So often, even when you have a binomial distribution,

  • well, Chernoff will apply.

  • And that's a great way to go.

  • Because it gives you good bounds generally.

  • All right, let's figure out the probability

  • of at least 1,100 winners instead of 1,000.

  • So let's look at the probability of at least 100 extra winners

  • over what we expect out of 10 million.

  • We've got 10 million people.

  • You expect 1,000.

  • We're going to analyze the probability of 1,100.

  • What's c in this case?

  • We're going to use Chernoff.

  • 1.1.

  • So this is 1.1 times 1,000.

  • And that means that z is 1.1 times the natural log

  • of 1.1 plus 1 minus 1.1.

  • And that is close to but at least 0.0048.

  • So this probability is at most, by Chernoff,

  • e to the minus 0.0048 times the expected number of winners

  • is 1,000.

  • So that is e to the minus 4.8, which is less than 1%,

  • 1 in 100.

  • So that's pretty powerful.

  • It says, you've got 10 million people who could win.

  • The chance of even having 100 more than the 1,000 you expect

  • is 1% chance at most-- very, very powerful.

  • It says you really expect to get really

  • close to the mean in this situation.

  • OK, a lot better-- Markov here gives you,

  • what, 1,000 over 1,100.

  • It says your probability could be 90% or something--

  • not very useful.

  • Chebyshev won't give you much here either.

  • So if you're in a situation to apply Chernoff,

  • always go there.

  • It gives you the best bounds.
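
Here are the lottery computations side by side in Python, repeating the small chernoff_bound helper from the earlier sketch so this block is self-contained:

    import math

    mean = 1_000                      # expected winners: 10^7 / 10^4

    def chernoff_bound(c, mean):
        return math.exp(-(c * math.log(c) + 1 - c) * mean)

    print(chernoff_bound(2.0, mean))  # >= 2,000 winners: ~ e^-386
    print(chernoff_bound(1.1, mean))  # >= 1,100 winners: ~ e^-4.8 < 1%

    # Markov alone, with no independence assumption:
    print(mean / 2_000)               # 1/2
    print(mean / 1_100)               # ~ 0.91, not very useful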

  • Any questions?

  • This of course is why computer scientists use it all the time.

  • OK, actually, before I do more examples,

  • let me prove the theorem in a special case

  • to give you a feel for what's involved.

  • The full proof is in the text.

  • I'm going to prove it in the special case

  • where the Tj are 0, 1.

  • So they're indicator random variables.

  • But they don't have to have the same distribution.

  • So it's still more general than you get

  • with a binomial distribution.

  • All right, so we're going to do a proof of Chernoff

  • for the special case where the Tj are either 0 or 1.

  • So they're indicator variables.

  • OK, so the first step is going to seem pretty mysterious.

  • But we've been doing something like it all day.

  • I'm trying to compute the probability T is bigger

  • than c times its expectation.

  • Well, what I'm going to do is exponentiate both of these guys

  • and compute the probability that c to the T

  • is at least c to the c times the expected value of T.

  • Now, this is not the first thing you'd expect to do,

  • probably, if you were trying to prove this.

  • So it's one of those divine insights

  • that you'd make this step.

  • And then I'm going to apply Markov, like we've

  • been doing all day, to this.

  • Now, since T is positive and c is positive, these are equal.

  • And this is never negative.

  • So now by Markov, this is simply upper bounded

  • by the expected value of that, expected value of c to the T,

  • divided by this.

  • And that's by Markov.

  • So everything we've done today is really Markov in disguise.

  • Any questions so far?

  • You start looking at this, you go, oh my god, I

  • got the random variable and the exponent.

  • This is looking like a nightmare.

  • What is the expected value of c to the T,

  • and this kind of stuff?

  • But we're going to hack through it.

  • Because it gives you just an amazingly powerful result

  • when you're done.

  • All right, so we've got to evaluate the expected value

  • of c to the T. And we're going to use the fact that T is

  • the sum of the Tj's.

  • And that means that c to the T equals c to the T1 times

  • c to the T2, and so on, times c to the Tn.

  • The weird thing about this proof is that every step sort of

  • makes it more complicated looking

  • until we get to the end.

  • So it's one of those that's hard to figure out the first time.

  • All right, that means the expected value of c to the T

  • is the expected value of the product of these things.

  • Now I'm going to use the product rule for expectation.

  • Now, why can I use the product rule?

  • What am I assuming to be able to do that?

  • That they are mutually independent,

  • that the c to the Tj's are mutually

  • independent of each other.

  • And that follows, because the Tj's are mutually independent.

  • So if a bunch of random variable are mutually independent,

  • then their exponentiations are mutually independent.

  • So this is by product rule for expectation

  • and mutual independence.

  • OK, so now we've got to evaluate the expected value

  • of c to the Tj.

  • And this is where we're going to make

  • it simpler by assuming that Tj is just a 0, 1 random variable.

  • So the simplification comes in here.

  • So the expected value of Tj-- well, there's two cases.

  • Tj is 1, or it's 0.

  • Because we made this simplification.

  • If it's 1, I get c to the 1-- ooh,

  • expected value of c to the Tj.

  • Let's get that right.

  • It could be 1, in which case I get a contribution of c

  • to the 1 times the probability Tj equals 1 plus the case at 0.

  • So I get c to the 0 times the probability Tj is 0.

  • Well, c to the 1 is just c.

  • c to the 0 is 1.

  • And I'm going to rewrite the probability Tj is

  • 0 as 1 minus the probability Tj is 1.

  • All right, this equals that.

  • And of course the 1 cancels.

  • Now I'm going to collect terms here

  • to get 1 plus c minus 1 times the probability Tj equals 1.

  • OK, then I'm going to do one more step here.

  • This is 1 plus c minus 1 times the expected value of Tj.

  • Because if I have an indicator random variable,

  • the expected value is the same as the probability that it's 1.

  • Because in the other case it's 0.

  • And now I'm going to use the trick from last time.

  • Remember 1 plus x is always at most e to the x from last time?

  • None of these steps is obvious why we're doing them.

  • But we're going to do them anyway.

  • So this is at most e to this, c minus 1 expected value of Tj.

  • Because 1 plus anything is at most the exponential of that.

  • And I'm doing this step because I got a product of these guys.

  • And I want to put them in the exponent

  • so I can then sum them so it gets easy.

  • OK, now we just plug this back in here.

  • So that means that the expected value of c to the T

  • is at most the product of the expected values of c

  • to the Tj, each at most e to the c minus 1 expected value of Tj.

  • And now I can convert this to a sum in the exponent.

  • And this is j equals 1 to n.

  • And what do I do to simplify that?

  • Linearity of expectation.

  • c minus 1 times the sum j equals 1 to n expected value of Tj.

  • Ooh, let's see, did I?

  • Actually, I used linearity coming out.

  • I already used linearity.

  • I screwed up here.

  • So here I used the linearity when I took the sum up here

  • inside the expectation.

  • I've already used linearity.

  • What is the sum of the Tj's?

  • T-- yeah, that's what I needed to do here.

  • OK, we're now almost done.

  • We've got now an upper bound on the expected value of c

  • to the T. And it is this.

  • And we just plug that in back up here.

  • So now this is at most e to the c minus 1 expected value of T

  • over c to the c times the expected value of T.

  • And now I just do manipulation.

  • c to something is the same as e to the log

  • of c times that something.

  • So this is e to the minus c ln c expected value of T plus that.

  • And then I'm running out of room.

  • That equals-- I can just pull out the expected values of T. I

  • get e to the minus c log of c plus c minus 1 expected

  • value of T. And that's e to the minus z expected value of T.

  • All right, so that's a marathon proof.

  • It's the worst proof I think.

  • Well, maybe minimum spanning tree was worse.

  • But this is one of the worst proofs we've seen this year.

  • But I wanted to show it to you.

  • Because it's one of the most important results

  • that we cover, certainly in probability,

  • that can be very useful in practice.

  • And it gives you some feel for, hey, this

  • wasn't so obvious to do it the first time,

  • and also some of the techniques that are used,

  • which is really Markov's theorem.

  • Any questions?

  • Yeah.

  • AUDIENCE: Over there, you define z as 1 minus c.

  • PROFESSOR: Did I do it wrong?

  • AUDIENCE: c natural log of c, 1 minus c.

  • Maybe it's plus c?

  • PROFESSOR: Oh, I've got to change the sign.

  • Because I pulled a negative out in front.

  • So it's got to be negative c minus 1,

  • which means negative c plus 1.

  • Yeah, good.

  • Yeah, this was OK.

  • I just made the mistake going to there.
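
For reference, the whole chain from the board, with the sign correction just discussed, can be restated in LaTeX notation as:

    \Pr[T \ge c\,\mathrm{E}[T]]
      = \Pr\bigl[c^{T} \ge c^{\,c\,\mathrm{E}[T]}\bigr]
      \le \frac{\mathrm{E}[c^{T}]}{c^{\,c\,\mathrm{E}[T]}}
      \le \frac{e^{(c-1)\,\mathrm{E}[T]}}{e^{\,c \ln c \,\mathrm{E}[T]}}
      = e^{-(c \ln c + 1 - c)\,\mathrm{E}[T]}
      = e^{-z\,\mathrm{E}[T]}.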

  • Any other questions?

  • OK, so the common theme here in using Markov to get Chebyshev,

  • to get Chernoff, to get the Markov extensions,

  • is always the same.

  • And let me show you what that theme is.

  • Because you can use it to get even other results.

  • When we're trying to figure out the probability that T

  • is at least c times its expected value, or even more generally

  • the probability that A is bigger than B,

  • well, that's

  • the same as the probability that f of A is bigger than f of B

  • as long as f doesn't reverse the inequality.

  • And then by Markov, this is at most the expected value of f of A,

  • as long as it's non-negative, over f of B.

  • In Chebyshev, what function f did we

  • use for Chebyshev in deriving Chebyshev's theorem?

  • What was f doing in Chebyshev?

  • Actually I probably just erased it.

  • What operation were we doing with Chebyshev?

  • AUDIENCE: Variance.

  • PROFESSOR: Variance.

  • And that meant we were squaring it.

  • So the technique used to prove Chebyshev

  • was f was the square function.

  • For Chernoff, f is the exponentiation function,

  • which turns out to be-- in fact, when we did it

  • for Chernoff, that's the optimal choice of functions

  • to get good bounds.

  • All right, any questions on that?

  • All right, let's do one more example here with numbers.

  • And this is a load balancing application

  • for example you might have with web servers.

  • Say you've got to build a load balancing device,

  • and it's got to balance N jobs, B1, B2,

  • to BN, across a set of M servers, S1, S2, to SM.

  • And say you're doing this for a decent sized website.

  • So maybe N is 100,000.

  • You get 100,000 requests a minute.

  • And say you've got 10 servers to handle those requests.

  • And say the time for the j-th request--

  • Bj is the j-th job--

  • is Lj.

  • And the time is the same on any server.

  • The servers are all equivalent.

  • And let's assume it's normalized so that Lj is between 0 and 1.

  • Maybe the worst job takes a second to do, let's say.

  • And say that if you sum up the length of all the jobs,

  • you get L. The total workload is the sum of the Lj,

  • for j equals 1 to N.

  • And we're going to assume that the average job length is 1/4

  • second.

  • So we're going to assume that the total amount of work

  • is 25,000 seconds, say.

  • So the average job length is 1/4 second.

  • And the job is to assign these tasks to the 10 servers so that

  • hopefully every server is doing L/M work,

  • which would be 25,000/10, or 2,500 seconds of work,

  • something like that.

  • I don't know.

  • Because when you're doing load balancing,

  • you want to take your load and spread it evenly and equally

  • among all the servers.

  • Any questions about the problem?

  • You've got a bunch of jobs, a bunch of servers.

  • You want to assign the jobs to the servers

  • to balance the load.

  • Well, what is the simplest algorithm

  • you could think of to do this?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: That's a good algorithm to do this.

  • In practice, the first thing people

  • do is, well, take the first N/M jobs, put them on server one,

  • the next N/M on server two.

  • Or they'll use something called round robin-- first job goes

  • here, second here, third here, 10th here, back and start over.

  • And they hope that it will balance the load.

  • But it might well not.

  • Because maybe every 10th job is a big one.

  • So what's much better to do in practice

  • is to assign them randomly.

  • So a job comes in.

  • You don't even pay attention to how hard

  • it is, how much time you think it'll take.

  • You might not even know before you start the job how long it's

  • going to take to complete.

  • Give it to a random server.

  • Don't even look at how much work that server has.

  • Just give it to a random one.

  • And it turns out this does very, very well.

  • Without knowing anything, just that simple approach

  • does great in practice.

  • And today, state of the art load balancers do this.

  • We've been doing randomized kinds of things like this

  • at Akamai now for a decade.

  • And it's just stunning how well it works.

  • And so let's see why that is.

  • Of course we're going to use the Chernoff bound to do it.

  • So let's let Rij be the load on server Si from job Bj.

  • Now, if Bj is not assigned to Si, it's zero load.

  • Because it's not even doing the work there.

  • So we know that Rij equals the load of Bj

  • if it's assigned to Si.

  • And that happens with probability 1/M.

  • The job picks one of the M servers at random.

  • And otherwise, the load is 0.

  • Because it's not assigned to that server.

  • And that is probability 1 minus 1/M.

  • Now let's look at how much load gets

  • assigned by this random algorithm to server i.

  • So we'll let Ri be the sum of all the load assigned

  • to server i.

  • So we've got this indicator where the random variables

  • are not 0, 1.

  • They're 0 and whatever this load happens to be

  • for the j-th job, at most 1.

  • And we sum up the value for the contribution

  • to Si over all the jobs.

  • So now we compute the expected value

  • of Ri, the expected load on the i-th server.

  • So the expected load on the i-th server

  • is-- well, we use linearity of expectation.

  • And the expected value of Rij-- well, 0 or Lj.

  • It's Lj with probability 1/M. This

  • is just now the sum of Lj over M. And the sum of Lj is just L.

  • So the expected load of the i-th server

  • is the total load divided by the number

  • of servers, which is perfect.

  • It's optimal-- can't do better than that.

  • It makes sense.

  • If you assign all the jobs randomly,

  • every server is expecting to get 1/M of the total load.

  • Now we want to know the probability

  • it deviates from that, that you have too

  • much load on the i-th server.

  • All right, so the probability that the i-th server has

  • c times the optimal load is at most, by Chernoff,

  • if the jobs are independent, e to the minus zL over M--

  • that is, minus z times the expected load-- where z is c

  • ln c plus 1 minus c.

  • This is Chernoff now, just straight

  • from the formula of Chernoff, as long as these loads are

  • mutually independent.

  • All right, so we know that when c gets to be-- I don't know,

  • you pick 10% above optimal, c equals 1.1,

  • well, we know that this is going to be a very small number.

  • L/M is 2,500.

  • And z, in this case, we found was 0.0048.

  • So we get e to the minus 0.0048 times 2,500.

  • And that is really tiny.

  • That's less than 1 in 160,000.

  • So Chernoff tells us the probability

  • that any server, a particular server,

  • gets 10% load more than you expect is minuscule.

  • Now, we're not quite done.

  • That tells us the probability the first server gets

  • 10% too much load or the probability the second server got 10% too

  • much load, and so forth.

  • But what we really care about is the worst server.

  • If all of them are good except for one,

  • you're still in trouble.

  • Because the one ruined your day.

  • Because it didn't get the work done.

  • So what do you do to bound the probability

  • that any of the servers got too much load, any of the 10?

  • So what I really want to know is the probability

  • that the worst server of M takes more than cL

  • over M. Well, that's the probability that the first one

  • has more than cL over M union the second one has more than cL

  • over M, union, and so on, up to the M-th one.

  • What do I do to get that probability, the probability

  • of a union of events, upper bounded?

  • AUDIENCE: [INAUDIBLE]

  • PROFESSOR: Upper bounded by the sum of the individual guys.

  • It's the sum i equals 1 to M probability Ri greater than

  • or equal to cL over M. And since each of these

  • is at most 1 in 160,000.

  • This is at most M/160,000.

  • And that is equal to 1 in 16,000.

  • All right, so now we have the answer.

  • The chance that any server got 10% or more above its expected load

  • is 1 in 16,000 at most, which is why randomized load balancing

  • is used a lot in practice.
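
A quick Python simulation of this randomized load balancing scheme; the uniform job-length distribution on [0, 0.5] (mean 1/4) is an assumption chosen to match the lecture's numbers:

    import random

    random.seed(0)
    N, M = 100_000, 10                        # jobs and servers, as in lecture
    loads = [0.0] * M
    total = 0.0
    for _ in range(N):
        length = random.uniform(0, 0.5)       # assumed shape, mean 1/4 second
        total += length
        loads[random.randrange(M)] += length  # assign to a random server

    optimal = total / M                       # L/M, about 2,500 seconds
    worst = max(loads)
    print(f"optimal per-server load: {optimal:.1f}")
    print(f"worst server's load:     {worst:.1f}")
    print(f"worst / optimal:         {worst / optimal:.4f}")
    # Chernoff says the worst server exceeds optimal by 10% or more with
    # probability at most 1/16,000; the ratio printed here stays near 1.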

  • Now tomorrow, you're going to do a real world example where

  • people use this kind of analysis,

  • and it led to utter disaster.

  • And the reason was that the components they were looking at

  • were not independent.

  • And the example has to do with the subprime mortgage disaster.

  • And I don't have time today to go through it all.

  • But it's in the text, and you'll see it tomorrow.

  • But basically what happened is that they

  • took a whole bunch of loans, subprime loans,

  • put them into these things called bonds,

  • and then did an analysis about how many failures

  • they'd expect to have.

  • And they assumed the loans were all mutually independent.

  • And they applied their Chernoff bounds.

  • And that concluded that the chances

  • of being off from the expectation were nil, like e

  • to the minus 380.

  • In reality, the loans were highly dependent.

  • When one failed, a lot tended to fail.

  • And that led to disaster.

  • And you'll go through some of the math

  • on that tomorrow in recitation.
