
  • This lecture is going to serve as an overview of what a probability distribution is and

  • what main characteristics it has.

  • Simply put, a distribution shows the possible values a variable can take and how frequently

  • they occur.

  • Before we start, let us introduce some important notation we will use for the remainder of

  • the course.

  • Assume that “upper-case Y” represents the actual outcome of an event and “lowercase

  • y” represents one of the possible outcomes.

  • One way to denote the likelihood of reaching a particular outcome “y” is “P of Y equals

  • y”.

  • We can also express it as “p of y”.

  • For example, uppercase “Y” could represent the number of red marbles we draw out of a

  • bag and lowercase “y” would be a specific number, like 3 or 5.

  • Then, we express the probability of getting exactly 5 red marbles as “P, of Y equals

  • 5”, or “p of 5”.

  • Since “p of y” expresses the probability for each distinct outcome, we call this the

  • probability function.

  • Good job, folks!

  • So, probability distributions, or simply probabilities, measure the likelihood of an outcome depending

  • on how often it features in the sample space.

  • Recall that we constructed the probability frequency distribution of an event in the

  • introductory section of the course.

  • We recorded the frequency for each unique value and divided it by the total number of

  • elements in the sample space.

  • Usually, that is the way we construct these probabilities when we have a finite number

  • of possible outcomes.

  • If we had an infinite number of possibilities, then recording the frequency for each one

  • becomes impossible, because there are infinitely many of them!

  • For instance, imagine you are a data scientist and want to analyse the time it takes for

  • your code to run.

  • Any single compilation could take anywhere from a few milliseconds to several days.

  • Often the result will be between a few milliseconds and a few minutes.

  • If we record time in seconds, we lose precision, which we want to avoid.

  • To do so we need to use the smallest possible measurement of time.

  • Since every milli-, micro-, or even nanosecond could be split in half for greater accuracy,

  • no such thing exists.

  • Less than an hour from now we will talk in more detail about continuous distributions

  • and how to deal with them.

  • Let’s introduce some key definitions.

  • Now, regardless of whether we have a finite or infinite number of possibilities, we define

  • distributions using only two characteristics: mean and variance.

  • Simply put, the mean of the distribution is its average value.

  • Variance, on the other hand, is essentially how spread out the data is.

  • We measure this “spread” by how far away from the mean all the values are.

  • We denote the mean of a distribution as the Greek letter “mu” and its variance as

  • sigma squared”.

  • Okay.

  • When analysing distributions, it is important to understand what kind of data we have - population

  • data or sample data.

  • Population data is the formal way of referring to “all” the data, while sample data is

  • just a part of it.

  • For example, if an employer surveys an entire department about how they travel to work,

  • the data would represent the population of the department.

  • However, this same data would also be just a sample of the employees in the whole company.

  • Something to remember when using sample data is that we adopt different notation for the

  • mean and variance.

  • We denote the sample mean as “x bar” and the sample variance as “s” squared.

  • One flaw of variance is that it is measured in squared units.

  • For example, if you are measuring time in seconds, the variance would be measured in

  • seconds squared.

  • Usually, there is no direct interpretation of that value.

  • To make more sense of variance, we introduce a third characteristic of the distribution,

  • called standard deviation.

  • Standard deviation is simply the positive square root of variance.

  • As you can suspect, we denote it as “sigma” when dealing with a population, and as “s”

  • when dealing with a sample.

  • Unlike variance, standard deviation is measured in the same units as the mean.

  • Thus, we can interpret it directly, which is why it is often preferable.

  • One idea which we will use a lot is that any value between “mu minus sigma” and “mu

  • plus sigma” falls within one standard deviation away from the mean.

  • The more congested the middle of the distribution, the more data falls within that interval.

  • Similarly, the less data falls within the interval, the more dispersed the data is.

  • Fantastic!

  • It is important to know there exists a constant relationship between mean and variance for

  • any distribution.

  • By definition, the variance equals the expected value of the squared difference from the mean

  • for any value.

  • We denote this as “sigma squared, equals the expected value of Y minus mu, squared”.

  • After some simplification, this is equal to the expected value of “Y squared” minus

  • “mu” squared.
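
  • As a minimal sketch of these definitions (assuming Python with NumPy, and using a made-up set of values), we can check the variance identity and the sample notation numerically:

```python
import numpy as np

# Made-up "population" data - every value we care about.
population = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])

mu = population.mean()              # the mean "mu"
sigma_squared = population.var()    # population variance "sigma squared"
sigma = population.std()            # standard deviation: positive square root of variance

# The identity above: sigma squared = E[Y^2] - mu^2.
identity = (population**2).mean() - mu**2

# For a sample we use "x bar" and "s squared" (ddof=1 gives the sample version).
sample = population[:4]
x_bar, s_squared = sample.mean(), sample.var(ddof=1)

print(sigma_squared, identity)   # identical by construction
print(x_bar, s_squared)
```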

  • As we will see in the coming lectures, if we are dealing with a specific distribution,

  • we can find a much more precise formula.

  • Okay, when we are getting acquainted with a certain dataset we want to analyse or make

  • predictions with, we are most interested in the mean, variance and type of the distribution.

  • In our next video we will introduce several distributions and the characteristics they

  • possess.

  • Thanks for watching!

  • 4.2 Types of Distributions

  • Hello, again!

  • In this lecture we are going to talk about various types of probability distributions

  • and what kind of events they can be used to describe.

  • Certain distributions share features, so we group them into types.

  • Some, like rolling a die or picking a card, have a finite number of outcomes.

  • They follow discrete distributions and we use the formulas we already introduced to

  • calculate their probabilities and expected values.

  • Others, like recording time and distance in track & field, have infinitely many outcomes.

  • They follow continuous distributions and we use different formulas from the ones we mentioned

  • so far.

  • Throughout the course of this video we are going to examine the characteristics of some

  • of the most common distributions.

  • For each one we will focus on an important aspect of it or when it is used.

  • Before we get into the specifics, you need to know the proper notation we implement when

  • defining distributions.

  • We start off by writing down the variable name for our set of values, followed by the

  • “tilde” sign.

  • This is followed by a capital letter depicting the type of the distribution and some characteristics

  • of the dataset in parentheses.

  • The characteristics are usually the mean and variance, but they may vary depending on the

  • type of the distribution.

  • Alright!

  • Let us start by talking about the discrete ones.

  • We will get an overview of them and then we will devote a separate lecture to each one.

  • So, we looked at problems relating to drawing cards from a deck or flipping a coin.

  • Both examples show events where all outcomes are equally likely.

  • Such outcomes are called equiprobable and these sorts of events follow a Uniform Distribution.

  • Then there are events with only two possible outcomes: true or false.

  • They follow a Bernoulli Distribution, regardless of whether one outcome is more likely to occur.

  • Any event with two outcomes can be transformed into a Bernoulli event.

  • We simply assign one of them to be “true” and the other one to be “false”.

  • Imagine we are required to elect a captain for our college sports team.

  • The team consists of 7 native students and 3 international students.

  • We assign the captain being domestic to be “true” and the captain being international

  • to be “false”.

  • Since the outcome can now only be “true” or “false”, we have a Bernoulli distribution.

  • Now, if we want to carry out a similar experiment several times in a row, we are dealing with

  • a Binomial Distribution.

  • Just like the Bernoulli Distribution, the outcomes for each iteration are two, but we

  • have many iterations.

  • For example, we could be flipping the coin we mentioned earlier 3 times and trying to

  • calculate the likelihood of getting heads twice.

  • Lastly, we should mention the Poisson Distribution.

  • We use it when we want to test out how unusual an event frequency is for a given interval.

  • For example, imagine we know that so far Lebron James averages 35 points per game during the

  • regular season.

  • We want to know how likely it is that he will score 12 points in the first quarter of his

  • next game.

  • Since the frequency changes, so should our expectations for the outcome.

  • Using the Poisson distribution, we are able to determine the chance of Lebron scoring

  • exactly 12 points for the adjusted time interval.

  • Great, now on to the continuous distributions!

  • One thing to remember is that since we are dealing with continuous outcomes, the probability

  • distribution would be a curve as opposed to unconnected individual bars.

  • The first one we will talk about is the Normal Distribution.

  • The outcomes of many events in nature closely resemble this distribution, hence the name

  • Normal”.

  • For instance, according to numerous reports throughout the last few decades, the weight

  • of an adult male polar bear is usually around 500 kilograms.

  • However, there have been records of individual specimens weighing anywhere between 350kg and

  • 700kg.

  • Extreme values, like 350 and 700, are called outliers and do not feature very frequently

  • in Normal Distributions.

  • Sometimes, we have limited data for events that resemble a Normal distribution.

  • In those cases, we observe the Student’s-T distribution.

  • It serves as a small sample approximation of a Normal distribution.

  • Another difference is that the Student’s-T accommodates extreme values significantly

  • better.

  • Graphically, that is represented by the curve having fatter “tails”.

  • Now imagine only looking at the recorded weights of the last 10 sightings across Alaska and Canada.

  • The lower number of elements would make the occurrence of any extreme value represent a much bigger part of the sample than it should.

  • Overall, this results in more values extremely far away from the mean, so the curve would more closely resemble a Student’s-T distribution than a Normal distribution.

  • Good job, everyone!

  • Another continuous distribution we would like to introduce is the Chi-Squared distribution.

  • It is the first asymmetric continuous distribution we are dealing with as it only consists of

  • non-negative values.

  • Graphically, that means that the Chi-Squared distribution always starts from 0 on the left.

  • Depending on the average and maximum values within the set, the curve of the Chi-Squared

  • graph is usually skewed to the right.

  • Unlike the previous two distributions, the Chi-Squared does not often mirror real life

  • events.

  • However, it is often used in Hypothesis Testing to help determine goodness of fit.

  • The next distribution on our list is the Exponential distribution.

  • The Exponential distribution is usually present when we are dealing with events that are rapidly

  • changing early on.

  • An easy-to-understand example is how online news articles generate hits.

  • They get most of their clicks when the topic is still fresh.

  • The more time passes, the more irrelevant it becomes as interest dies off.

  • The last continuous distribution we will mention is the Logistic distribution.

  • We often find it useful in forecast analysis when we try to determine a cut-off point for

  • a successful outcome.

  • For instance, take a competitive e-sport like Dota 2. We can use a Logistic distribution

  • to determine how much of an in-game advantage at the 10-minute mark is necessary to confidently

  • predict victory for either team.

  • Just like with other types of forecasting, our predictions would never reach true certainty

  • but more on that later!

  • Woah!

  • Good job, folks!

  • In the next video we are going to focus on discrete distributions.

  • We will introduce formulas for computing Expected Values and Standard Deviations before looking

  • into each distribution individually.

  • Thanks for watching!

  • 4.3 Discrete Distributions

  • Welcome back!

  • In this video we will talk about discrete distributions and their characteristics.

  • Let’s get started!

  • Earlier in the course we mentioned that events with discrete distributions have finitely

  • many distinct outcomes.

  • Therefore, we can express the entire probability distribution with either a table, a graph

  • or a formula.

  • To do so we need to ensure that every unique outcome has a probability assigned to it.

  • Imagine you are playing darts.

  • Each distinct outcome has some probability assigned to it based on how big its associated

  • interval is.

  • Since we have finitely many possible outcomes, we are dealing with a discrete distribution.

  • Great!

  • In probability, we are often more interested in the likelihood of an interval than of an

  • individual value.

  • With discrete distributions, we can simply add up the probabilities for all the values

  • that fall within that range.

  • Recall the example where we drew a card 20 times.

  • Suppose we want to know the probability of drawing 3 spades or fewer.

  • We would first calculate the probability of getting 0, 1, 2 or 3 spades and then add them

  • up to find the probability of drawing 3 spades or fewer.

  • One peculiarity of discrete events is that “the probability of Y being less than

  • or equal to y equals the probability of Y being less than y plus 1”.

  • In our last example, that would mean getting 3 spades or fewer is the same as getting fewer

  • than 4 spades.
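
  • A quick sketch of this interval logic (assuming SciPy is available, and assuming the 20 draws are made with replacement, so each has a 0.25 chance of being a spade):

```python
from scipy.stats import binom

# Assumed setup: 20 draws with replacement, P(spade) = 0.25 on each draw.
n, p = 20, 0.25

# P(Y <= 3): add up the probabilities of 0, 1, 2 and 3 spades...
three_or_fewer = sum(binom.pmf(k, n, p) for k in range(4))

# ...which is the same as P(Y < 4), i.e. the CDF evaluated at 3.
print(three_or_fewer, binom.cdf(3, n, p))
```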

  • Alright!

  • Now that you have an idea about discrete distributions, we can start exploring each type in more detail.

  • In the next video we are going to examine the Uniform Distribution.

  • Thanks for watching!

  • 4.4 Uniform Distribution

  • Hey, there!

  • In this lecture we are going to discuss the uniform distribution.

  • For starters, we use the letter U to define a uniform distribution, followed by the range

  • of the values in the dataset.

  • Therefore, we read the following statement asVariable “X” follows a discrete

  • uniform distribution ranging from 3 to 7”.

  • Events which follow the uniform distribution are ones where all outcomes have equal probability.

  • One such event is rolling a single standard six-sided die.

  • When we roll a standard 6-sided die, we have equal chance of getting any value from 1 to

  • 6.

  • The graph of the probability distribution would have 6 equally tall bars, all reaching

  • up to one sixth.

  • Many events in gambling provide such odds, where each individual outcome is equally likely.

  • Not only that, but many everyday situations follow the Uniform distribution.

  • If your friend offers you 3 identical chocolate bars, the probabilities assigned to you choosing

  • one of them also follow the Uniform distribution.

  • One big drawback of uniform distributions is that the expected value provides us no

  • relevant information.

  • Because all outcomes have the same probability, the expected value, which is 3.5, brings no

  • predictive power.

  • We can still apply the formulas from earlier and get a mean of 3.5 and a variance of 105

  • over 36.
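
  • For anyone who wants to verify these numbers, here is a small sketch (using only Python’s standard library) that reproduces the die’s mean of 3.5 and variance of 105 over 36:

```python
from fractions import Fraction

# The six equiprobable outcomes of a fair die, each with probability 1/6.
p = Fraction(1, 6)
outcomes = range(1, 7)

mean = sum(p * y for y in outcomes)                   # 7/2 = 3.5
variance = sum(p * y**2 for y in outcomes) - mean**2  # 35/12, i.e. 105/36

print(mean, float(mean))          # 7/2 3.5
print(variance, float(variance))  # 35/12 ~2.9167
```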

  • These values, however, are completely uninterpretable and there is no real intuition behind what

  • they mean.

  • The main takeaway is that when an event follows the Uniform distribution, each outcome

  • is equally likely.

  • Therefore, both the mean and the variance are uninterpretable and possess no predictive

  • power whatsoever.

  • Okay!

  • Sadly, the Uniform is not the only discrete distribution for which we cannot construct

  • useful prediction intervals.

  • In the next video we will introduce the Bernoulli Distribution.

  • Thanks for watching!

  • 4.5 Bernoulli Distribution

  • Hello again!

  • In this lecture we are going to discuss the Bernoulli distribution.

  • Before we begin, we use “Bern” to define a Bernoulli distribution, followed by the

  • probability of our preferred outcome in parentheses.

  • Therefore, we read the following statement as “Variable “X” follows a Bernoulli

  • distribution with a probability of success equal to “p””.

  • Okay!

  • We need to describe what types of events follow a Bernoulli distribution.

  • Any event where we have only 1 trial and two possible outcomes follows such a distribution.

  • These may include a coin flip, a single True or False quiz question, or deciding whether

  • to vote for the Democratic or Republican parties in the US elections.

  • Usually, when dealing with a Bernoulli Distribution, we either know the probability of each

  • outcome occurring, or we have past data indicating some experimental probability.

  • In either case, the graph of a Bernoulli distribution is simple.

  • It consists of 2 bars, one for each of the possible outcomes.

  • One bar would rise up to its associated probability of “p”, and the other one would only reach

  • “1 minus p”.

  • For Bernoulli Distributions we often have to assign which outcome is 0, and which outcome

  • is 1.

  • After doing so, we can calculate the expected value.

  • Keep in mind that depending on how we assign the 0 and the 1, our expected value will be

  • equal to either “p” or “1 minus p”.

  • We usually denote the higher probability with “p”, and the lower one with “1 minus

  • p”.

  • Furthermore, conventionally we also assign a value of 1 to the event with probability

  • equal to “p”.

  • That way, the expected value expresses the likelihood of the favoured event.

  • Since we only have 1 trial and a favoured event, we expect that outcome to occur.

  • By plugging in “p” and “1 minus p” into the variance formula, we get that the

  • variance of Bernoulli events would always equal “p, times 1 minus p”.

  • That is true, regardless of what the expected value is.

  • Here’s the first instance where we observe how elegant the characteristics of some distributions

  • are.

  • Once again, we can calculate the variance and standard deviation using the formulas

  • we defined earlier, but they bring us little value.

  • For example, consider flipping an unfair coin.

  • This coin is called “unfair” because its weight is spread disproportionately, and it

  • gets tails 60% of the time.

  • We assign the outcome of tails to be 1, and p to equal 0.6.

  • Therefore, the expected value would be “p”, or 0.6.

  • If we plug this result into the variance formula, we would get a variance of 0.6, times

  • 0.4, or 0.24.
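
  • A tiny sketch of this unfair-coin example (assuming SciPy is available) confirming the expected value of 0.6 and the variance of 0.24:

```python
from scipy.stats import bernoulli

p = 0.6             # tails, the outcome we assigned the value 1
coin = bernoulli(p)

print(coin.mean())  # expected value: p = 0.6
print(coin.var())   # variance: p * (1 - p) = 0.24
```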

  • Great job, everybody!

  • Sometimes, instead of wanting to know which of two outcomes is more probable, we want

  • to know how often it would occur over several trials.

  • In such cases, the outcomes follow a Binomial distribution and we will explore it further

  • in the next lecture.

  • 4.6 Binomial Distribution

  • Welcome back!

  • In the last video, we mentioned Binomial Distributions.

  • In essence, Binomial events are a sequence of identical Bernoulli events.

  • Before we get into the difference and similarities between these two distributions, let us examine

  • the proper notation for a Binomial Distribution.

  • We use the letter “B” to express a Binomial distribution, followed by the number of trials

  • and the probability of success in each one.

  • Therefore, we read the following statement as “Variable “X” follows a Binomial

  • distribution with 10 trials and a likelihood of success of 0.6 on each individual trial”.

  • Additionally, we can express a Bernoulli distribution as a Binomial distribution with a single trial.

  • Alright!

  • To better understand the differences between the two types of events, suppose the following

  • scenario.

  • You go to class and your professor gives the class a surprise pop-quiz, for which you have

  • not prepared.

  • Luckily for you, the quiz consists of 10 true or false problems.

  • In this case, guessing a single true or false question is a Bernoulli event, but guessing

  • the entire quiz is a Binomial Event.

  • Alright!

  • Let’s go back to the quiz example we just mentioned.

  • In it, the expected value of the Bernoulli distribution suggests which outcome we expect

  • for a single trial.

  • Now, the expected value of the Binomial distribution would suggest the number of times we expect

  • to get a specific outcome.

  • Great!

  • Now, the graph of the binomial distribution represents the likelihood of attaining our

  • desired outcome a specific number of times.

  • If we run n trials, our graph would consist of “n + 1”-many bars - one for each unique

  • value from 0 to n.

  • For instance, we could be flipping the same unfair coin we had from last lecture.

  • If we toss it twice, we need bars for the three different outcomes - zero, one or two

  • tails.

  • Fantastic!

  • If we wish to find the associated likelihood of getting a given outcome a precise number

  • of times over the course of n trials, we need to introduce the probability function of the

  • Binomial distribution.

  • For starters, each individual trial is a Bernoulli trial, so we express the probability of getting

  • our desired outcome as “p” and the likelihood of the other one as “1 minus p”.

  • In order to get our favoured outcome exactly y-many times over the n trials, we also need

  • to get the alternative outcome “n minus y”-many times.

  • If we don’t account for this, we would be estimating the likelihood of getting our desired

  • outcome at least y-many times.

  • Furthermore, there could exist more than one way to reach our desired outcome.

  • To account for this, we need to find the number of scenarios in which “y” out of the “n”-many

  • outcomes would be favourable.

  • But these are actually the “combinations” we already know!

  • For instance, if we wish to find out the number of ways in which 4 out of the 6 trials can

  • be successful, it is the same as picking 4 elements out of a sample space of 6.

  • Now you see why combinatorics are a fundamental part of probability!

  • Thus, we need to find the number of combinations in which “y” out of the “n” outcomes

  • would be favourable.

  • For instance, there are 3 different ways to get tails exactly twice in 3 coin flips.

  • Therefore, the probability function for a Binomial Distribution is the product of the

  • number of combinations of picking y-many elements out of n, times “p” to the power of y,

  • times “1 minus p” to the power of “n minus y”.

  • Great!

  • To see this in action, let us look at an example.

  • Imagine you bought a single stock of General Motors.

  • Historically, you know there is a 60% chance the price of your stock will go up on any

  • given day, and a 40% chance it will drop.

  • By the price going up, we mean that the closing price is higher than the opening price.

  • With the probability distribution function, you can calculate the likelihood of the stock

  • price increasing 3 times during the 5-work-day week.

  • If we wish to use the probability distribution formula, we need to plug in 3 for “y”,

  • 5 for “n” and 0.6 for “p”.

  • After plugging in we get: “number of different possible combinations of picking 3 elements

  • out of 5, times 0.6 to the power of 3, times 0.4 to the power of 2”.

  • This is equivalent to 10, times 0.216, times 0.16, or 0.3456.

  • Thus, we have a 34.56% chance of getting exactly 3 increases over the course of a work week.

  • The big advantage of recognizing the distribution is that you can simply use these formulas

  • and plug in the information you already have!
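
  • As an illustration (assuming SciPy is available), here is the stock example computed both by hand and straight from the ready-made distribution:

```python
from math import comb
from scipy.stats import binom

n, p, y = 5, 0.6, 3

# C(5, 3) * 0.6^3 * 0.4^2, exactly as in the formula above.
by_hand = comb(n, y) * p**y * (1 - p)**(n - y)

print(by_hand)             # 0.3456
print(binom.pmf(y, n, p))  # same value from the distribution
```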

  • Alright!

  • Now that we know the probability function, we can move on to the expected value.

  • By definition, the expected value equals the sum of all values in the sample space, multiplied

  • by their respective probabilities.

  • The expected value formula for a Binomial event equals the probability of success for

  • a given value, multiplied by the number of trials we carry out.

  • This seems familiar, because this is the exact formula we used when computing the expected

  • values for categorical variables in the beginning of the course.

  • After computing the expected value, we can finally calculate the variance.

  • We do so by applying the short formula we learned earlier:

  • “Variance of Y equals the expected value of Y squared, minus the expected value of Y,

  • squared.”

  • After some simplifications, this results in “n, times p, times 1 minus p”.

  • If we plug in the values from our stock market example, that gives us a variance of 5, times

  • 0.6, times 0.4, or 1.2.

  • This would give us a standard deviation of approximately 1.1.

  • Knowing the expected value and the standard deviation allows us to make more accurate

  • future forecasts.
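
  • And a quick check of those moments (again assuming SciPy), using the same n of 5 and p of 0.6:

```python
from scipy.stats import binom

dist = binom(n=5, p=0.6)

print(dist.mean())  # n * p = 3.0 expected "up" days per week
print(dist.var())   # n * p * (1 - p) = 1.2
print(dist.std())   # ~1.095, the "approximately 1.1" above
```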

  • Fantastic!

  • In the next video we are going to discuss Poisson Distributions.

  • Thanks for watching!

  • 4.7 Poisson Distribution

  • Hello again!

  • In this lecture we are going to discuss the Poisson Distribution and its main characteristics.

  • For starters, we denote a Poisson distribution with the letters “Po” and a single value

  • parameter - lambda.

  • We read the statement below as “Variable “Y” follows a Poisson distribution with

  • lambda equal to 4”.

  • Okay!

  • The Poisson Distribution deals with the frequency with which an event occurs in a specific interval.

  • Instead of the probability of an event, the Poisson Distribution requires knowing how

  • often it occurs for a specific period of time or distance.

  • For example, a firefly might light up 3 times in 10 seconds on average.

  • We would use a Poisson Distribution if we want to determine the likelihood of it lighting

  • up 8 times in 20 seconds.

  • The graph of the Poisson distribution plots the number of instances the event occurs in

  • a standard interval of time and the probability for each one.

  • Thus, our graph would always start from 0, since no event can happen a negative amount

  • of times.

  • However, there is no cap to the amount of times it could occur over the time interval.

  • Okay, let us explore an example.

  • Imagine you created an online course on probability.

  • Usually, your students ask you around 4 questions per day, but yesterday they asked 7.

  • Surprised by this sudden spike in interest from your students, you wonder how likely

  • it was that they asked exactly 7 questions.

  • In this example, the average number of questions you anticipate is 4, so lambda equals 4.

  • The time interval is one entire work day and the singular instance you are interested in

  • is 7.

  • Therefore, “y” is 7.

  • To answer this question, we need to explore the probability function for this type of

  • distributions.

  • Alright!

  • As you already saw, the Poisson Distribution is wildly different from any other we have

  • gone over so far.

  • It comes as no surprise, then, that its probability function is also quite different from anything

  • we have examined.

  • The formula looks the following way: “p of y, equals, lambda to the power of

  • y, over y factorial, times Euler’s number to the power of negative lambda”.

  • Before we plug in the values from our course-creation example, we need to make sure you understand

  • the entire formula.

  • Let’s refresh your knowledge of the various parts of this formula.

  • First, the “e” you see on your screens is known as Euler’s number or Napier’s

  • constant.

  • As the second name suggests, it is a fixed value approximately equal to 2.72.

  • We commonly observe it in physics, mathematics and nature, but for the purposes of this example

  • you only need to know its value.

  • Secondly, a number to the power of “negative n” is the same as dividing 1 by that number

  • to the power of n.

  • In this case, “e to the power of negative lambda” is just “1 over, e to the power

  • of lambda”.

  • Right!

  • Going back to our example, the probability of receiving 7 questions is equal to “4,

  • raised to the 7th power, over 7 factorial, multiplied by “e” raised to the power of

  • negative 4”.

  • That approximately equals 16384 over 5040, times 0.0183, or 0.06.

  • Therefore, there was only a 6% chance of receiving exactly 7 questions.
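
  • A short sketch (assuming SciPy is available) confirming the 6% figure both from the formula and from the ready-made distribution:

```python
from math import exp, factorial
from scipy.stats import poisson

lam, y = 4, 7

# p(y) = lambda^y / y! * e^(-lambda)
by_hand = lam**y / factorial(y) * exp(-lam)

print(by_hand)              # ~0.0595, i.e. about 6%
print(poisson.pmf(y, lam))  # same value from the distribution
```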

  • So far so good!

  • Knowing the probability function, we can calculate the expected value.

  • By definition, the expected value of Y, equals the sum of all the products of a distinct

  • value in the sample space and its probability.

  • By plugging in, we get this complicated expression.

  • In the additional materials attached to this lecture, you can see all the complicated algebra

  • required to simplify this.

  • Eventually, we get that the expected value is simply lambda.

  • Similarly, by applying the formulas we already know, the variance also ends up being equal

  • to lambda.

  • Both the mean and variance being equal to lambda serves as yet another example of the

  • elegant statistics these distributions possess and why we can take advantage of them.

  • Great job, everyone!

  • Now, if we wish to compute the probability of an interval of a Poisson distribution,

  • we take the same steps we usually do for discrete distributions.

  • We sum the probabilities of all the individual elements within it.

  • You will have a chance to practice this in the exercises after this lecture.

  • So far, we have discussed Uniform, Bernoulli, Binomial and Poisson distributions, which

  • are all discrete.

  • In the next video we will focus on continuous distributions and see how they differ.

  • Thanks for watching!

  • 4.8 Continuous Distributions

  • Hello again!

  • When we started this section of the course, we mentioned how some events have infinitely

  • many consecutive outcomes.

  • We call such distributions continuous and they vastly differ from discrete ones.

  • For starters, their sample space is infinite.

  • Therefore, we cannot record the frequency of each distinct value.

  • Thus, we can no longer represent these distributions with a table.

  • What we can do is represent them with a graph.

  • More precisely, the graph of the probability density function, or PDF for short.

  • We denote it as “f of y”, where “y” is an element of the sample space.

  • As the name suggests, the function depicts the associated probability for every possible

  • value “y”.

  • Since it expresses probability, the value it associates with any element of the sample

  • space would be greater than or equal to zero.

  • Great!

  • The graphs for continuous distributions slightly resemble the ones for discrete distributions.

  • However, there are more elements in the sample space, so there are more bars on the graph.

  • Furthermore, the more bars - the narrower each one must be.

  • This results in a smooth curve that goes along the top of these bars.

  • We call this the probability distribution curve, since it shows the likelihood of each

  • outcome.

  • Now on to some further differences between Discrete and Continuous.

  • Imagine we used the “favoured over all” formula to calculate probabilities for such

  • variables.

  • Since the sample space is infinite, the likelihood of each individual outcome would be extremely

  • small.

  • Algebra dictates that, assuming the numerator stays constant, the greater the denominator

  • becomes, the closer the fraction is to 0.

  • For reference, one third is closer to 0 than a half, and a quarter is closer to 0 than

  • either of them.

  • Since the denominator of the “favoured over all” formula would be so big, it is commonly

  • accepted that such probabilities are extremely insignificant.

  • In fact, we assume their likelihood of occurring to be essentially 0.

  • Thus, it is accepted that the probability of any individual value from a continuous

  • distribution is equal to 0.

  • This assumption is crucial in understanding why “the likelihood of an event being strictly

  • greater than X, is equal to the likelihood of the event being greater than or equal to

  • X” for some value X within the sample space.

  • For example, the probability of a college student running a mile in under 6 minutes

  • is the same as them running it for at most 6 minutes.

  • That is because we consider the likelihood of finishing in exactly 6 minutes to be 0.

  • That wasn’t too complicated, right?

  • So far, we have been using the term “probability function” to refer to the Probability Density

  • Function of a distribution.

  • All the graphs we explored for discrete distributions were depicting their PDFs.

  • Now, we need to introduce the Cumulative Distribution Function, or CDF for short.

  • Since it is cumulative, this function encompasses everything up to a certain value.

  • We denote the CDF as capital F of y for any continuous random variable Y.

  • As the name suggests, it represents the probability of the random variable being lower than or

  • equal to a specific value.

  • Since no value could be lower than or equal to negative infinity, the CDF value for negative

  • infinity would equal 0.

  • Similarly, since any value would be lower than plus infinity, we would get a 1 if we

  • plug plus infinity into the distribution function.

  • Discrete distributions also have CDFs, but they are far less frequently used.

  • That is because we can always add up the PDF values associated with the individual outcomes

  • we are interested in.

  • Good job, folks!

  • The CDF is especially useful when we want to estimate the probability of some interval.

  • Graphically, the area under the density curve would represent the chance of getting a value

  • within that interval.

  • We find this area by computing the integral of the density curve over the interval from

  • “a” to “b”.

  • For those of you who do not know how to calculate integrals, you can use some free online software

  • like “Wolfram Alpha dot com”.

  • If you understand probability correctly, determining and calculating these integrals should feel

  • very intuitive.

  • Alright!

  • Notice how the cumulative probability is simply the probability of the interval from negative

  • infinity to ‘y’.

  • For those that know calculus, this suggests that the CDF for a specific value “y”

  • is equal to the integral of the density function over the interval from minus infinity to “y”.

  • This gives us a way to obtain the CDF from the PDF.

  • The opposite of integration is differentiation, so to attain a PDF from a CDF, we would have

  • to find its first derivative.

  • In more technical terms, the PDF for any element of the sample space ‘y’, equals the first

  • derivative of the CDF with respect to ‘y’.
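
  • Here is a minimal sketch of that relationship (assuming SciPy, and using the standard Normal purely as an example density): integrating the PDF over an interval matches the difference of the CDF values at its endpoints:

```python
from scipy.integrate import quad
from scipy.stats import norm

a, b = -1.0, 1.0

# Probability of the interval [a, b]: the area under the PDF curve.
area, _ = quad(norm.pdf, a, b)

# The same probability as a difference of CDF values: F(b) - F(a).
print(area, norm.cdf(b) - norm.cdf(a))  # both ~0.6827
```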

  • Okay!

  • Oftentimes, when dealing with continuous variables, we are only given their probability

  • density functions.

  • To understand what its graph looks like, we should be able to compute the expected value

  • and variance for any PDF.

  • Let’s start with expected values!

  • The probability of each individual element “y” is 0.

  • Therefore, we cannot apply the summation formula we used for discrete outcomes.

  • When dealing with continuous distributions, the expected value is an integral.

  • More specifically, it is an integral of the product of any element “y” and its associated

  • PDF value, over the interval from negative infinity to positive infinity.

  • Right!

  • Now, let us quickly discuss the variance.

  • Luckily for us, we can still apply the same variance formula we used earlier for discrete

  • distributions.

  • Namely, the variance is equal to the expected value of the squared variable, minus the expected

  • value of the variable, squared.
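
  • A small sketch of these integrals (assuming SciPy, and using a standard Exponential PDF purely as an example density):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import expon

pdf = expon.pdf  # example density, non-zero only for y >= 0

# E[Y]: integral of y * f(y) over the whole sample space.
ey, _ = quad(lambda y: y * pdf(y), 0, np.inf)

# E[Y^2], needed for the variance formula.
ey2, _ = quad(lambda y: y**2 * pdf(y), 0, np.inf)

variance = ey2 - ey**2  # expected value of Y squared, minus expected value of Y, squared
print(ey, variance)     # 1.0 and 1.0 for this particular PDF
```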

  • Marvellous work!

  • We now know the main characteristics of any continuous distribution, so we can begin exploring

  • specific types.

  • In the next lecture we will introduce the Normal Distribution and its main features.

  • Thanks for watching!

  • 4.9 Normal Distribution

  • Welcome back!

  • In this lecture we are going to introduce one of the most commonly found continuous

  • distributions - the normal distribution.

  • For starters, we define a Normal Distribution using a capital letter N followed by the mean

  • and variance of the distribution.

  • We read the following notation as “Variable “X” follows a Normal Distribution with

  • mean “mu” and variance “sigma squared”.

  • When dealing with actual data we would usually know the numerical values of mu and sigma

  • squared.

  • The normal distribution frequently appears in nature, as well as in life, in various

  • shapes and forms.

  • For example, the size of a full-grown male lion follows a normal distribution.

  • Many records suggest that the average lion weighs between 150 and 250 kilograms, or 330

  • to 550 pounds.

  • Of course, there exist specimens which fall outside of this range.

  • Lions weighing less than 150, or more than 250 kilograms tend to be the exception rather

  • than the rule.

  • Such individuals serve as outliers in our set, and the more data we gather, the smaller

  • the part of the data they represent.

  • Now that you know what types of events follow a Normal distribution, let us examine some

  • of its distinct characteristics.

  • For starters, the graph of a Normal Distribution is bell-shaped.

  • Therefore, the majority of the data is centred around the mean.

  • Thus, values further away from the mean are less likely to occur.

  • Furthermore, we can see that the graph is symmetric with regards to the mean.

  • That suggests values equally far away in opposing directions would still be equally likely.

  • Let’s go back to the lion example from earlier.

  • If the mean is 400, symmetry suggests a lion is equally likely to weigh 350 pounds and

  • 450 pounds, since both are 50 pounds away from the mean.

  • Alright!

  • For anybody interested, you can find the CDF and the PDF of the Normal distribution in

  • the additional materials for this lecture.

  • Instead of going through the complex algebraic simplifications in this lecture, we are simply

  • going to talk about the expected value and the variance.

  • The expected value for a Normal distribution equals its mean - “mu”, whereas its variance,

  • “sigma squared”, is usually given when we define the distribution.

  • However, if it isn’t, we can deduce it from the expected value.

  • To do so we must apply the formula we showed earlier: “The variance of a variable is

  • equal to the expected value of the squared variable, minus the squared expected value

  • of the variable”.

  • Good job!

  • Another peculiarity of the Normal Distribution is the “68, 95, 99.7” law.

  • This law suggests that for any normally distributed event, 68% of all outcomes fall within 1 standard

  • deviation away from the mean, 95% fall within two standard deviations, and 99.7% - within

  • 3.

  • The last part really emphasises the fact that outliers are extremely rare in Normal distributions.

  • It also shows how much we can learn about a dataset just from knowing that it is

  • normally distributed!
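
  • A quick numerical check of the “68, 95, 99.7” law (assuming SciPy is available), using the CDF of a standard Normal:

```python
from scipy.stats import norm

# Share of a Normal distribution within 1, 2 and 3 standard deviations of the mean.
for k in (1, 2, 3):
    share = norm.cdf(k) - norm.cdf(-k)
    print(k, round(share * 100, 2))  # ~68.27, ~95.45, ~99.73 percent
```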

  • Fantastic work, everyone!

  • Before we move on to other types of distributions, you need to know that we can use this table

  • to analyse any Normal Distribution.

  • To do this we need to standardize the distribution, which we will explain in detail in the next

  • video.

  • Thanks for watching!

  • 4.9.1 Standardizing a Normal Distribution

  • Welcome back, everybody!

  • Towards the end of the last lecture we mentioned standardizing without explaining what it is

  • and why we use it.

  • Before we understand this concept, we need to explain what a transformation is.

  • So, a transformation is a way in which we can alter every element of a distribution

  • to get a new distribution with similar characteristics.

  • For Normal Distributions we can use addition, subtraction, multiplication and division without

  • changing the type of the distribution.

  • For instance, if we add a constant to every element of a Normal distribution, the new

  • distribution would still be Normal.

  • Let’s discuss the four algebraic operations and see how each one affects the graph.

  • If we add a constant, like 3, to the entire distribution, then we simply need to move

  • the graph 3 places to the right.

  • Similarly, if we subtract a number from every element, we would simply move our current

  • graph to the left to get the new one.

  • If we multiply every element by a constant, the graph will widen that many times, and if we

  • divide every element by a number, the graph will shrink.

  • However, if we multiply or divide by a number between 0 and 1, the opposing effects will

  • occur.

  • For example, dividing by a half, is the same as multiplying by 2, so the graph would expand,

  • even though we are dividing.

  • Alright!

  • Now that you know what a transformation is, we can explain standardizing.

  • Standardizing is a special kind of transformation in which we make the expected value equal

  • to 0 and the variance equal to 1.

  • The benefit of doing so is that we can then use the cumulative distribution table from

  • last lecture on any element in the set.

  • The distribution we get after standardizing any Normal distribution, is called a “Standard

  • Normal Distribution”.

  • In addition to the “68, 95, 99.7” rule, there exists a table which summarizes the

  • most commonly used values for the CDF of a Standard Normal Distribution.

  • This table is known as the Standard Normal Distribution table or the “Z”-score table.

  • Okay!

  • So far, we learned what standardizing is and why it is convenient.

  • What we haven’t talked about is how to do it.

  • First, we wish to move the graph either to the left, or to the right until its mean equals

  • 0.

  • The way we would do that is by subtracting the mean “mu” from every element.

  • After this, to make the standardization complete, we need to make sure the standard deviation

  • is 1.

  • To do so, we would have to divide every element of the newly obtained distribution by the

  • value of the standard deviation, sigma.

  • If we denote the Standard Normal Distribution with Z, then for any normally distributed

  • variable Y, “Z equals Y minus mu, over sigma”.

  • This equation expresses the transformation we use when standardizing.
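
  • As a minimal sketch (assuming NumPy, and using made-up normally distributed data with a mean of 400 and a standard deviation of 50, like the lion weights), standardizing does exactly what we described:

```python
import numpy as np

# Made-up normally distributed data: mu = 400, sigma = 50.
rng = np.random.default_rng(42)
y = rng.normal(loc=400, scale=50, size=100_000)

# Standardize: Z equals Y minus mu, over sigma.
z = (y - 400) / 50

print(round(z.mean(), 3), round(z.std(), 3))  # ~0.0 and ~1.0
```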

  • Amazing!

  • Applying this single transformation for any Normal Distribution would result in a Standard

  • Normal Distribution, which is convenient.

  • Essentially, every element of the non-standardized distribution is represented in the new distribution

  • by the number of standard deviations it is away from the mean.

  • For instance, if some value y is 2.3 standard deviations away from the mean, its equivalent

  • value “Z” would be equal to 2.3.

  • Standardizing is incredibly useful when we have a Normal Distribution; however, we cannot

  • always anticipate that the data is spread out that way.

  • A crucial fact to remember about the Normal distribution is that it requires a lot of

  • data.

  • If our sample is limited, we run the risk of outliers drastically affecting our analysis.

  • In cases where we have less than 30 entries, we usually avoid assuming a Normal distribution.

  • However, there exists a small sample size approximation of a Normal distribution called

  • the Student’s T distribution and we are going to focus on it in our next lecture.

  • Thanks for watching.

  • 4.10 Student’s T Distribution

  • Hello, folks!

  • In this lesson we are going to talk about the Student’s T distribution and its characteristics.

  • Before we begin, we use the lower-case letter “t” to define a Student’s T distribution,

  • followed by a single parameter in parentheses, called “degrees of freedom”.

  • We read this next statement as “Variable “Y” follows a Student’s T distribution

  • with 3 degrees of freedom”.

  • As we mentioned in the last video, it is a small sample size approximation of a Normal

  • Distribution.

  • In instances where we would assume a Normal distribution were it not for the limited number

  • of observations, we use the Student’s T distribution.

  • For instance, the average lap times for the entire season of a Formula 1 race follow a

  • Normal Distribution, but the lap times for the first lap of the Monaco Grand Prix would

  • follow a Student’s T distribution.

  • Now, the curve of the Student’s T distribution is also bell-shaped and symmetric.

  • However, it has fatter tails to accommodate the occurrence of values far away from the

  • mean.

  • That is because if such a value features in our limited data, it would be representing

  • a bigger part of the total.

  • Another key difference between the Student’s T Distribution and the Normal one is that

  • apart from the mean and variance, we must also define the degrees of freedom for the

  • distribution.

  • Great job!

  • As long as we have at least 2 degrees of freedom, the expected value of a t-distribution is

  • the mean “mu”.

  • Furthermore, the variance of the distribution equals the variance of the sample, times the

  • number of degrees of freedom, over the degrees of freedom minus two.
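
  • A tiny check of these characteristics (assuming SciPy, and using the standardized t with 3 degrees of freedom, for which the formula reduces to k over k minus 2):

```python
from scipy.stats import t

k = 3               # degrees of freedom
dist = t(df=k)

print(dist.mean())  # 0 for the standardized t (defined for k > 1)
print(dist.var())   # k / (k - 2) = 3.0 (defined for k > 2)
```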

  • Overall, the Student’s T distribution is frequently used when conducting statistical

  • analysis.

  • It plays a major role when we want to do hypothesis testing with limited data, since we also have

  • a table summarizing the most important values of its CDF.

  • Great!

  • Another distribution that is commonly used in statistical analysis is the Chi-squared

  • Distribution.

  • In the next video we will explore when we use it and what other distributions it is

  • related to.

  • Thanks for watching!

  • 4.11 Chi-squared Distribution

  • Welcome back, folks!

  • This is going to be a short lecture where we introduce to you the Chi-squared Distribution.

  • For starters, we denote a Chi-Squared distribution with the capital Greek letter

  • Chi, squared, followed by a parameter “k” depicting the degrees of freedom.

  • Therefore, we read the following as “Variable “Y” follows a Chi-Square distribution

  • with 3 degrees of freedom”.

  • Alright!

  • Let’s get started!

  • Very few events in real life follow such a distribution.

  • In fact, Chi-Squared is mostly featured in statistical analysis when doing hypothesis

  • testing and computing confidence intervals.

  • In particular, we most commonly find it when determining the goodness of fit of categorical

  • values.

  • That is why any example we can give you would feel extremely convoluted to anyone not familiar

  • with statistics.

  • Alright!

  • Now, let’s explore the graph of the Chi-Squared distribution.

  • Just by looking at it, you can tell the distribution is not symmetric, but rather asymmetric.

  • Its graph is highly skewed to the right.

  • Furthermore, the values depicted on the X-axis start from 0, rather than some negative number.

  • This, by the way, shows you yet another transformation.

  • Elevating the Student’s T distribution to the second power gives us the Chi-squared

  • and vice versa: finding the square root of the Chi-squared distribution gives us the

  • Student’s T.

  • Great!

  • So, a convenient feature of the Chi-Squared distribution is that it also has a table

  • of known values, just like the Normal or Student’s T distributions.

  • The expected value for any Chi-squared distribution is equal to its associated degrees of freedom,

  • k.

  • Its variance is equal to two times the degrees of freedom, or simply 2 times k.
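
  • A minimal sketch (assuming SciPy is available) confirming both characteristics for 3 degrees of freedom:

```python
from scipy.stats import chi2

k = 3               # degrees of freedom
dist = chi2(df=k)

print(dist.mean())  # equals k: 3.0
print(dist.var())   # equals 2 * k: 6.0
```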

  • To learn more about Hypothesis Testing and Confidence Intervals you can continue with

  • our program, where we dive into those.

  • For now, you know all you need to about the Chi-Squared Distribution.

  • Thanks for watching!

  • 4.12 Exponential Distribution

  • Hello again!

  • In this lecture, we are going to discuss the Exponential distribution and its main characteristics.

  • For starters, we define the exponential distribution with the abbreviation “Exp” followed by

  • a rate parameter - lambda.

  • We read the following statement as “Variable “X” follows an exponential distribution

  • with a rate of a half”.

  • Alright!

  • Variables which most closely follow an exponential distribution, are ones with a probability

  • that initially decreases, before eventually plateauing.

  • One such example is the aggregate number of views for a YouTube vlog video.

  • There is great interest upon release, so it starts off with many views in the first day

  • or two.

  • After most subscribers have had the chance to see the video, the view-counter slows down.

  • Even though the aggregate amount of views keeps increasing, the number of new ones diminishes

  • daily.

  • As time goes on, the video either becomes outdated or the author produces new content,

  • so viewership focus shifts away.

  • Therefore, it is most likely for a random viewing to have occurred close to the video’s

  • initial release, than in any of the following periods.

  • Graphically, the PDF of such a distribution would start off very high and sharply decrease within

  • the first few time frames.

  • The curve somewhat resembles a boomerang with each handle lining up with the X and the Y

  • axes.

  • Alright!

  • We know what the PDF would look like, but what about the CDF?

  • In a weird way, the CDF would also resemble a boomerang.

  • However, this one is shifted 90 degrees to the right.

  • As you know the cumulative distribution eventually approaches 1, so that would be the value where

  • it plateaus.

  • To define an exponential distribution, we require a rate parameter denoted by the Greek

  • letter “lambda”.

  • This parameter determines how fast the PDF curve reaches the point of plateauing and

  • how spread out the graph is.

  • Alright!

  • Let’s talk about the expected value and the variance.

  • The expected value for an exponential distribution is equal to 1 over the rate parameter lambda,

  • whilst the variance is 1 over lambda squared.
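
  • A small sketch (assuming SciPy, which parameterizes the Exponential by scale, i.e. 1 over lambda) confirming both formulas for a rate of 0.5:

```python
from scipy.stats import expon

lam = 0.5                   # the rate parameter lambda

# SciPy parameterizes the Exponential by scale = 1 / lambda.
dist = expon(scale=1 / lam)

print(dist.mean())          # 1 / lambda = 2.0
print(dist.var())           # 1 / lambda^2 = 4.0
```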

  • In data analysis, we end up using exponential distributions quite often.

  • However, unlike the normal or chi-squared distributions, we do not have a table of known

  • values for it.

  • That is why sometimes we prefer to transform it.

  • Generally, we can take the natural logarithm of every element of an exponentially distributed

  • set and get a normal distribution.

  • In statistics we can use this new transformed data to run, say, linear regressions.

  • This is one of the most common transformations I’ve had to perform.

  • Before we move on, we need to introduce an extremely important type of distribution that

  • is often used in mathematical modelling.

  • We are going to focus on the logistic distribution and its main characteristics in the next video!

  • Thanks for watching!

  • 4.13 Logistic Distribution

  • Welcome back!

  • In this lecture, we are going to focus on the continuous logistic probability distribution.

  • We denote a Logistic Distribution with the entire word “Logistic” followed by two

  • parameters: its mean and a scale parameter, like the one for the Exponential distribution.

  • We also refer to the mean parameter as the “location” and we shall use the terms

  • interchangeably for the remainder of the video.

  • Thus, we read the statement below as “Variable “Y” follows a Logistic distribution with

  • location 6 and a scale of 3”.

  • Alright!

  • We often encounter logistic distributions when trying to determine how continuous variable

  • inputs can affect the probability of a binary outcome.

  • This approach is commonly found in forecasting competitive sports events, where there exist

  • only two clear outcomes - victory or defeat.

  • For instance, we can analyse whether the average speed of a tennis player’s serve plays a

  • crucial role in the outcome of the match.

  • Expectation dictates that sending the ball with higher velocity leaves opponents with

  • a shorter period to respond.

  • This usually results in a better hit, which could lead to a point for the server.

  • To reach the highest speeds, tennis players often give up some control over the shot, so

  • their serves are less accurate.

  • Therefore, we cannot assume that there is a linear relationship between point conversion

  • and serve speeds.

  • Theory suggests there exists some optimal speed, which enables the serve to still be

  • accurate enough.

  • Then, most of the shots we convert into points will likely have similar velocities.

  • As tennis players go further away from the optimal speed, their shots either become too

  • slow and easy to handle, or too inaccurate.

  • This suggests that the graph of the PDF of the Logistic Distribution would look similar

  • to that of the Normal Distribution.

  • Actually, the graph of the Logistic Distribution is defined by two key features: its mean

  • and its scale parameter.

  • The former dictates the centre of the graph, whilst the latter shows how spread out the

  • graph is going to be.

  • Going back to the tennis example, the mean would represent the optimal speed, whilst

  • the scale would dictate how lenient we can be with the hit.

  • To elaborate, some tennis players can hit a great serve further away from their optimal

  • speed than others.

  • For instance, Serena Williams can hit fantastic serves even if the ball moves much faster

  • or slower than it optimally should.

  • Therefore, she is going to have a more spread out PDF, than some of her opponents.

  • Fantastic!

  • Now, let’s discuss the Cumulative Distribution Function.

  • It should be a curve that starts off slow, then picks up rather quickly before plateauing

  • around the 1 mark.

  • That is because once we reach values near the mean, the probability of converting the

  • point drastically goes up.

  • Once again, the scale would dictate the shape of the graph.

  • In this case, the smaller the scale, the later the graph starts to pick up, but the quicker

  • it reaches values close to 1.

  • Okay!

  • You can use expected values to estimate the variance of the distribution.

  • To avoid confusing mathematical expressions, you only need to know it is equal to the square

  • of the scale, times “pi” squared, over 3.
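
  • A final sketch (assuming SciPy is available) checking that variance formula for the location of 6 and scale of 3 from our notation example:

```python
from math import pi
from scipy.stats import logistic

location, scale = 6, 3
dist = logistic(loc=location, scale=scale)

print(dist.mean())                       # the location: 6.0
print(dist.var(), scale**2 * pi**2 / 3)  # both ~29.61
```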

  • Great job, everybody!

  • Now that you know all these various types of distributions, we can explore how probability

  • features in other fields.

  • In the next section of the course we are going to focus on statistics, data science and other

  • related fields which integrate probability.

  • Thanks for watching!
