  • In this video, we'll introduce terms and notation that we'll use throughout this course.

  • Let's start with variable types.

  • We'll compare and contrast two pairs of variable types.

  • Here's the first.

  • The first pair is response variable versus explanatory variable.

  • The analyst is primarily interested in the response variable.

  • We want to know if, and how, we can understand the response variable better using other variables.

  • In the Commute and Chris setup, questions 1 and 2 have commute as the response variable.

  • Chris hopes to understand how commute is affected by other variables.

  • The response variable goes by other names, like output variable and dependent variable.

  • In contrast, an explanatory variable is any variable used to study the response variable.

  • The goal here is to find potential relationships between the response variable and an explanatory variable.

  • In the Commute and Chris setup, question 1 uses departure as an explanatory variable to analyze commute.

  • We also call explanatory variables input variables or independent variables.

  • Sometimes, we may even call them predictors or features.

  • Even though we can call response and explanatory variables dependent and independent variables, they are conceptually different from the independent random variables that we encountered in probability.

  • Okay, that's our first pair.

  • Our second variable type pair is a quantitative variable versus a qualitative variable.

  • As the name suggests, quantitative variables take on quantities, which we can divide into two main groups: count variables and continuous variables.

  • Count variables take on non-negative integers, while continuous variables take on values from an interval.

  • We commonly call qualitative variables categorical variables.

  • These variables take on a small number of possible categories, also known as classes or levels.

  • It's common to assign numbers to the categories, but this does not convert the variables into count variables.

  • Okay, let's review the Commute and Chris setup and categorize the eight variables into count, continuous, and categorical variables.

  • Commute is measured in minutes.

  • Because any value exceeding zero is possible, commute is a continuous variable.

  • For similar reasons, departure, temp, and precip chance are also continuous variables.

  • They just take on values from different intervals.

  • Next, precip, season, and accident all take on two or four possible outcomes, making them categorical variables.

  • Last is police, which takes on non-negative integers, making it a count variable.
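The classification above can be written down as a small lookup table. This is only a sketch: the identifier `precip_chance` is an assumed spelling of the spoken name "precip chance", and the type labels are plain strings chosen for illustration.

```python
# Variable types as stated in the lecture. "precip_chance" is an assumed
# identifier for the variable spoken as "precip chance".
variable_types = {
    "commute": "continuous",
    "departure": "continuous",
    "temp": "continuous",
    "precip_chance": "continuous",
    "precip": "categorical",
    "season": "categorical",
    "accident": "categorical",
    "police": "count",
}

# Group the variable names by their type.
by_type = {}
for name, kind in variable_types.items():
    by_type.setdefault(kind, []).append(name)

# Count and continuous variables are quantitative;
# categorical variables are qualitative.
quantitative = by_type["count"] + by_type["continuous"]
qualitative = by_type["categorical"]
```

Here `quantitative` holds five names and `qualitative` holds three, which accounts for all eight variables.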

  • We can subdivide categorical variables further into nominal and ordinal variables.

  • If there's not a meaningful order to the categories, then it's a nominal variable.

  • In the case where we assign numbers to categories, the numbers only act as labels.

  • If there is a meaningful order to the categories, then it's an ordinal variable.

  • In the case where numbers are assigned to the categories, the numbers communicate the order.

  • Let's use season from the Commute and Chris setup as an example.

  • If we assign 1 to winter, 2 to spring, 3 to summer, and 4 to fall, then the categories follow the calendar seasons in sequence and therefore have meaningful order.

  • This makes season an ordinal variable.

  • If instead we assign numbers based on alphabetical order of the seasons, then we do not have meaningful order to the categories.

  • And this makes season a nominal variable.
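Here is a short sketch of the two coding schemes for season. The dictionary names are hypothetical; the calendar numbering is the one from the example, and the alphabetical numbering is simply what sorting the four names produces.

```python
# The four seasons, listed in the calendar order used in the example.
seasons = ["winter", "spring", "summer", "fall"]

# Calendar coding: 1 = winter, 2 = spring, 3 = summer, 4 = fall.
# The numbers communicate the sequence, so this coding treats season as ordinal.
calendar_code = {name: k for k, name in enumerate(seasons, start=1)}

# Alphabetical coding: the numbers follow spelling, not the calendar,
# so they act only as labels and season is treated as nominal.
alphabetical_code = {name: k for k, name in enumerate(sorted(seasons), start=1)}

# Under the calendar coding, a smaller number means "earlier in the sequence".
spring_before_summer = calendar_code["spring"] < calendar_code["summer"]

# Under the alphabetical coding, the same comparison only reflects spelling.
fall_before_winter = alphabetical_code["fall"] < alphabetical_code["winter"]
```

Either way, season stays a categorical variable: the codes are labels, and whether their order means anything depends on how we assigned them.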

  • Now that we've explored variable types, let's establish basic notation that we'll use throughout this course.

  • We denote variables in general by the letter x.

  • If there are multiple variables, we use the subscript j to distinguish between variables.

  • However, it's common to use the letter y to denote response variables.

  • We use p to represent the number of variables in a dataset, excluding the response variable if there is one.

  • This means j can take on integer values from 1 to p.

  • For example, x sub 2 represents the second explanatory variable.

  • Now, what if we want to refer to a specific observation of a variable?

  • We use subscript i for this, and the letter n represents the total number of observations in the dataset.

  • This means i can take on integer values from 1 to n.

  • Using the Commute and Chris scenario as an example, the fifth observation contains values from the fifth recorded day.

  • These include y sub 5, the response variable data point recorded on that day, and x sub 5 comma 1 through x sub 5 comma p, data points for the explanatory variables from the same day.

  • But a word of caution about subscripts.

  • When x has two numbers in its subscript, the first number is i and the second number is j, as we have shown.

  • However, if x has only one number in its subscript, it can be either i or j.

  • So, how can we identify which is which?

  • Well, it depends on the context.

  • Make sure you read carefully.

  • In general, if there is only one x variable, that is, p equals 1, then there is no purpose for subscript j.

  • So, in this case, the one number in the subscript is usually i.

  • However, if there are multiple x variables, then the one number in the subscript is usually j.
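To make the subscripts concrete, here is a minimal sketch with made-up numbers (they are not data from the Commute and Chris setup). The helper `x(i, j)` is a hypothetical convenience that translates the 1-based subscripts used here into Python's 0-based indexing.

```python
# Made-up toy data for illustration only:
# n = 5 observations (rows) and p = 2 explanatory variables (columns).
X = [
    [7.50, 31.0],
    [7.75, 28.0],
    [8.00, 35.0],
    [7.25, 40.0],
    [8.25, 33.0],
]
# One made-up response value per observation.
y = [42.0, 47.5, 55.0, 38.0, 61.5]

n = len(X)     # number of observations
p = len(X[0])  # number of explanatory variables (response excluded)


def x(i, j):
    """Return x_{i,j}: observation i of variable j, both counted from 1."""
    return X[i - 1][j - 1]


# The fifth observation: y_5 together with x_{5,1} through x_{5,p}.
y_5 = y[5 - 1]
x_5 = [x(5, j) for j in range(1, p + 1)]

# With a single subscript and several x variables, the number is usually j:
# here x_2 is the whole second explanatory variable, one value per observation.
x_2 = [x(i, 2) for i in range(1, n + 1)]
```

The row `x_5` reads across the variables for one day, while the column `x_2` reads down the observations for one variable, which is the distinction the single-subscript caution is about.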

  • You may recall from a probability course that we use uppercase letters to represent random variables, such as capital X and capital Y.

  • We can add subscripts to these letters the same way.

  • Introducing subscript i adds clarity, but it can also make equations or expressions messy and difficult to read.

  • We combat this problem by moving to vector and matrix notation.

  • If we use a matrix to represent a dataset, rows represent the observations, while columns represent the variables.

  • We'll see more of this in future sections, but for now, let's review some basic facts about matrices.

  • First, for matrix A, A superscript T is A's transpose.

  • Transposing simply means swapping the rows and columns so that the k-th column becomes the k-th row, and vice versa.

  • Notice that transposing a matrix also reverses its dimensions.

  • If A is an a-by-b matrix, then A transpose is a b-by-a matrix.

  • And second, A superscript negative one is A's inverse.

  • If we multiply a matrix by its inverse in either order, we will get the identity matrix.

  • Note that the identity matrix has ones on its diagonal and zeros elsewhere.
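The two matrix facts above can be checked on small examples. This is a sketch using only Python's standard library; the helper names `transpose`, `matmul`, and `inverse_2x2` are our own, and the inverse formula shown only covers 2-by-2 integer matrices with a nonzero determinant.

```python
from fractions import Fraction


def transpose(A):
    """Swap rows and columns: the k-th column of A becomes the k-th row."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def inverse_2x2(M):
    """Invert a 2-by-2 integer matrix exactly, using fractions."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular, so it has no inverse")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]


# A 2-by-3 matrix: its transpose is 3-by-2.
A = [[1, 2, 3],
     [4, 5, 6]]
A_T = transpose(A)

# A 2-by-2 matrix with determinant 10, so its inverse exists.
M = [[4, 7],
     [2, 6]]
M_inv = inverse_2x2(M)

identity = [[1, 0],
            [0, 1]]

# Multiplying in either order gives the identity matrix.
left = matmul(M_inv, M)
right = matmul(M, M_inv)
```

Exact fractions are used instead of floating-point numbers so that both products compare equal to the identity matrix without rounding error.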

  • One of the enemies in this course is confusion.

  • We'll try to minimize confusion by using clear and consistent notation.

  • However, don't assume that the conventions that we use here are universal.

  • Remember, notation only represents concepts.

  • Different authors may use different notation to suit their needs.

  • They may even use the same notation for different but similar concepts.

  • So, train yourself to distinguish the concept from the notation.
