Now let’s look more formally at how the theoretical regression model is developed in the case of simple linear regression.

The following graph will help us illustrate how we develop our mathematical model. Consider that this is our entire population; in other words, this is not a sample but ALL possible values for this situation. Be aware as we develop this THEORETICAL MODEL that once we are working with data we will use slightly different notation, indicating that what we have are ESTIMATES of the assumed THEORETICAL MODEL.

As we recently mentioned, our model assumes that the MEAN of our outcome, Y, is linearly related to X. The red line indicates this TRUE LINEAR RELATIONSHIP in this tiny population. We need to write our assumed theoretical model in mathematical notation. From algebra we know that the right side of our equation should look something like mX + b. However, in statistics we will use Greek letters and subscripts to denote our coefficients. In statistics we might denote the mean by the Greek letter mu, but here we need to indicate that the mean of Y changes with X, so we prefer to use the expected value notation. Let’s reveal this equation and go through it carefully.

The notation we will use to represent the equation of the line in the population, our theoretical regression model, is:

E[Y|X] = Beta_0 + Beta_1(X)

On the left side of our theoretical model we have E[Y|X], which denotes the expected value of Y given X. In statistical notation, this represents the TRUE POPULATION MEAN of Y for a given value of X. On the right side of our theoretical model we have a linear equation where the two coefficients are denoted by the Greek letters Beta_0 and Beta_1. These are our statistical PARAMETERS for the POPULATION.
• We have E[Y|X] = Beta_0 + Beta_1(X)
• The slope is the coefficient in front of X, which is Beta_1
• The intercept is the term that does not multiply X, which is Beta_0

Notice that the parentheses on the right side indicate multiplication.
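As a small numerical illustration of how to read this equation, the sketch below evaluates the mean line at a few values of X. The values Beta_0 = 2.0 and Beta_1 = 0.5 and the function name mean_of_y_given_x are made up for illustration; they do not come from the slides.

```python
# Hypothetical parameter values, chosen only for illustration.
BETA_0 = 2.0  # intercept: the mean of Y when X = 0
BETA_1 = 0.5  # slope: change in the mean of Y per one-unit increase in X


def mean_of_y_given_x(x, beta_0=BETA_0, beta_1=BETA_1):
    """Return E[Y|X = x] = beta_0 + beta_1 * x."""
    return beta_0 + beta_1 * x


# Evaluate the population mean of Y at a few values of X.
for x in (0, 1, 2, 10):
    print(f"E[Y|X={x}] = {mean_of_y_given_x(x)}")
```

With these made-up values, the mean of Y at X = 0 is Beta_0 = 2.0, and each one-unit increase in X raises the mean of Y by Beta_1 = 0.5.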
Once we have values, we would multiply Beta_1 by a specific value of X and then add the result to Beta_0.

I will try to reserve brackets, such as [Y|X], for mathematical function notation, where the Y and X are being “plugged” into the function on the left side. This is designed to distinguish function notation clearly from multiplication, for which we will use parentheses. Other materials, such as the textbook or other resources you find, may use parentheses and brackets interchangeably, and we may do so ourselves in some situations, for example to group complex calculations clearly. Try to be careful to understand notation fully. If you are unsure, look at an example we provide with numbers, or ask and we will clarify.

To review, so far our theoretical regression model assumes that the mean of Y varies linearly with X and can be written as
• E[Y|X] = Beta_0 + Beta_1(X)
where the intercept in the population is Beta_0 and the slope in the population is Beta_1. The Betas are statistical PARAMETERS but are often also called COEFFICIENTS for their mathematical role in the equation.

So we have started developing a model that satisfies one of our stated assumptions: that the mean of the outcome, Y, varies linearly with X. Now we need to look at how to incorporate the second assumption: that at each X there is an underlying normal distribution that is the same for all X’s except for the changing mean.

For each individual in the population there would be a point on our scatterplot. We are going to look specifically at the one pointed out on this slide. For the i-th individual we denote the values of the predictor and outcome as X_i and Y_i, respectively. To model this individual observation, we need to specify the individual error: how far this observation’s Y value (represented by the orange dot) is from the mean of Y at that X (represented by the red line).
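To make the idea of an individual error concrete, here is a minimal sketch with made-up numbers. The parameter values and the individual’s X_i and Y_i are hypothetical, chosen only so the arithmetic is easy to follow.

```python
# Hypothetical population parameters (illustration only).
BETA_0 = 2.0  # intercept
BETA_1 = 0.5  # slope

# Hypothetical values for the i-th individual (illustration only).
x_i = 4.0   # predictor value
y_i = 5.25  # observed outcome (the "dot" on the scatterplot)

# Mean of Y at this individual's X (the point on the line above x_i).
mean_at_x_i = BETA_0 + BETA_1 * x_i

# Individual error: how far the observed Y sits from the mean at that X.
# Positive means the dot is above the line, negative means below it.
error_i = y_i - mean_at_x_i

print(f"mean of Y at X = {x_i}: {mean_at_x_i}")
print(f"error for this individual: {error_i}")
```

Adding the error back to the mean recovers the individual’s exact Y value, which is the idea the individual-level model builds on.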
We will denote the true value of these errors in the population using the Greek letter epsilon with an “i” as a subscript. Now we can combine the theoretical model for the mean (the red line) with the notation for the individual error to arrive at a theoretical model for the actual Y value of the i-th individual.

Our theoretical model for the i-th individual can be written as:
• Y_i = Beta_0 + Beta_1(X_i) + Epsilon_i

There is a distinct X-VALUE (X_i) and ERROR TERM (Epsilon_i) for each individual, which together give their exact Y-VALUE (Y_i). The values in this theoretical model represent those for the entire population.
• Beta_0 = the intercept in the entire population; this is a PARAMETER
• Beta_1 = the slope in the entire population; this is a PARAMETER
• Epsilon_i = the true error of the i-th individual; these are also parameters, and there is one for each member of the population
And of course
• X_i = the value of the predictor for the i-th individual
• Y_i = the value of the outcome for the i-th individual

Soon we will look at how to estimate the unknown population parameters using data.

Our assumption, that the outcome is identically normally distributed except that the mean changes linearly with X, has a few important implications.
• Normal distributions are symmetric, and since the error term subtracts the mean, the errors must be centered at zero; that is, they have a mean of zero.
• Since only the mean changes, the standard deviation of each of these normal distributions is the same, and thus the standard deviation of the error term must be constant for all values of X.

These will be important for our model diagnostics and validation.
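The two implications above can be checked informally with a small simulation. This is only a sketch under stated assumptions: the parameter values, the common standard deviation SIGMA, the chosen X values, and the number of simulated individuals are all made up, and the standard library’s random number generator stands in for a real population.

```python
import random
import statistics

# Hypothetical values, chosen only for illustration.
BETA_0 = 2.0      # intercept
BETA_1 = 0.5      # slope
SIGMA = 2.0       # common standard deviation of the errors at every X
N_PER_X = 20_000  # simulated individuals at each X value

rng = random.Random(0)  # fixed seed so the run is repeatable

summary = {}
for x in (1.0, 5.0, 10.0):
    # Errors are drawn from the same normal distribution at every X.
    errors = [rng.gauss(0.0, SIGMA) for _ in range(N_PER_X)]
    # Individual model: Y_i = Beta_0 + Beta_1 * X_i + Epsilon_i
    ys = [BETA_0 + BETA_1 * x + e for e in errors]
    summary[x] = {
        "true_mean": BETA_0 + BETA_1 * x,
        "mean_y": statistics.fmean(ys),
        "sd_y": statistics.stdev(ys),
        "mean_error": statistics.fmean(errors),
    }

for x, s in summary.items():
    print(
        f"X={x}: true mean={s['true_mean']:.2f}, "
        f"simulated mean of Y={s['mean_y']:.2f}, "
        f"SD of Y={s['sd_y']:.2f}, mean error={s['mean_error']:.2f}"
    )
```

In a run like this, the simulated mean of Y at each X should land close to Beta_0 + Beta_1(X), the average error should be close to zero, and the standard deviation of Y should be close to SIGMA at every X, with small differences due to random sampling.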
Once we have estimated (and verified) this model, we can:
• interpret the slope
• consider whether the intercept is meaningful and, if so, interpret it
• estimate predicted values
• provide confidence intervals for the mean response at particular values of X
• provide prediction intervals for an individual response at particular values of X
• discuss statistical significance
• discuss the strength of association
and maybe more!

You may not yet feel comfortable with all of these ideas and notations. We will work on deepening your understanding throughout the semester. Do your best and ask questions. It will often take some time for these ideas to become clear enough that you feel comfortable with them.