Now let’s look more formally at how the theoretical regression model is developed in
the case of simple linear regression.
The following graph will help us illustrate how we develop our mathematical model.
Consider that this is our entire population – in other words, this is not a sample but
ALL possible values for this situation.
Be aware as we develop this THEORETICAL MODEL that once we are working
with data we will have slightly different notation indicating that what we have are
ESTIMATES of the assumed THEORETICAL MODEL.
As we recently mentioned, our model assumes that the MEAN of our outcome, Y, is
linearly related to X. The red line indicates this TRUE LINEAR RELATIONSHIP in
this tiny population. We need to write our assumed theoretical model in
mathematical notation. From algebra we know that the right side of our equation
should look something like mX + b. However, in statistics we will use Greek letters
and subscripts to denote our coefficients.
From statistics we might denote the mean as the Greek letter MU but here we need
to indicate that the mean of Y is changing with X so we prefer to use the expected
value notation. Let’s reveal this equation and go through it carefully.
The notation we will use to represent the equation of the line in the population, our
theoretical regression model, is:
E[Y|X] = Beta_0 + Beta_1(X)
On the left side of our theoretical model we have E[Y|X] which denotes the
expected value of Y given X. This represents the TRUE POPULATION MEAN of Y
for a given value of X using statistical notation.
On the right side of our theoretical model we have a linear equation where the two
coefficients are denoted by the Greek letters Beta_0 and Beta_1. These are our
statistical PARAMETERS for the POPULATION.
• We have E[Y|X] = Beta_0 + Beta_1(X).
• The slope is the coefficient in front of the X, which is Beta_1.
• The intercept is the term that is not multiplied by X, which is Beta_0.
Notice that the parentheses on the right side indicate multiplication. Once we have
values, we would multiply Beta_1 times a specific value of X and then add this result
to Beta_0.
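As a small illustration of "multiply, then add", here is a minimal Python sketch. The values of beta_0 and beta_1 and the function name mean_of_y_given_x are made up for this example; in practice the population parameters are unknown.

```python
# Hypothetical parameter values, chosen only for illustration.
# In a real problem the population Betas are unknown.
beta_0 = 2.0   # assumed population intercept
beta_1 = 0.5   # assumed population slope


def mean_of_y_given_x(x, beta_0, beta_1):
    """Return E[Y|X = x] = Beta_0 + Beta_1(x): multiply first, then add."""
    return beta_0 + beta_1 * x


# Evaluate the population mean of Y at a few specific values of X.
for x in [0, 4, 10]:
    print("X =", x, " E[Y|X] =", mean_of_y_given_x(x, beta_0, beta_1))
```

With these assumed values, X = 4 gives 2.0 + 0.5(4) = 4.0, and X = 0 gives back the intercept, 2.0.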
I will try to reserve brackets, such as [Y|X], for mathematical function notation, where
the Y and X are being “plugged” into the function on the left side. This is designed to
distinguish function notation clearly from multiplication, for which we will use
parentheses.
Other materials, such as the textbook or other resources you find, may use parentheses and
brackets interchangeably, and we may do so ourselves in some situations, for example to
make the grouping in complex calculations clearer.
Try to be careful to understand notation fully. If you are unsure, look at an example we
provide with numbers or ask and we will clarify.
To review, so far our theoretical regression model assumes that the mean of Y varies
linearly with X and can be written as
• E[Y|X] = Beta_0 + Beta_1(X), where the intercept in the population is Beta_0 and the
slope in the population is Beta_1.
The Betas are statistical PARAMETERS but are often also called COEFFICIENTS for their
mathematical role in the equation.
So we have started developing a model which satisfies one of our stated assumptions, that
the mean of the outcome, Y, varies linearly with X. Now we need to look at how to
incorporate the second assumption, that at each X there is an underlying normal
distribution that is the same for all X’s except for the changing mean.
For each individual in the population there would be a point on our scatterplot. We are
going to look specifically at the one pointed out on this slide.
For the i-th individual we denote the values of the predictor and outcome as X_i and Y_i
respectively. In order to model this individual observation, we need to specify the individual
error, how far this observation’s Y value (represented by the orange dot) is from the mean
Y (represented by the red line).
We will denote the true value of these errors in the population using the Greek letter
epsilon with an “i” as a subscript.
Now we can use the theoretical model for the mean (the red line) combined with the
notation for the individual error, to arrive at a theoretical model for the actual Y value of the
i-th individual.
Our theoretical model for the i-th individual can be written as:
• Y_i = Beta_0 + Beta_1(X_i) + Epsilon_i
Each individual has a distinct X-VALUE (X_i) and ERROR TERM (Epsilon_i), which together
give their exact Y-VALUE (Y_i).
The values in this theoretical model represent those for the entire population.
• Beta_0 = the intercept in the entire population – this is a PARAMETER
• Beta_1 = the slope in the entire population – this is a PARAMETER
• Epsilon_i = the true error of the i-th individual – like the Betas, these are unknown
population quantities, and there is one for each member of the population
And of course
• X_i = the value of the predictor for the i-th individual and
• Y_i = the value of the outcome for the i-th individual
Soon we will look at how to estimate the unknown population parameters using
data.
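To make the individual-level model concrete, here is a minimal Python sketch that builds a tiny made-up population from Y_i = Beta_0 + Beta_1(X_i) + Epsilon_i. The parameter values, the error standard deviation sigma, the population size of 5, and the range of X are all assumptions chosen for illustration; none of them come from the slides.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population values (assumptions for illustration only).
beta_0 = 2.0   # population intercept
beta_1 = 0.5   # population slope
sigma = 1.0    # standard deviation of the errors

population = []
for i in range(5):
    x_i = random.uniform(0, 10)         # predictor value for individual i
    epsilon_i = random.gauss(0, sigma)  # true error for individual i
    mean_i = beta_0 + beta_1 * x_i      # point on the line: E[Y|X = x_i]
    y_i = mean_i + epsilon_i            # outcome = mean + individual error
    population.append((x_i, y_i, epsilon_i))

# The error is how far each Y value sits from the line at that X.
for x_i, y_i, epsilon_i in population:
    distance_from_line = y_i - (beta_0 + beta_1 * x_i)
    print(round(x_i, 2), round(y_i, 2), round(distance_from_line, 2))
```

In this sketch we can recover each Epsilon_i exactly only because we chose the Betas ourselves. With real data the Betas are unknown, so the true errors are unknown as well.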
Our assumption - that the outcome is identically normally distributed except that the mean
changes linearly with X - has a few important implications.
• Normal distributions are symmetric and, since the error term subtracts the mean, the errors
must be centered at zero; that is, they have a mean of zero.
• Since only the mean changes, the standard deviation of each of these normal distributions
is the same and thus the standard deviation of the error term must be constant for all values
of X.
These will be important for our model diagnostics and validation.
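These two implications can be checked informally by simulation. The following Python sketch assumes made-up values for the Betas and for the common standard deviation (called sigma here, a name not used in the slides), draws many Y values at a few fixed X values, and summarizes the resulting errors. The helper name simulate_errors_at is hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population values (assumptions for illustration only).
beta_0, beta_1, sigma = 2.0, 0.5, 2.0


def simulate_errors_at(x, n=20000):
    """Draw n outcomes from the normal distribution at X = x and
    return their errors (each Y minus the mean of Y at that X)."""
    mean_y = beta_0 + beta_1 * x
    ys = [random.gauss(mean_y, sigma) for _ in range(n)]
    return [y - mean_y for y in ys]


# Summarize the errors at several X values: the mean should be near 0
# and the standard deviation near sigma, whatever X is.
summary = {}
for x in [1, 5, 9]:
    errors = simulate_errors_at(x)
    summary[x] = (statistics.mean(errors), statistics.stdev(errors))
    print("X =", x, " mean error:", round(summary[x][0], 3),
          " SD of errors:", round(summary[x][1], 3))
```

The mean of Y differs at X = 1, 5 and 9, but the errors should look alike at all three: centered near zero with a spread close to the assumed sigma.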
Once we have estimated (and verified) this model, we can interpret the slope, consider
whether the intercept is meaningful and, if so, interpret it, estimate predicted values, provide
confidence intervals for the mean response at particular values of X, provide prediction
intervals for an individual response at particular values of X, discuss statistical significance,
discuss the strength of association, and maybe more!
You may not yet feel comfortable with all of these ideas and notations. We will be working on
deepening your understanding throughout the semester. Do your best and ask questions.
Often it will take some time for these ideas to become clear enough that you become more
comfortable with them.