Chapter 2. Simple Linear Regression.
2:1 Basic statistical definitions and concepts.
2:1.1 Random variables and distributional properties.
Let Y be a random variable with probability density function f(Y). The probability density function (pdf)
describes the probability that the variable takes values in any interval (Y, Y + dY). The mean or expectation of Y
is defined as:

E\{Y\} = \int Y \, f(Y) \, dY
The expected value of a linear function of random variables is equal to the same linear function of the
expected values of the variables.
Variance of Y is defined on the basis of the definition of expected value:
V\{Y\} = \sigma^2\{Y\} = \sigma^2_Y = E\{(Y - E\{Y\})^2\}
This means that the variance of Y (for which there are various notations) is the expected value of the square
of the difference between Y and the expected value of Y.
Covariance is defined for two random variables Y and Z as:
\mathrm{Cov}\{Y,Z\} = \sigma\{Y,Z\} = \sigma_{YZ} = E\{(Y - E\{Y\})(Z - E\{Z\})\}
Thus, the variance of a random variable is equal to its covariance with itself. The variance of a linear
combination of random variables is the sum of all pairwise covariances, including that of each variable with
itself, each multiplied by the product of the corresponding coefficients. In this case the equation is clearer:
\sigma^2\left\{\sum_{i=1}^{n} a_i Y_i\right\} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, \sigma\{Y_i, Y_j\}
Random variables are said to be independent when the probability of any pair of values (Yi,Zi) is equal to the
product of the probability of a value Yi (regardless of Z) times the probability of a value Zi (regardless of Y).
Independence implies that the covariance and correlation between the variables are zero. The converse (0
correlation implies independence) is true only if Y and Z are jointly normally distributed.
The correlation between two random variables is their covariance divided or scaled by the product of their
standard deviations. Equivalently, the correlation is the covariance between the standardized forms of the
random variables:
\rho\{Y,Z\} = \rho\{y,z\} = \mathrm{cov}\{y,z\} = \frac{\sigma\{Y,Z\}}{\sigma_Y \, \sigma_Z}
where
y = \frac{Y - E\{Y\}}{\sigma_Y}, \qquad z = \frac{Z - E\{Z\}}{\sigma_Z}
All of these definitions are for population parameters. For the estimates of these parameters on the basis of
samples, we use the "hat" notation or regular letters such as s, s², sxy, rxy, etc. Equations to estimate parameters
are the usual computational formulas and are typically based on minimization of variance and bias. Keep in
mind that parameters have set values and no variance, whereas estimated parameters are random variables
with distributions that depend on the distribution of the original population, on the manner in which the
population was sampled, and on the way in which the parameter estimation was obtained. Once again, we
typically use random sampling (all elements of population have the same probability of being selected) and use
linear (e.g. average as estimator of mean, b1 as estimator of β1) and quadratic (s² as estimator of variance)
functions of the observations.
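As a concrete, purely illustrative companion to these definitions, the following sketch assumes Python with numpy is available; the population values and variable names are ours, not part of the notes. It draws one sample and computes the usual estimators of the mean, variance, covariance, and correlation:

    import numpy as np

    rng = np.random.default_rng(1)
    mu_Y, sigma_Y = 10.0, 2.0          # population parameters (fixed values, no variance)
    Y = rng.normal(mu_Y, sigma_Y, 50)  # one random sample of n = 50
    Z = 0.5 * Y + rng.normal(0, 1, 50) # a second variable correlated with Y

    Ybar = Y.mean()                    # estimator of E{Y}
    s2   = Y.var(ddof=1)               # s^2, estimator of sigma^2{Y}
    s_yz = np.cov(Y, Z, ddof=1)[0, 1]  # s_YZ, estimator of sigma{Y,Z}
    r_yz = np.corrcoef(Y, Z)[0, 1]     # r_YZ, estimator of rho{Y,Z}

    print(Ybar, s2, s_yz, r_yz)        # the estimates change from sample to sample

Recalculating with a different seed shows that the estimates are themselves random variables, while the population parameters stay fixed.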
2:1.2 Logic behind hypothesis testing.
Statistical hypotheses are statements about populations, typically about the values of one or more
parameters or calculations based on the parameters of the population. Hypothesis testing is a means of
evaluating the probability that the statements are correct, and transforming the continuous probabilities into
discrete decisions. As an arbitrary rule accepted by most people, statements that have a probability smaller than
5% are described as “rejected.” It is wise to keep in mind that there is nothing special about the 5% value other
than the fact that most people agree on using it.
Hypotheses can be “rejected” or “not rejected,” but this is not to be taken as disproving or proving the
statements. Even the proper use of statistical methods will inevitably result in some wrong decisions; nothing can
be done to prevent this. Often, we never find out whether the decisions were actually wrong or right.
There are two types of errors in hypothesis testing: rejection of a correct hypothesis is called error type I and
its probability is α; non-rejection of an incorrect hypothesis is called error type II and it has a probability β.
“Power,” the probability of rejecting a hypothesis given that it is incorrect, is 1 - β. Power depends on α, sample
size, variance, and the size of the effect to be detected.
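How power depends on these quantities can be explored by simulation. The sketch below is ours, not part of the original notes; it assumes Python with numpy and scipy and estimates the power of a two-sample t test for arbitrary illustrative values of α, n, σ, and the effect size:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    alpha, n, sigma, effect = 0.05, 10, 1.0, 1.0   # try changing these values
    reps = 2000

    rejections = 0
    for _ in range(reps):
        a = rng.normal(0.0, sigma, n)              # treatment 1
        b = rng.normal(effect, sigma, n)           # treatment 2, shifted by the effect
        t, p = stats.ttest_ind(a, b)
        rejections += p < alpha

    print("estimated power:", rejections / reps)   # larger n or effect -> more power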
When normality, independence, and homogeneity of variance of the random variables (typically the εi's) can
be assumed, we can make statistical inferences by using known distributions. For example, the sum of the
squares of many independent standard normal variables has a χ² distribution with degrees of freedom (which is
also the expected value of the χ²) equal to the number of independent standard normal variables added. The
variance estimated from a sample of a normal distribution, once scaled as (n - 1)s²/σ², thus follows a χ²
distribution with n - 1 degrees of freedom, and this distribution is used to make inferences about the unknown
variance. Similarly, it can be demonstrated that the calculations used to obtain t and F values should in fact lead
to random variables with t and F distributions, provided that the assumptions and the null hypothesis are correct.
In general, the popularity of the common parametric statistical tests we use comes from the fact that, under the
assumption of normality, one can analytically derive the distributions of the random variables resulting from
doing the usual calculations to obtain estimated parameters.
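The link between the sample variance and the χ² distribution can be checked numerically. This is a minimal sketch of ours, assuming numpy and scipy are available; it verifies by simulation that (n - 1)s²/σ² behaves like a χ² variable with n - 1 df:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, sigma2, reps = 8, 4.0, 5000

    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    s2 = samples.var(axis=1, ddof=1)              # one s^2 per simulated sample
    scaled = (n - 1) * s2 / sigma2                # (n-1) s^2 / sigma^2

    # the mean of a chi-square equals its df; compare simulation with theory
    print(scaled.mean(), n - 1)
    print(stats.kstest(scaled, "chi2", args=(n - 1,)).pvalue)  # should not be tiny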
The logic behind any typical parametric statistical test (say an F test) is thus as follows. Suppose that you
performed an analysis of variance to determine if two treatments give different results. The null hypothesis is
that the two means are identical. The assumptions are normality with equal variances and independence of
errors. Further, suppose that you find a very large value of F that is “highly significant” (say F=40). The brief
version of the logic is just to say "F is significant, thus the means are different." The long version is:
The probability of
(the model assumptions being true, AND
the null hypothesis being correct, AND
obtaining a value of F this large)
is extremely low.
THEREFORE,
Because the assumptions are met (I checked them), and I did in fact obtain a large F,
I will reject the validity of the null hypothesis.
In other words, the F value obtained, and thus the data obtained and more extreme values of the data, are very
unlikely if Ho is true. Therefore, either one observed an unlikely event, or the null hypothesis is false. Note that
although the decision is formally correct because the method was applied correctly, it can still be factually
incorrect if in reality the hypothesis is true (error type I, with probability α). One can visualize this by imagining
that your professor flips a coin 10 times and obtains heads every single time, after placing a sizable wager on
getting all heads. Your choices upon witnessing such coin-flipping prowess are to (a) make no comment, (b)
state that your professor cheated (i.e., used a coin that is not “fair”), or (c) state that your professor is one lucky
devil. Unless you obtain the coin for further inspection you will be left with those choices, and the choice will be
yours.
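For the coin example the arithmetic is simple: under the hypothesis of a fair coin, P(\text{10 heads in 10 flips}) = 0.5^{10} \approx 0.001, far below the conventional 0.05. That is why option (b) is tempting even though the coin may in fact be fair.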
2:2 Simple Linear Regression.
2:2.1 Regression
The name comes from Sir Francis Galton, who invented the method and first applied it to study the height of
children as a function of the height of their parents. He observed that the heights of children appeared to
"regress" toward the average for the group, which he considered a regression to mediocrity
(http://www.mugu.com/galton/index.html).
2:2.2 Uses of regression.
Regression analysis serves three main purposes that in practice tend to overlap:
- Description and understanding of the relationship between Y and X. For example, one may be interested in determining whether temperature affects the growth rate of plants and how much growth rate changes per degree of temperature. The estimated parameters have to make sense mechanistically or biologically. We are interested in the value of the parameters and in the potential effects of X on Y.
- Control or management of a process. In the example above, the relation between growth rate and temperature may be studied in a greenhouse so that temperature can be adjusted to approach a desired growth rate.
- Prediction of the value of an unmeasured variable on the basis of measurement of the X variables. In the example above, the effects of temperature on growth rate can be modeled through regression analysis and the model can be applied to predict how much growth should be expected in greenhouses where temperature is measured but growth is not measured. For purely predictive purposes, we are not concerned with the mechanistic interpretation of parameters. We want to predict Y with the greatest precision.
Although these uses can overlap, the success in using regression for each one can differ for a given data
set. For example, a model and data set can be great for prediction, but very poor for description and
understanding. This is the case in multiple linear regression when the X variables are highly correlated, a
situation called "collinearity" or "multicollinearity" that we will explore in detail in later chapters.
2:2.3 Model and Assumptions.
The SLR model can be stated in regular notation as:
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
where the subscript i identifies each of the n observations, \beta_0 and \beta_1 are the parameters (intercept and
slope, respectively), X_i is a constant of known value for each observation, and each \varepsilon_i is a random variable with
mean 0, variance \sigma^2, and uncorrelated with any other \varepsilon_j. Note that there are as many error random variables as
observations.
If one adds the assumption that these random variables are all normally distributed, then it is possible to use
the traditional methods for inference based on the normal and related distributions. However, the equations that
describe the variance of the estimated parameters and their expectations are valid regardless of the shape of
the distribution of the errors. Likewise, estimation of the parameters by minimization of the sum of squares of the
errors (SSE) is unbiased and has minimum variance among linear unbiased estimators, regardless of the
distribution of the errors.
2:2.4 Estimation of parameters.
2:2.4.1 OLS: minimization of SSE. Normal equations.
Parameters are estimated so as to minimize the sum of squares of the errors (SSE), or deviations around the
linear model. This is known as Ordinary Least Squares (OLS). Minimization of the SSE is achieved by application
of calculus. Use of calculus is not strictly necessary, as one could get as close to the correct solution as desired
simply by trial and error; however, calculus saves time. The SSE is minimized by finding the values of the
estimated parameters for which the partial derivatives of the SSE with respect to each parameter are zero. This
calculation leads to a set of two equations called the Normal Equations. In multiple linear regression there will be
as many equations as parameters to be estimated. The normal equations are important because they are the basis
for Path Analysis, which we will study in the second half of the course.
\sum Y = n b_0 + b_1 \sum X
\sum XY = b_0 \sum X + b_1 \sum X^2
(Normal Equations)
The estimated parameters b0 and b1 are obtained by solving the simultaneous equations for b0 and b1. As a
result of the calculations, the sum of all residuals or errors is always zero; the sum of the fitted values Yhat
equals the sum of the observed values Y; and the line goes through the point (Xbar, Ybar).
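As an illustration (ours, assuming Python with numpy; the data are simulated from an arbitrary linear model), the normal equations can be set up and solved directly, and the properties listed above can be checked numerically:

    import numpy as np

    rng = np.random.default_rng(4)
    X = np.linspace(0, 10, 12)
    Y = 2.0 + 0.8 * X + rng.normal(0, 1, X.size)   # data from a known linear model

    # Normal equations:  [n       sum(X)  ] [b0]   [sum(Y) ]
    #                    [sum(X)  sum(X^2)] [b1] = [sum(XY)]
    A = np.array([[X.size, X.sum()], [X.sum(), (X**2).sum()]])
    rhs = np.array([Y.sum(), (X * Y).sum()])
    b0, b1 = np.linalg.solve(A, rhs)

    e = Y - (b0 + b1 * X)                          # residuals
    print(b0, b1)
    print(e.sum())                                 # ~0: residuals sum to zero
    print(b0 + b1 * X.mean() - Y.mean())           # ~0: line passes through (Xbar, Ybar)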
2:2.4.2 Regression coefficients.
The slope is calculated with the following equation:
b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}
This equation can also be used to show that b1 is a linear function of the observed values of Y. This fact is
used to derive the variance and expected value for b1. The intercept is calculated based on the fact that the line
goes through the point defined by the averages of X and Y.
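The linearity of b1 in the Y values can be made explicit by writing b1 = Σ k_i Y_i with weights k_i = (X_i - Xbar) / Σ(X_i - Xbar)², which depend only on the X's. A short numerical check of ours (numpy assumed, arbitrary simulated data):

    import numpy as np

    rng = np.random.default_rng(5)
    X = np.arange(1.0, 11.0)
    Y = 1.5 + 0.5 * X + rng.normal(0, 0.3, X.size)

    k = (X - X.mean()) / ((X - X.mean())**2).sum()   # weights depend only on X
    b1_direct  = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean())**2).sum()
    b1_weights = (k * Y).sum()                       # b1 as a linear function of the Y's

    print(b1_direct, b1_weights)                     # identical (up to rounding)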
Units: in most cases Y, X, the ε's, and the estimated parameters are quantities. This means that they are
composed of numbers and units. If the units or the numbers are omitted, a great deal of ambiguity is introduced
in the analysis.
Consider an example in which per capita population growth rate of aphids is regressed on population
density. Per capita growth rate (Y) is the number of descendants produced per individual per year. It is
measured as individuals yr-1 individual-1. Population density (X) is measured in individuals m-2. The figure
without units or numbers only shows that Y declines with increasing X, but there is no indication of the range of
values, and the results cannot be compared with other studies. Moreover, the regression will always yield some
values for the estimated parameters, but the use of these values in population models can be correct only if one
knows the units of the parameters. Units for β0 are individuals yr-1 individual-1, whereas the units for β1 are
yr-1 m2 individual-1.
2:2.4.3 Estimated variances.
Because b1 is a linear combination of the Yi's, it is also a normally distributed random variable:
b_1 \sim N\left(\beta_1, \ \sigma^2\{b_1\}\right), \qquad \sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}
The estimated variance is, therefore:
S^2\{b_1\} = \frac{MSE}{\sum (X_i - \bar{X})^2} = \frac{\sum (Y_i - \hat{Y}_i)^2 / (n-2)}{\sum (X_i - \bar{X})^2}
The MSE is the SSE divided by the number of degrees of freedom. In general, for any model, number of
degrees of freedom of the error equals the number of observations minus the number of estimated parameters.
In this case the estimated parameters for the model are the slope and the intercept.
Similarly:
\sigma^2\{b_0\} = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum (X - \bar{X})^2} \right]
S^2\{b_0\} = MSE \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum (X - \bar{X})^2} \right]
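A small sketch of ours (numpy assumed, simulated data with arbitrary parameter values) that computes MSE, S²{b1}, and S²{b0} from these formulas:

    import numpy as np

    rng = np.random.default_rng(6)
    X = np.linspace(0, 20, 15)
    Y = 3.0 + 0.4 * X + rng.normal(0, 2, X.size)

    n = X.size
    Sxx = ((X - X.mean())**2).sum()
    b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / Sxx
    b0 = Y.mean() - b1 * X.mean()

    Yhat = b0 + b1 * X
    MSE = ((Y - Yhat)**2).sum() / (n - 2)          # SSE / dfe

    S2_b1 = MSE / Sxx                              # estimated variance of b1
    S2_b0 = MSE * (1.0 / n + X.mean()**2 / Sxx)    # estimated variance of b0
    print(MSE, S2_b1, S2_b0)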
2:2.5 Confidence intervals.
Confidence interval for slope: if \varepsilon_i \sim N_{\mathrm{ind}}(0, \sigma^2), then
\frac{b_1 - \beta_1}{S\{b_1\}} \sim t \ \text{with } n - 2 \ \text{df}.
The interval (CI) that, on average, will contain the true value of the parameter 100(1 - α)% of the times it is
calculated is:
CI: \ b_1 \pm S\{b_1\} \, t_{(n-2,\ 1-\alpha/2)}
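Putting the pieces together (a sketch of ours, assuming numpy and scipy; the data and parameter values are invented for illustration), the CI for the slope can be computed as:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    X = np.linspace(0, 20, 15)
    Y = 3.0 + 0.4 * X + rng.normal(0, 2, X.size)
    n, Sxx = X.size, ((X - X.mean())**2).sum()

    b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / Sxx
    MSE = ((Y - (Y.mean() + b1 * (X - X.mean())))**2).sum() / (n - 2)
    s_b1 = np.sqrt(MSE / Sxx)                         # S{b1}

    t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)      # t_(n-2, 1-alpha/2)
    print(b1 - t_crit * s_b1, b1 + t_crit * s_b1)     # 95% CI; usually covers the true 0.4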
2:2.6 Prediction of expected and individual values for a given Xh.
Typically, we are interested in making a prediction for, or estimating, the expected value of the response
variable Y for a given value Xh of the predictor variable X. We calculate a confidence interval for this prediction,
denoted “Y hat, given that X = Xh”.
A confidence interval for E\{Y_h\} is estimated by using
\hat{Y}_h = b_0 + b_1 X_h
\sigma^2\{\hat{Y}_h\} = \sigma^2\{\bar{Y} + b_1 (X_h - \bar{X})\}
Because \bar{Y} and b_1 are independent, we can write:
\sigma^2\{\hat{Y}_h\} = \sigma^2\{\bar{Y}\} + (X_h - \bar{X})^2 \, \sigma^2\{b_1\}
S^2\{\hat{Y}_h\} = MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X - \bar{X})^2} \right]
The term 1/n accounts for the variance due to the unknown E{Y}, estimated by the average of Y, whereas the
second term in the brackets reflects the variance due to the unknown slope, estimated by b1. Inspection of this
equation indicates that the variance of the predicted expected Y increases with the square of the distance from
Xh to the average of X.
The confidence interval for the prediction is calculated in the usual fashion:
CI: \ \hat{Y}_h \pm S\{\hat{Y}_h\} \, t_{(n-2,\ 1-\alpha/2)}
In certain cases one may be interested in calculating a CI for the result of an individual observation, instead of
the expected value of repeated observations. This is called a “prediction for a new observation”. The expectation
is the same as for \hat{Y}_h, but the variance is greater, because one must add the deviation of individual
observations from the mean:
\sigma^2\{\text{prediction}\} = \sigma^2 + \sigma^2\{\hat{Y}_h\}
S^2\{\text{pred}\} = MSE \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X - \bar{X})^2} \right]
The number 1 in the brackets reflects the uncertainty about the value of a random sample of size 1, even
when it is taken from a distribution of known mean. The rest of the terms reflect the uncertainty about the value
of the mean of the population, which is given by the uncertainty about the intercept and slope, as explained
above.
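Both intervals can be computed side by side. The following sketch is ours (numpy and scipy assumed, simulated data with arbitrary parameter values); note how the prediction interval is wider because of the extra MSE term:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    X = np.linspace(0, 20, 15)
    Y = 3.0 + 0.4 * X + rng.normal(0, 2, X.size)
    n, Sxx = X.size, ((X - X.mean())**2).sum()

    b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / Sxx
    b0 = Y.mean() - b1 * X.mean()
    MSE = ((Y - (b0 + b1 * X))**2).sum() / (n - 2)

    Xh = 15.0
    Yhat_h = b0 + b1 * Xh
    S2_mean = MSE * (1.0 / n + (Xh - X.mean())**2 / Sxx)        # variance of Yhat_h
    S2_pred = MSE * (1.0 + 1.0 / n + (Xh - X.mean())**2 / Sxx)  # variance for a new observation

    t_crit = stats.t.ppf(0.975, df=n - 2)
    print("CI for E{Yh}:", Yhat_h - t_crit * S2_mean**0.5, Yhat_h + t_crit * S2_mean**0.5)
    print("prediction interval:", Yhat_h - t_crit * S2_pred**0.5, Yhat_h + t_crit * S2_pred**0.5)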
2:2.7 ANOVA table
The total variation or SS of Y around its mean (SSTO) can be partitioned into two components: the variance
explained by the model or sum of squares of the regression (SSR), and the unexplained variation or sum of
squares of the errors (SSE). Similarly, the total deviation or difference between each observed value of Y and
the average for Y can be partitioned into a deviation between the regression and the average Y, and a deviation
between the observed value and the predicted value.
SSTO = SSR + SSE
SSTO = \sum (Y - \bar{Y})^2, \qquad \text{total } df = n - 1
SSR = \sum (\hat{Y} - \bar{Y})^2, \qquad df = 1
SSE = \sum (Y - \hat{Y})^2, \qquad dfe = n - 2
The coefficient of determination r2 is the proportion of all variation represented by the regression sum of
squares, and it is a general indicator of the adequacy of the model. However, the complete adequacy of the
model cannot be inferred just on the basis of r2, because models that are clearly non-linear can yield large
coefficients of determination. In SLR, the coefficient of determination equals the square of the correlation
coefficient.
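A quick numerical check of the decomposition and of the equality between r² and the squared correlation (a sketch of ours, numpy assumed, arbitrary simulated data):

    import numpy as np

    rng = np.random.default_rng(9)
    X = np.linspace(0, 20, 15)
    Y = 3.0 + 0.4 * X + rng.normal(0, 2, X.size)

    b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean())**2).sum()
    b0 = Y.mean() - b1 * X.mean()
    Yhat = b0 + b1 * X

    SSTO = ((Y - Y.mean())**2).sum()     # total, n - 1 df
    SSR  = ((Yhat - Y.mean())**2).sum()  # regression, 1 df
    SSE  = ((Y - Yhat)**2).sum()         # error, n - 2 df

    print(SSTO, SSR + SSE)                          # equal: SSTO = SSR + SSE
    print(SSR / SSTO, np.corrcoef(X, Y)[0, 1]**2)   # r^2 equals the squared correlation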
2:2.7.1 Degrees of freedom.
What are the degrees of freedom? Why are they called “degrees of freedom?” Degrees of freedom can be
understood at least in two ways, with and without linear algebra. David Lane (2001) provides the following
explanation in his HyperStat site:
“Estimates of parameters can be based upon different amounts of information. The number of independent
pieces of information that go into the estimate of a parameter is called the degrees of freedom (df). In general,
the degrees of freedom of an estimate is equal to the number of independent scores that go into the estimate
minus the number of parameters estimated as intermediate steps in the estimation of the parameter itself. For
example, if the variance, σ², is to be estimated from a random sample of N independent scores, then the degrees
of freedom is equal to the number of independent scores (N) minus the number of parameters estimated as
intermediate steps (one, μ is estimated by M) and is therefore equal to N-1.”
http://davidmlane.com/hyperstat/A42408.html
For most students, degrees of freedom are like the cashiers at the grocery store: you see them several
times a week, and they help you fulfill some of your needs, but you do not really know them. The problem with
teaching the details of degrees of freedom is that the concept is strongly dependent on linear algebra and
hyperspaces, which are not in the statistical toolbox of most students. We will use a somewhat operational
explanation of the concept of degrees of freedom (df).
Degrees of freedom is a number associated with a sum of squares (SS). The value of this number is equal
to the total number of independent observations (more correctly, the number of independent random variables)
used to calculate the sum of squares minus the total number of independent parameters estimated and used to
calculate the SS. This operational definition is easiest to apply to the SS of the residuals or errors (SSE) for any model.
For example, in a completely randomized design with n observations and 4 fixed treatments the SSE has n
terms in the summation, and one uses 4 estimated parameters, one for the mean of each treatment. Therefore,
dfe=n-4. When calculating the SS of the model, in this case there are four random variables (the averages for
the 4 treatments), and one estimation for the overall mean. Therefore, df of treatments is 3. In the case of
regression, where there are no discrete treatments but continuously varying predictors, it is easier to calculate
the df for the model as the difference between the df for the total SS (n-1) and the dfe. For a variety of
explanations of the concept of degrees of freedom visit Dr. C. H. Yu’s site and scroll to the last part of the
document.
Finally, the name “degrees of freedom” comes from the fact that df is the number of dimensions in which a
vector (whose squared length is the SS under consideration) can roam “freely.” Degrees of freedom are the
dimensions of the space on which the observation vector is projected. For this to make more sense, consider a
sample of 10 (X,Y) pairs and the regression of Y on X. Visualize each pair or observation as one dimension.
The sample forms a 10-dimensional space. The vector of ten Y values is a single vector in that space, as is the vector of
X's. The regression consists of projecting the 10-D vector Y onto the 2-D space spanned by X and the unit vector. The
component of vector Y that is not contained in that projection is perpendicular (orthogonal) to the model space
and exists in the other 8 dimensions.
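The geometry can be verified numerically. In this sketch of ours (numpy assumed, arbitrary simulated data), the residual vector is orthogonal to the unit vector and to X, so it lives entirely in the remaining n - 2 dimensions:

    import numpy as np

    rng = np.random.default_rng(10)
    n = 10
    X = rng.uniform(0, 10, n)
    Y = 1.0 + 2.0 * X + rng.normal(0, 1, n)

    # project Y onto the 2-D space spanned by the unit vector and X
    M = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(M, Y, rcond=None)
    e = Y - M @ b                        # residual vector

    print(e @ np.ones(n), e @ X)         # both ~0: e is orthogonal to the model space
    print(np.linalg.matrix_rank(M))      # 2 dimensions used by the model; n - 2 remain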
2:2.8 Complete example with Excel
The file xmpl_PfertParSim.xls contains a simulated dataset to explore the concepts of distribution of
samples and estimated parameters. This example shows the application of SLR in a situation with replicate
values of Y for each X. The data are simulated and the true model is known to be quadratic, although it can be
made effectively linear by specifying a coefficient identical to zero for the quadratic term (b2 = 0). Excel is used to
obtain random data sets that meet all assumptions of SLR, except for the linearity of the model when b2 is not
zero. An ANOVA table is calculated, including terms for the lack-of-fit SS (SSLOF) and pure error. The worksheet can be easily
modified to perform Monte Carlo analysis to simulate the distribution of estimated parameters. This example can
be used together with the applets available at the Interactive Regression Simulations web site. The Excel
spreadsheet can show you the details of what happens in each sampling event simulated by the web site. The
website allows you to easily obtain histograms for observed distributions of estimated parameters and other
statistics based on hundreds of samples. It is recommended that you explore the example in the Excel file by
recalculating the sheet a few times and examining the formulas. Then, move to the web site and set the
parameters to the same values used in the spreadsheet.
In order to use the xmpl_PfertParSim.xls file for this example, make sure that cell M4 has a zero in it. The
columns in the spreadsheet are as follows:
A. P added: number of units of phosphorus added to the crop.
B. Yield: number of units of crop yield observed in a sample.
C. Ybar: overall average for crop yield over all observations in the sample.
D. Ybari: average crop yield within each level of P applied.
E. Yhat: estimated crop yield based on the linear regression of Y on X.
F. E{Y|X}: expected value of crop yield for each level of X based on the known slope and intercept.
G. Lofit: lack of fit; difference between Ybari and Yhat.
H. Puree: pure error; difference between observed yield and Ybari.
I. Totale: total error; difference between observed yield and Yhat.
J. Truee: difference between the observed yield and E{Y|X}; true total error.
2:2.8.1 True model
The true model is known in this simulation. This differs from most real situations in which we do not know the
real model, so the shape of the model (function used) is selected a priori and the parameters are estimated by
least squares. Because we know the model, we can obtain and study the observed distributions of the
estimated parameters.
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon, \qquad \varepsilon_i \sim N(0, \sigma)
The true model is a polynomial of degree smaller than or equal to 2. For this example, the coefficient for the
quadratic term is set to 0, so the true model is effectively linear.
2:2.8.2 First point: ei is not εi
The first concept illustrated by this example is that the residual ei, calculated as the difference between the
estimated and observed value, is not the true realization of the random variable εi. The residual is calculated by
using the estimated expected value of Y instead of the true value, which in a real situation is unknown. Compare
the Totale column with the Truee column and the sums of their values. Recalculate the sheet several times
while keeping your eyes on these columns. Note that the sum of the realizations of the true error is not
necessarily zero, whereas the sum of Totale is always zero, because of the way we estimate the parameters by
minimization of the sums of squares of e. As an exercise, guess the result of averaging the sums of Truee for
many samples (say 20) and then check what actually happens.
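If the Excel file is not at hand, the same point can be reproduced with a short script of ours (numpy assumed; the coefficient values are arbitrary and only mimic the b2 = 0 case):

    import numpy as np

    rng = np.random.default_rng(11)
    b0_true, b1_true, b2_true, sigma = 5.0, 2.0, 0.0, 1.5   # true model, quadratic term set to 0
    X = np.repeat(np.arange(0, 50, 10), 4)                  # replicate Y values at each X

    eps = rng.normal(0, sigma, X.size)                      # true errors (unknown in practice)
    Y = b0_true + b1_true * X + b2_true * X**2 + eps

    b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean())**2).sum()
    b0 = Y.mean() - b1 * X.mean()
    e = Y - (b0 + b1 * X)                                   # residuals ("Totale")

    print(e.sum())      # always ~0, forced by least squares
    print(eps.sum())    # sum of true errors ("Truee") is not necessarily 0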
2:2.8.3 Second point: the estimated parameters are correlated random variables
Perform a few simulations and copy the values of b0 and b1 to an area of the worksheet that is not used.
For this, recalculate the sheet to get a new random sample, select the range D31:E31, Edit Copy, select a blank
cell with 20 empty spaces below it and Edit Paste Special… Values. Repeat the procedure more than 20 times.
(You can write a macro to do this for you). After you have more than 20 pairs b0-b1, do a scatter plot of b0 vs.
b1. You should observe that they seem to be correlated. Is the correlation positive or negative? Could it have
the opposite sign?
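Instead of copying values by hand, the same experiment can be sketched in a few lines (ours, numpy assumed, arbitrary true parameter values), which also hints at the answer to the question about the sign of the correlation:

    import numpy as np

    rng = np.random.default_rng(12)
    X = np.repeat(np.arange(0, 50, 10), 4)   # all X values positive, so Xbar > 0
    b0s, b1s = [], []

    for _ in range(500):                      # 500 simulated sampling events
        Y = 5.0 + 2.0 * X + rng.normal(0, 1.5, X.size)
        b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean())**2).sum()
        b0s.append(Y.mean() - b1 * X.mean())
        b1s.append(b1)

    # with Xbar > 0 the correlation is negative; centering X (Xbar = 0) would make it ~0
    print(np.corrcoef(b0s, b1s)[0, 1])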
Once you understand where the replicates of the estimated parameters come from, you can use the
following web site to do more numerous simulations quickly and to seriously study the correlation between
parameters. As an exercise, follow the link to:
http://www.kuleuven.ac.be/ucs/java/version2.0/Content_Regression.htm and click on “Histograms of slope and
intercept.” Study the effects of the settings, particularly sample size, on the histograms. Make sure you
understand the difference between Sample size and Number of samples. Sample size is the number of points
“measured” in each sample simulated. Number of samples is the number of times that the sampling is simulated.
You will obtain as many values for each estimated parameter as the number of samples. After you explore the
histograms, select the option to study the correlation between parameters and try to obtain correlations that are
both negative and positive by changing the parameters of the true model and the variance of the errors.