Presentation 13
Regression Analysis
Regression




In Chapter 15, we looked at associations between two categorical
variables. We will now focus on relationships between two continuous
variables.
Regression is used to describe a relationship between a quantitative
response variable and one or more quantitative predictor variables. In
this class we discuss simple linear regression, which describes a
linear relationship between a single response Y and a single predictor X.
The approach to this problem is to determine an equation by which the
average value of a particular random variable (Y) can be estimated based
on the values of the other variable (X). This problem is called
regression.
Example: Is there a relationship between X = the concentration of iron
in the diet and Y = the concentration of iron in the blood? If we can determine a
relationship (i.e. an equation) between the two variables, then we
might use this equation
1. To find the mean concentration of iron in the blood for individuals with
a specific concentration of iron in their diet, e.g. for X = 80 ppm.
2. To predict someone's concentration of iron in the blood based on the
concentration of iron in his or her diet.
Some terms and notation

Y = the response variable (dependent variable), which is of primary
interest.

X = the predictor variable (explanatory, or independent variable).

We want to find an equation for E(Y) in terms of X. We will call this
function the regression line of Y on X. This equation is of the form
E(Y) = β0 + β1x
where
 E(Y) is actually E(Y|X=x), the expected value of Y for
individuals in the population with the same particular value of X.
 β0 is the intercept of the straight line (i.e. the value of E(Y) for x = 0).
 β1 is the slope of this line. (When do we have β1 = 0?)
Once we know the slope and the intercept, then, for a given value of X
we can obtain the expected value of Y. However, we cannot know the
values of β0 and β1 (they are population parameters).
Our goal is to estimate the parameters of the regression line using the
observed data (x1,y1), …, (xn,yn).
What is the first thing we need to check?
 We first need to determine if it is appropriate to use the linear
regression model. One way to check this is to plot the observed data
pairs (x1,y1), …, (xn,yn).
 A plot such as this is called a scatterplot, and it can be obtained in
Minitab by clicking on Graph/Scatterplot.
The first plot indicates that there is a linear relationship between
these two variables and it is reasonable to proceed with simple
(linear) regression analysis. On the other hand, the data in the
second plot demonstrate a nonlinear relationship between the X
and the Y variables.
[Two scatterplots of Y vs X: the first shows a roughly linear pattern, the second a clearly curved pattern.]
Two assumptions about deviations from the
regression line

Furthermore, in order to make
statistical inferences about the
population, we need to make two
assumptions about how the y values vary from the population
regression line:
1. The general size of the deviation of
the y values from the line is the
same for all values of x (constant
variance assumption).
2. For any specific value of x, the
distribution of y values is normal.
Simple Regression Model for a Population

The model we are going to use is
y = Mean + Deviation
1. Mean: in the population this is the line E(Y) = β0 + β1x if the
relationship is linear.
2. Individual's deviation = y − mean, which is what is left
unexplained after accounting for the mean y value at that
individual's x value.
Putting all the assumptions together (linear relation between X and
Y, constant variance and normality) we have that:
yi = β0 + β1xi + εi = E(yi) + εi,
where the εi are assumed to follow a normal distribution with mean 0
and standard deviation σ (i.e. the same s.d. for all i's).
Regression Line in the Sample

Once we decide that the relationship between X and Y is linear, using
our data set we estimate the parameters of the regression
equation, β0 and β1. But how do we estimate the regression line?
Which line is "optimal"?
Method of Least Squares

We will use the method of least squares to obtain the estimates of
β0 and β1 (i.e. to specify the line).
The idea behind this method is to choose the line that comes as
close as possible to all the data points simultaneously.
The estimates of these parameters are denoted by b0 and b1
respectively, and the estimated regression line is
ŷ = b0 + b1x
where ŷ (y-hat) is the estimated value of y for X = x,
b0 = the sample intercept of the linear regression line, and
b1 = the sample slope of the linear regression line.
Method of LS - Deviations from the
Regression Line

The distance between an actual data point yi and the estimated
regression line is called a residual (or error). Thus, for an
observation yi in the sample, the residual is
ei = yi − ŷi = yi − (b0 + b1xi),
where xi is the value of the explanatory variable for the observation.

Therefore, we have a residual for each data point, and they are
denoted e1, e2, …, en.
The method of least squares finds the values of b0 and b1
minimizing the sum of the squared residuals,
SSE = Σ ei² = Σ [yi − (b0 + b1xi)]²,
where the sums run over i = 1, …, n.
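The least-squares estimates can be computed directly from the standard closed-form solution b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. The sketch below illustrates this on a small made-up data set (the x and y values are hypothetical, not from these notes):

```python
# Least-squares estimates of slope and intercept, computed from scratch.
def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx          # sample slope
    b0 = ybar - b1 * xbar   # sample intercept
    return b0, b1

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
b0, b1 = least_squares(x, y)

# SSE: the quantity the least-squares line minimizes
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
print(b0, b1, sse)
```

Any other choice of intercept and slope on this data set would give a larger sum of squared residuals.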
Example

The regression line on the previous plot is the dotted line.
Using the regression equation we can predict the average
response value (ŷ) when the predictor variable
assumes some value x. For example, in the dad's - son's
height problem, using the data, the estimated intercept and slope
turned out to be b0 = 3.41 and b1 = .97. We can estimate the
average height of a man whose dad is 70 in. tall by
ŷ = b0 + b1x = 3.41 + (.97)(70) = 71.31 in.
Notes…

For x = 0, ŷ = b0.
The estimated slope, b1, tells us how much of an increase
(or decrease, if negative) there is in ŷ when the x
variable increases by one unit.
You CANNOT use a regression line to predict the
response for observations that fall outside your predictor
range.
Standard deviation for regression

We can estimate the population standard deviation of y, σ, with
s = √( SSE / (n − 2) ) = √( Σ (yi − ŷi)² / (n − 2) ).
This is called the standard deviation for regression, and it
roughly measures the average deviation of the y values from the mean
(the regression line).
This is a useful statistic for describing individual variation in a
regression problem. A small s indicates that individual data points fall
close to the line; thus, it provides information about how accurately
the regression equation might predict y values.
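A quick sketch of the formula above. The SSE and sample size here are hypothetical, chosen only so the answer comes out round:

```python
import math

def regression_sd(sse, n):
    # s = sqrt(SSE / (n - 2)): standard deviation for regression
    return math.sqrt(sse / (n - 2))

# Hypothetical values: sqrt(1250 / 50) = 5
s = regression_sd(1250, 52)
print(s)
```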
Example - Height and Weight
Data:
x = height (in inches)
y = weight (in pounds)
of n = 43 male students.
Standard deviation
s = 24.00 (pounds):
This roughly measures, for any
given height, the general size
of the deviations of
individual weights from the
mean weight for that height.
Correlation


Correlation, r, between two quantitative variables is a number that
indicates the strength and the direction of a straight-line relationship.
Some properties of r are:
1. It is always between -1 and 1.
2. The magnitude of the correlation indicates the strength of the
relationship. A correlation of either -1 or +1 indicates that there is a
perfect linear relationship. A correlation of zero means no linear
relationship.
3. The sign of the correlation indicates the direction of the relationship.
A positive correlation indicates that when one variable increases the
other is likely to increase as well, and a negative correlation
indicates that when one variable increases the other is likely to
decrease.
4. Thus, the sign of r is the same as the sign of b1!
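These properties can be checked numerically. The sketch below computes r from scratch on a small hypothetical data set (the values are made up for illustration) and confirms that r lies in [-1, 1] and has the same sign as the least-squares slope b1:

```python
import math

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)   # correlation
b1 = sxy / sxx                   # least-squares slope

print(round(r, 4))
```

Here r is close to +1, matching the nearly perfect positive linear pattern in the data.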
Correlation Examples
[Four scatterplots of Y vs X:
R = -.056, No Relationship
R = .486, Weak Positive Relationship
R = -.921, Strong Negative Relationship
R = 1, Perfect Positive Relationship]
Proportion of Variation Explained by x

Squared correlation, r², is between 0 and 1 and indicates the
proportion of variation in the response explained by x:
r² = (SSTO − SSE) / SSTO
SSTO = sum of squares total = sum of squared differences between
the observed y values and ȳ (the sample mean of the y's).
SSE = sum of squared errors (residuals) = sum of squared
differences between the observed y values and the predicted values based on
the least squares line.
Example
Is there a relationship between X = the concentration of iron in the diet and Y = iron
in the blood?

Iron Diet   Iron Blood
99.02       31.67
73.29       18.71
95.73       23.71
66.49       23.23
59.14       20.79
98.91       25.91
76.40       22.45
…           …

Fitted line: Y = 5.95 + 0.194*X
[Scatterplot of IronBlood vs IronDiet with the fitted line.]
Regression Summary:
b0 = 5.95
b1 = 0.194   r = .839   r-sq = .703
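With the fitted line Y = 5.95 + 0.194*X from the summary above, we can answer the question posed at the start of the chapter: the estimated mean blood iron concentration for a diet concentration of X = 80.

```python
def blood_iron(diet_iron):
    # Fitted regression line from the summary above
    return 5.95 + 0.194 * diet_iron

# Estimated mean blood iron for X = 80
print(round(blood_iron(80), 2))
```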
Explanation of Terms
b1 = the sample slope. For every unit increase in X we expect Y to
increase by b1.
Example: For every increase of 1mg of iron in the diet we expect
blood iron to increase by 0.194 mg.
r = the correlation, which varies between -1 and 1. A correlation of -1 means
a perfect negative linear relationship, a correlation of +1 means a perfect
positive linear relationship, and a correlation of zero means no linear relationship.
Example: Our correlation of 0.839 indicates a strong positive
relationship.
r-sq = the percent of variation in the response variable that is
explained by the predictor.
Example: Our r-sq of .703 means that 70.3% of the individual
variation in blood iron concentration can be explained by iron in the
diet.
Example: Driver Age and Maximum
Legibility Distance of Highway Signs
A study to examine the relationship between age and the maximum
distance at which drivers can read a newly designed sign.
Average Distance = 577 − 3.01 × Age
Example: Age and Distance Cont.
SSE = 69334
SSTO = 193667
s = √( SSE / (n − 2) ) = √( 69334 / 28 ) = 49.76
r² = (SSTO − SSE) / SSTO = (193667 − 69334) / 193667 = .642
s = 49.76 and R-sq = 64.2%. Thus, the average distance from the regression
line is about 50 feet, and 64.2% of the variation in sign reading distances
is explained by age.
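The two computations above can be verified directly (n = 30 is inferred from the df of n − 2 = 28 used in the notes):

```python
import math

sse, ssto, n = 69334, 193667, 30  # n inferred from df = n - 2 = 28
s = math.sqrt(sse / (n - 2))       # standard deviation for regression
r_sq = (ssto - sse) / ssto         # proportion of variation explained

print(round(s, 2), round(r_sq, 3))
```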
Inference About Linear Regression
Relationship

Inference about a linear relationship can be evaluated through
inference about the slope, β1.
We will see how to create a CI for β1 and how to test whether or
not β1 is 0.
As in any other type of CI or hypothesis test, we will need a sample
estimate of the parameter of interest, β1, and the standard error of
this estimate.
These quantities are b1 and se(b1), and you do not need to know
their formulas or how to calculate them from the data. You will
need to know how to get them from the Minitab output and how
to use them.
The results of the CI or hypothesis test analysis are meaningful
only if the assumptions of the regression model are valid. We will
see how we can check them using different plots towards the end
of this chapter.
CI for Slope

A confidence interval for a population slope β1 is
b1 ± t* × s.e.(b1)
where the multiplier t* is the value in a t-distribution with
degrees of freedom df = n − 2 such that the area between −t*
and t* equals the desired confidence level. (Found from Table A.2.)

Interpretation: This CI gives a range for the expected increase in
y for a one-unit increase in x.
Testing For Significance

How do we test if there is a significant relationship between 2
quantitative variables?
Perform a test of the slope!
Ho: β1 = 0 (No relationship)
Ha: β1 ≠ 0 (There is a relationship)
Remember the test statistic formula:
t = (Sample Est. − Null Value) / Null Std. Error = (b1 − 0) / se(b1)
The test statistic has a t distribution with n − 2 df if the null hypothesis is
true. Thus, p-value = 2P(T(df = n−2) > |t|).
If the p-value is less than the significance level (usually .05), then reject the null
hypothesis. Conclude there IS a linear relationship between the two
variables, and say whether it is positive or negative depending on the
sign of b1.
Example: Age and Distance (cont)
95% CI for the Slope:
b1 ± t* × s.e.(b1) = −3.01 ± 2.05 × 0.4243
= −3.01 ± 0.87 = −3.88 to −2.14 feet
With 95% confidence, we can estimate that in the population of drivers represented
by this sample, the mean sign-reading distance decreases somewhere between
2.14 and 3.88 feet for each one-year increase in age.
If we consider the test H0: β1 = 0 vs Ha: β1 ≠ 0, we have
t = (b1 − 0) / s.e.(b1) = (−3.0068 − 0) / 0.4243 = −7.09, and p-value ≈ 0.000
The p-value suggests that the probability that the observed slope could be as far from
0 or farther, if there is no linear relationship in the population, is virtually 0. The
relationship in the sample is significant and represents a real relationship in the
population.
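The CI and test statistic above follow directly from b1 and se(b1). The sketch below reproduces the numbers in this example (b1 = −3.0068, se(b1) = 0.4243, t* = 2.05 for df = 28):

```python
b1, se_b1, t_star = -3.0068, 0.4243, 2.05  # values from the example

lo = b1 - t_star * se_b1   # lower CI endpoint
hi = b1 + t_star * se_b1   # upper CI endpoint
t_stat = (b1 - 0) / se_b1  # test statistic for H0: beta1 = 0

print(round(lo, 2), round(hi, 2), round(t_stat, 2))
```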
Prediction and Confidence Intervals

A 95% prediction interval estimates the value of y for an individual
with a particular value of x. This interval can be interpreted in two
equivalent ways:
1. It estimates the (central) 95% of the values of y for members of the
population with the specified value of x.
2. With probability .95, the response of a randomly selected individual
from the population with the specified value of x falls into the 95%
prediction interval.
 A 95% confidence interval for the mean estimates the mean value
of the response variable y, E(Y), for (all) individuals with a particular value
of x.
You do not need to know the formulas for these intervals, just how to
get them from the Minitab output and how to interpret them.

For a given x, which interval is wider, the PI or the CI?
Example: Age and Distance (cont)
Probability is 0.95 that a randomly selected …
21-year-old will read the sign at somewhere between 407 and 620 feet.
30-year-old will read the sign at somewhere between 381 and 592 feet.
45-year-old will read the sign at somewhere between 338 and 545 feet.
With 95% confidence, we can estimate that the mean reading distance of ...
21-year-old is somewhere between 482 and 546 feet.
30-year-old is somewhere between 460 and 513 feet.
45-year-old is somewhere between 422 and 461 feet.
How to Check Conditions for
Simple Linear Regression
1. The relationship must be linear. If a scatterplot of X and
Y shows an obviously curved relationship, then this assumption is
violated.
2. There should not be any extreme outliers. Check the scatterplot
of X and Y for extreme outlying values.
3. Constant variance: the standard deviation of the values of y from
the fitted line is the same regardless of the x value. Check this
with a scatterplot of residuals versus x. What should it look like?
4. The residuals are normally distributed. Check this with a
histogram of the residuals. What should it look like? - This
condition can be relaxed if the sample size is large.
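Conditions 3 and 4 are both checked with the residuals. A minimal sketch of computing them (the data and fitted line here are hypothetical); in practice you would plot these values against x for condition 3 and as a histogram for condition 4:

```python
# Hypothetical data and fitted line, for illustration only
x = [1, 2, 3, 4]
y = [3.1, 4.9, 7.2, 8.8]
b0, b1 = 1.0, 2.0  # assumed fitted intercept and slope

# Residual: observed y minus fitted value
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print([round(e, 1) for e in residuals])
```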
Detailed Example:
Suppose we are interested in the relationship between high
school GPA and the amount of sleep a student gets. For 100
students we record their GPA and average hours of sleep.
A. Fit a simple linear regression line to the data.
B. Check the conditions for a hypothesis test and CI for the slope.
C. Test to see if there is a significant relationship between the 2
variables.
D. Construct and interpret a 95% CI for the slope.
E. Suppose a student gets 10 hours of sleep. What would their
expected GPA be? Is this a good estimate? Explain in terms of r-sq.
F. Suppose a student gets 18 hours of sleep. Can we predict the
GPA of this student using the regression equation?
A. MINITAB: Fitted Line Plot
GPA = 2.50472 + 0.0615188 Sleep (Hours)
S = 0.293792   R-Sq = 8.0%   R-Sq(adj) = 7.0%
From the output: b0 = 2.50, b1 = 0.0615, S = 0.29379, R = 0.282, R-Sq = 8.0%.
[Fitted line plot of GPA versus Sleep (Hours).]
B. Check Conditions
1. From the scatterplot, the relationship seems reasonably linear.
2. From the scatterplot, it doesn't seem like there are any extreme
outliers.
3. The variance seems constant along X.
4. The residuals are approximately normal.
[Histogram of the residuals (RESI1), roughly bell-shaped.]
C. Hypothesis Test for Slope:
Regression Analysis: GPA versus Sleep (Hours)
The regression equation is
GPA = 2.50 + 0.0615 Sleep (Hours)

Predictor   Coef      SE Coef   T       P
Constant    2.5047    0.1708    14.67   0.000
Sleep       0.06152   0.02114   2.91    0.004

S = 0.2938   R-Sq = 8.0%   R-Sq(adj) = 7.0%

Based on the p-value of .004 we can REJECT the null
hypothesis. Conclude there is a significant positive
relationship between GPA and sleep.
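The T column in the output is just Coef / SE Coef, as in the test-statistic formula. Checking this for the Sleep row:

```python
coef, se_coef = 0.06152, 0.02114  # Sleep row of the Minitab output
t = coef / se_coef                # should match the T column
print(round(t, 2))
```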
D. 95% CI for Slope:
b1 ± t* SE(b1) = .0615 ± 1.99 * .02114
CI = (.0194,.104)
We are 95% confident that the true population
slope is between .0194 and .104. We are 95%
confident that for each additional hour of sleep
the expected GPA will increase between .0194
and .104 units.
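The CI arithmetic above can be checked directly:

```python
b1, se_b1, t_star = 0.0615, 0.02114, 1.99  # values from the output; t* as used in the notes

margin = t_star * se_b1
lo, hi = b1 - margin, b1 + margin
print(round(lo, 4), round(hi, 3))
```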
E. The fitted equation is: Y = 2.50 + 0.0615*X
The predicted (expected) GPA for 10 hours of sleep is:
2.5 + 0.0615*10 = 3.115
For someone who gets 10 hours of sleep, we expect a
GPA of 3.115.
This will NOT be a very good prediction because the r-squared value
is only .08. Sleeping hours explain only 8% of the variation in
GPA. Most of the variation in GPA is unaccounted for.
F. We cannot use the regression equation to predict the GPA of a
student who sleeps 18 hours per day because 18 hours is not in the
range of the values of X in the data set.
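Parts E and F together suggest a small guard in any prediction helper: predict only inside the observed range of X. A sketch, where the range 3.5 to 11.5 hours is an assumption read off the fitted line plot (use the actual min and max of the observed data):

```python
def predict_gpa(hours, x_min=3.5, x_max=11.5):
    # x_min/x_max: assumed observed range of Sleep (Hours), read off the plot
    if not (x_min <= hours <= x_max):
        raise ValueError("outside the observed X range: do not extrapolate")
    return 2.50 + 0.0615 * hours  # fitted equation from the notes

print(predict_gpa(10))   # inside the range: OK
# predict_gpa(18) would raise ValueError (part F)
```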
Exercise 14.47: Height and Foot Length.
a. There is a linear relationship with a
positive slope, and there is an
obvious outlier in the data.
b. With the outlier omitted from the data set, the Minitab regression output is:
The regression equation is
height = ____ + ____ foot

Predictor   Coef     SE Coef   T      P
Constant    30.150   6.541     4.61   0.000
foot        1.4952   0.2351    6.36   0.000

S = 2.029   R-Sq = 57.4%   R-Sq(adj) = 56.0%

What is the regression equation?
What is r?
What is se(b1)?
What is the test statistic for testing the hypothesis that the slope is zero?
Verify the value.
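As a sketch of the "verify the value" prompt: the T statistic for foot should equal Coef / SE Coef, and since R-Sq = 57.4% with a positive slope, r is the positive square root of 0.574.

```python
import math

coef, se_coef = 1.4952, 0.2351  # foot row of the output
t = coef / se_coef              # should match the T column
r = math.sqrt(0.574)            # positive root, since the slope is positive

print(round(t, 2), round(r, 2))
```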
Exercise 14.34
d. The regression line doesn't provide particularly accurate
predictions of height based on foot length. Notice the standard
deviation from the regression line is given in the output as s =
2.029 inches. This is roughly the average difference between
actual heights and predicted heights determined from the line.
e. The residual plot shows that a linear equation is probably
appropriate, there are no outliers, and it's reasonable to make the
constant variance assumption (although it may be that there is
less variation among residuals for small foot lengths than for large
foot lengths).