STAT 651
Lecture #19
Copyright (c) Bani K. Mallick
Topics in Lecture #19
• Are Y and X related?
• Inference about a population slope
• Residual plots to test for normality
Book Chapters in Lecture #19
• Chapter 11.3
Relevant SPSS Tutorials
• Simple linear regression
• Residual plots (they do it slightly differently from what I do, but their method is OK as well)
Lecture 18 Review: Linear Regression and Correlation
• Linear regression and correlation are aimed at understanding how two variables are related
• The variables are called Y and X
• Y is called the dependent variable
• X is called the independent variable
• We want to know how, and whether, X influences Y
Lecture 18 Review: Linear Regression and Correlation
• Let Y = GPA, X = height
• A linear prediction equation is a line, such as $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$
• The intercept of the line = $\hat{\beta}_0$
• The slope of the line = $\hat{\beta}_1$
Lecture 18 Review: Linear Regression and Correlation
• The basic tool of regression is the scatterplot
• This simply plots the data in a graph
• X is along the horizontal (or X) axis
• Y is along the vertical (or Y) axis
Lecture 18 Review: Linear Regression and Correlation
• The usual method, called least squares, tries to make the squared distance between the line and the actual data as small as possible
• The data are $Y_i$, for $i = 1, \ldots, n$
• Any line is $\hat{\beta}_0 + \hat{\beta}_1 X_i$
• Total squared distance is $\sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2$
• The slope & intercept are chosen to minimize this total squared distance
Lecture 18 Review: Linear Regression and Correlation
• Total squared distance is $\sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2$
• The slope & intercept are chosen to minimize this total squared distance
• The slope is $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$
Lecture 18 Review: Linear Regression and Correlation
• Total squared distance is $\sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2$
• The slope & intercept are chosen to minimize this total squared distance
• The intercept is $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
• This is algebra! The estimates are called the least squares estimates
• SPSS calculates these automatically
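The two formulas above are all that is needed to reproduce the least squares fit outside SPSS. Here is a minimal sketch in Python (SPSS is what the course actually uses); the height/GPA numbers are made-up illustration data, not the class data set.

```python
import numpy as np

# Hypothetical illustration data (not the class data set)
x = np.array([60.0, 63.0, 66.0, 69.0, 72.0, 75.0])  # heights
y = np.array([3.8, 3.5, 3.6, 3.1, 3.0, 2.9])        # GPAs

xbar, ybar = x.mean(), y.mean()
# Slope: sum of cross-products over sum of squared deviations of X
beta1_hat = np.sum((y - ybar) * (x - xbar)) / np.sum((x - xbar) ** 2)
# Intercept: Ybar minus slope times Xbar
beta0_hat = ybar - beta1_hat * xbar
print(beta0_hat, beta1_hat)  # np.polyfit(x, y, 1) returns the same slope and intercept
```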
Lecture 18 Review: Linear Regression and Correlation
• The population parameters $\beta_0$ and $\beta_1$ are simply the least squares estimates computed on all the members of the population, not just the sample
• Population parameters: $\beta_0$ and $\beta_1$
• Sample statistics: $\hat{\beta}_0$ and $\hat{\beta}_1$
Lecture 18 Review: Linear Regression and Correlation
• Formally speaking, the linear regression model says that Y and X are related: $Y = \beta_0 + \beta_1 X + \epsilon$
• The meaning of the line $\beta_0 + \beta_1 X$ is:
• Take the (sub)population all of whom have independent variable value X
• The mean of this (sub)population is $\beta_0 + \beta_1 X$
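In symbols, the statement above is the conditional-mean form of the model; it follows from $Y = \beta_0 + \beta_1 X + \epsilon$ once the errors are taken to average out to zero (an implicit assumption of the model):

$$E(Y \mid X) = \beta_0 + \beta_1 X$$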
Lecture 18 Review: Linear Regression and Correlation
$Y = \beta_0 + \beta_1 X + \epsilon$
• Assumption #1: A straight line really fits the data
• Assumption #2: The errors $\epsilon$ are at least vaguely normally distributed
• Assumption #3: The errors $\epsilon$ have somewhat the same variances
Inference About the Population Slope and Intercept
$Y = \beta_0 + \beta_1 X + \epsilon$
• If $\beta_1 \neq 0$ then we have a graph like this:
[Figure: a sloped line $\beta_0 + \beta_1 X$ plotted against X]
Inference About the Population Slope and Intercept
$Y = \beta_0 + \beta_1 X + \epsilon$
• If $\beta_1 \neq 0$ then we have a graph like this:
[Figure: the sloped line $\beta_0 + \beta_1 X$ against X, annotated: "This is the mean of Y for those whose independent variable is X"]
Inference About the Population Slope and Intercept
$Y = \beta_0 + \beta_1 X + \epsilon$
• If $\beta_1 = 0$ then we have a graph like this:
[Figure: a flat line $\beta_0 + \beta_1 X$ against X, annotated: "Note how the mean of Y does not depend on X: Y and X are independent"]
Linear Regression and Correlation
$Y = \beta_0 + \beta_1 X + \epsilon$
• If $\beta_1 = 0$ then Y and X are independent
• So, we can test the null hypothesis $H_0$: that Y and X are independent by testing $H_0: \beta_1 = 0$
• The p-value in regression tables tests this hypothesis
GPA and Height
[Scatterplot of Grade Point Average (GPA), 1.5 to 4.5, against Height in inches, 55 to 80]
Note how GPAs generally get lower as height increases: the data do not fall exactly on a line
Linear Regression and Correlation
• There is an ANOVA table summarizing the model

ANOVA(b), Model 1
              Sum of Squares   df   Mean Square      F      Sig.
Regression         1.954        1      1.954       7.881   .006(a)
Residual          24.294       98       .248
Total             26.247       99
a. Predictors: (Constant), Height in inches
b. Dependent Variable: Grade Point Average (GPA)
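As a sanity check on how the table's columns fit together: F is the ratio of the two mean squares, and "Sig." is the upper-tail probability of an F(1, n-2) distribution. A short sketch, assuming scipy is available:

```python
from scipy import stats

ms_regression = 1.954   # Mean Square for Regression, from the table
ms_residual = 0.248     # Mean Square for Residual
F = ms_regression / ms_residual   # about 7.88, matching the F column
p = stats.f.sf(F, 1, 98)          # about 0.006, matching the Sig. column
print(F, p)
```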
Linear Regression and Correlation
• Intercept = 5.529 & Slope = -0.0372 are in the "B" column

Coefficients(a), Model 1
                   Unstandardized       Standardized
                       B    Std. Error      Beta          t      Sig.
(Constant)           5.529     .897                     6.165    .000
Height in inches  -3.72E-02    .013        -.273       -2.807    .006
a. Dependent Variable: Grade Point Average (GPA)
Linear Regression and Correlation
$Y = \beta_0 + \beta_1 X + \epsilon$
• The standard deviation of the errors $\epsilon$ will be called $\sigma_\epsilon$
• This means that every subpopulation whose members share the same value of X has
• Mean = $\beta_0 + \beta_1 X$
• Standard deviation = $\sigma_\epsilon$
Linear Regression and Correlation
$Y = \beta_0 + \beta_1 X + \epsilon$
• The least squares estimate $\hat{\beta}_1$ is a random variable
• What does this mean?
Linear Regression and Correlation
$Y = \beta_0 + \beta_1 X + \epsilon$
• The least squares estimate $\hat{\beta}_1$ is a random variable
• What does this mean?
• If you do another experiment, you will get another least squares estimate $\hat{\beta}_1$
• We have to quantify its variability
• Sound familiar? Remember the mean (not the Maine)
Linear Regression and Correlation
• The least squares estimate $\hat{\beta}_1$ is a random variable
• Its standard deviation is $\mathrm{s.e.}(\hat{\beta}_1) = \dfrac{\sigma_\epsilon}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
Linear Regression and Correlation
• Recall that $\mathrm{s.e.}(\hat{\beta}_1) = \dfrac{\sigma_\epsilon}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
• Also recall that the sample variance of the X's is $\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$
• Thus, $\mathrm{s.e.}(\hat{\beta}_1) = \dfrac{\sigma_\epsilon}{\sqrt{(n-1) \times (\text{sample var of } X\text{'s})}}$
Linear Regression and Correlation
• Recall that $\mathrm{s.e.}(\hat{\beta}_1) = \dfrac{\sigma_\epsilon}{\sqrt{(n-1) \times (\text{sample var of } X\text{'s})}}$
• Thus, you can make the sample slope a more precise estimate of the population slope in two ways (see the sketch below):
• Increase the sample size (like, duh)
• Make the X's more variable
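A quick numerical illustration of those two levers, treating $\sigma_\epsilon$ as a known, assumed value rather than an estimate from data:

```python
import numpy as np

def se_slope(sigma_eps, n, sample_var_x):
    """s.e. of the slope: sigma_eps / sqrt((n-1) * sample var of X's)."""
    return sigma_eps / np.sqrt((n - 1) * sample_var_x)

print(se_slope(0.5, n=25, sample_var_x=4.0))    # baseline
print(se_slope(0.5, n=100, sample_var_x=4.0))   # bigger sample: smaller s.e.
print(se_slope(0.5, n=25, sample_var_x=16.0))   # more variable X's: smaller s.e.
```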
Linear Regression
• In order to compute a standard error for the least squares slope, we have to provide an estimate of the common standard deviation $\sigma_\epsilon$
• Define the residuals to be the difference between the actual data and the predicted line: $r_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$
• These can be calculated automatically in SPSS
Linear Regression
• Define the residuals to be the difference between the actual data and the predicted line: $r_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$
• The Sum of Squares due to Error (SSE), or Residual Sum of Squares (RSS), is $\mathrm{SSE} = \sum_{i=1}^{n} r_i^2$
Linear Regression
• The Sum of Squares due to Error (SSE), or Residual Sum of Squares (RSS), is $\mathrm{SSE} = \sum_{i=1}^{n} r_i^2$
• The Mean Squared Error (MSE) has n-2 degrees of freedom and equals $\mathrm{MSE} = \dfrac{\sum_{i=1}^{n} r_i^2}{n-2}$
Linear Regression
• The Mean Squared Error (MSE) is $\mathrm{MSE} = \dfrac{\sum_{i=1}^{n} r_i^2}{n-2}$
• The estimated s.d. is $s_\epsilon = \sqrt{\mathrm{MSE}}$
Linear Regression
• Remember, each value of X gives rise to a subpopulation
• Each subpopulation has (sub)population standard deviation $\sigma_\epsilon$
• The estimate of this standard deviation is $s_\epsilon$
Linear Regression and Correlation
• The least squares estimate $\hat{\beta}_1$ is a random variable
• Its estimated standard deviation is $\mathrm{s.e.}(\hat{\beta}_1) = \dfrac{s_\epsilon}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
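Chaining the last few slides together: residuals, SSE, MSE on n-2 degrees of freedom, $s_\epsilon$, and finally the estimated standard error of the slope. A minimal sketch with the same made-up data as before:

```python
import numpy as np

x = np.array([60.0, 63.0, 66.0, 69.0, 72.0, 75.0])  # hypothetical data
y = np.array([3.8, 3.5, 3.6, 3.1, 3.0, 2.9])
n = len(y)

beta1_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

r = y - (beta0_hat + beta1_hat * x)   # residuals
sse = np.sum(r ** 2)                  # SSE, a.k.a. RSS
mse = sse / (n - 2)                   # MSE, with n-2 degrees of freedom
s_eps = np.sqrt(mse)                  # estimate of sigma_eps
se_beta1 = s_eps / np.sqrt(np.sum((x - x.mean()) ** 2))
print(s_eps, se_beta1)
```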
Linear Regression and Correlation
• The $(1-\alpha)100\%$ confidence interval for the population slope is $\hat{\beta}_1 \pm t_{\alpha/2}(n-2)\,\mathrm{s.e.}(\hat{\beta}_1)$
• If the confidence interval is from 3 to 6, what does this mean?
Linear Regression and Correlation
• The $(1-\alpha)100\%$ confidence interval for the population slope is $\hat{\beta}_1 \pm t_{\alpha/2}(n-2)\,\mathrm{s.e.}(\hat{\beta}_1)$
• If the 95% confidence interval is from 3 to 6, what does this mean?
• The population slope is between 3 and 6 with 95% probability
• Are Y and X independent?
• No, since we have ruled out that $\beta_1 = 0$
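The interval is straightforward to reproduce by hand. Using the GPA output shown earlier ($\hat{\beta}_1$ = -0.0372, s.e. = 0.013, n = 100), a sketch assuming scipy:

```python
from scipy import stats

beta1_hat, se, n = -0.0372, 0.013, 100
t_crit = stats.t.ppf(0.975, n - 2)   # t_{.025}(98), about 1.98
lo, hi = beta1_hat - t_crit * se, beta1_hat + t_crit * se
print(lo, hi)   # about (-0.063, -0.011); SPSS's (-.064, -.011) up to rounding
```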
Linear Regression and Correlation
• You can test the null hypothesis that Y and X are independent, using the t-test
• The t-statistic in SPSS output is $t = \hat{\beta}_1 / \mathrm{s.e.}(\hat{\beta}_1)$
• The Type I error is $\alpha$
• You reject the null hypothesis if $|t| \ge t_{\alpha/2}(n-2)$
Linear Regression and Correlation
• In SPSS, you can get the interval as follows: "Analyze", "Regression", "Linear", ask for the "Statistics" option and accept "Confidence Intervals"
• While you are at it, accept the "Save" option and save the predicted values, residuals, Cook's distances, and leverage values
GPA and Height

ANOVA(b), Model 1
              Sum of Squares   df   Mean Square      F      Sig.
Regression         1.954        1      1.954       7.881   .006(a)
Residual          24.294       98       .248
Total             26.247       99
a. Predictors: (Constant), Height in inches
b. Dependent Variable: Grade Point Average (GPA)

What is the sample size? 100, since the df for residual is n-2
GPA and Height

Coefficients(a), Model 1
                       B    Std. Error   Beta       t      Sig.   95% CI for B: Lower   Upper
(Constant)           5.529     .897                6.165   .000          3.749           7.309
Height in inches  -3.72E-02    .013     -.273     -2.807   .006          -.064           -.011
a. Dependent Variable: Grade Point Average (GPA)

95% CI for the slope is -0.064 to -0.011. What does this mean? The population slope is between these limits with 95% probability. Are GPA and Height independent? NO: the interval excludes zero, so the population slope is negative
43
t-test

There are n-2 = 98 degrees of freedom for
residual (error)

The t-statistic for the slope is t = -2.807

The p-value is 0.006.

Take a = 0.05. Look up

Is

Yes since p = 0.006!!!
t a /2 (n  2)  t .025 (98)
t  t a /2 (n  2)  1.99 ?
Copyright (c) Bani K. Mallick
44
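Both the lookup and the p-value can be reproduced directly (a sketch assuming scipy):

```python
from scipy import stats

t_stat, df = -2.807, 98
t_crit = stats.t.ppf(1 - 0.05 / 2, df)   # t_{.025}(98), about 1.98
p = 2 * stats.t.sf(abs(t_stat), df)      # two-sided p-value, about 0.006
print(abs(t_stat) >= t_crit, p)          # True: reject H0 at alpha = 0.05
```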
Residuals
• You can check the assumption that the errors are normally distributed by constructing a q-q plot of the residuals
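Outside SPSS, one way to draw such a plot is scipy's probplot. A sketch with stand-in residuals; in practice you would use the residuals saved from the regression:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
r = rng.normal(size=100)   # stand-in residuals, for illustration only

# Points falling near the reference line suggest the errors are normal
stats.probplot(r, dist="norm", plot=plt)
plt.title("Normal q-q plot of residuals")
plt.show()
```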
Residuals of GPA on Height
Normal?
[Normal probability plot of the regression residuals, all data, Y = GPA, X = Height: Expected Cum Prob against Observed Cum Prob, both running from 0.00 to 1.00]
Stenosis Data, Healthy Kids
• Are Y = log(1+AVA) and X = BSA independent? No: P = .000

ANOVA(b), Model 1
              Sum of Squares   df   Mean Square        F       Sig.
Regression         6.786        1      6.786        217.543   .000(a)
Residual           2.121       68      3.119E-02
Total              8.907       69
a. Predictors: (Constant), Body Surface Area
b. Dependent Variable: Log(1 + Aortic Valve Area)
Stenosis Data, Healthy Kids
Normal?
[Normal probability plot of the residuals, Y = log(1+AVA), X = BSA: Expected Cum Prob against Observed Cum Prob, both running from 0.00 to 1.00]
Stenosis Data, Healthy Kids
CI for Slope?

Coefficients(a), Model 1
                       B    Std. Error   Beta       t       Sig.   95% CI for B: Lower   Upper
(Constant)           .167      .039                4.247    .000          .088            .245
Body Surface Area    .615      .042      .873     14.749    .000          .532            .698
a. Dependent Variable: Log(1 + Aortic Valve Area)
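The slope interval in the table can be checked from B and its standard error, with residual df = 68 taken from the earlier ANOVA table (a sketch assuming scipy):

```python
from scipy import stats

b, se, df = 0.615, 0.042, 68
t_crit = stats.t.ppf(0.975, df)           # about 2.00
print(b - t_crit * se, b + t_crit * se)   # about (.531, .699); the table's (.532, .698) up to rounding
```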