STAB22 Statistics I
Lecture 9
Linear Model

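Linear model equation:

$$\hat{y} = b_0 + b_1 x$$

Where:

intercept b0: ŷ value at x=0
slope b1: change in ŷ for a unit increase in x

For each data point:

true value: y
predicted value: ŷ
residual: y−ŷ

[Figure: scatterplot with fitted line, marking a true value (y), its predicted value (ŷ), the residual (y−ŷ), the intercept b0 and the slope b1]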
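To make these terms concrete, the short Python sketch below evaluates a linear model at a few points and computes the residuals. The coefficients and data are made-up illustrative values, not taken from the lecture.

```python
# Hypothetical coefficients and data, chosen only for illustration.
b0 = 2.0   # intercept: predicted value at x = 0
b1 = 0.5   # slope: change in predicted value per unit increase in x

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.5, 2.0, 3.5, 3.0]   # true (observed) values


def predict(x):
    """Predicted value y-hat = b0 + b1 * x."""
    return b0 + b1 * x


y_hats = [predict(x) for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, y_hats)]   # y - y-hat

for x, y, y_hat, e in zip(xs, ys, y_hats, residuals):
    print(f"x={x:.1f}  y={y:.1f}  y_hat={y_hat:.1f}  residual={e:+.1f}")
```

A positive residual means the point lies above the line, a negative one below it.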
Linear Regression

Best fitting line minimizes sum of squared
residuals (least squares criterion), given by:
$$b_1 = r \, \frac{s_y}{s_x} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x}$$

Where:
r = correlation between x and y
s_x, s_y = std. dev. of x, y
x̄, ȳ = sample mean of x, y

Note: regression line always passes through point (x̄, ȳ), i.e. through means of the variables
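As a rough check of the least-squares formulas, the Python sketch below computes r, b1 and b0 from a small data set, confirms that the line passes through the point of means, and compares the sum of squared residuals with that of slightly shifted lines. The data are hypothetical and the helper name `sse` is introduced here only for illustration.

```python
import statistics

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)   # sample SDs (n - 1)

# Sample correlation: sum of products of z-scores, divided by n - 1.
r = sum((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys)) / (n - 1)

b1 = r * s_y / s_x          # slope
b0 = y_bar - b1 * x_bar     # intercept


def sse(intercept, slope):
    """Sum of squared residuals for the line intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


print(f"r = {r:.4f}, b1 = {b1:.4f}, b0 = {b0:.4f}")
print(f"line at x_bar = {b0 + b1 * x_bar:.4f}, y_bar = {y_bar:.4f}")
print(f"SSE, least-squares line: {sse(b0, b1):.4f}")
print(f"SSE, slope nudged:       {sse(b0, b1 + 0.1):.4f}")
print(f"SSE, intercept nudged:   {sse(b0 + 0.1, b1):.4f}")
```

Any change to the slope or intercept should increase the sum of squared residuals, which is what the least squares criterion means.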
Example

Lung Cancer vs Smoking (# obsn=25)

Variable                    Mean     SD      r
Lung Cancer Mortality (y)   109.00   26.11   .7162
Smoking (x)                 102.88   17.20

[Figure: scatterplot of Mortality (vertical axis) against Smoking (horizontal axis)]

Find linear model

Predict lung cancer mortality for Smoking = 85
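One possible way to work this example is to plug the summary statistics into the least-squares formulas b1 = r·s_y/s_x and b0 = ȳ − b1·x̄. The Python sketch below does this; the variable names are chosen here for illustration, and the numbers are the ones from the example's summary table.

```python
# Summary statistics from the example's summary table.
y_bar, s_y = 109.00, 26.11    # lung cancer mortality (y)
x_bar, s_x = 102.88, 17.20    # smoking (x)
r = 0.7162                    # correlation between x and y

b1 = r * s_y / s_x            # slope
b0 = y_bar - b1 * x_bar       # intercept

x_new = 85
y_hat = b0 + b1 * x_new       # predicted mortality at smoking = 85

print(f"b1 = {b1:.3f}")
print(f"b0 = {b0:.3f}")
print(f"predicted mortality at smoking = {x_new}: {y_hat:.1f}")
```

With these inputs the fitted line is roughly ŷ = −2.85 + 1.09x, and the prediction at Smoking = 85 is about 89.6.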

Example (cont’d)

Let’s standardize data (i.e. take variable z-scores)

Fill in summary table

Variable      Mean   SD   r
z-score (y)
z-score (x)

[Figure: scatterplot of z-score(Mortality) against z-score(Smoking), both axes running from -2 to 2]

Find new linear model for z-scores
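One way to work this through uses only the least-squares formulas b1 = r·s_y/s_x and b0 = ȳ − b1·x̄. By construction z-scores have mean 0 and SD 1, and standardizing does not change the correlation, so (writing z_x and z_y for the standardized variables, a notation assumed here):

$$
\begin{aligned}
\bar{z}_x &= \bar{z}_y = 0, \qquad s_{z_x} = s_{z_y} = 1, \qquad r = 0.7162 \\
b_1 &= r \, \frac{s_{z_y}}{s_{z_x}} = 0.7162 \cdot \frac{1}{1} = 0.7162 \\
b_0 &= \bar{z}_y - b_1 \bar{z}_x = 0 - 0.7162 \cdot 0 = 0 \\
\hat{z}_y &= 0.7162 \, z_x
\end{aligned}
$$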
Regression & Correlation
Correlation coefficient (r) between two
variables essentially equals slope of linear
model of standardized values (z-scores)
Direction & strength of relationship ↔ sign &
magnitude of slope. E.g.
[Figure: three scatterplots of standardized data, both axes running from -2 to 2, with r=+0.05, r=+0.85 and r=−0.50]
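The link between correlation and the slope for standardized data can be checked numerically. The Python sketch below uses hypothetical data; the helpers `z_scores` and `least_squares` are illustrative names, and the slope is computed with the equivalent form Σ(x−x̄)(y−ȳ) / Σ(x−x̄)².

```python
import statistics

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def z_scores(values):
    """Standardize: subtract the mean, divide by the sample SD."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]


def least_squares(us, vs):
    """Return (intercept, slope) of the least-squares line of vs on us."""
    u_bar, v_bar = statistics.mean(us), statistics.mean(vs)
    slope = (sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
             / sum((u - u_bar) ** 2 for u in us))
    return v_bar - slope * u_bar, slope


zx, zy = z_scores(xs), z_scores(ys)

# Correlation: sum of products of z-scores, divided by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

b0_raw, b1_raw = least_squares(xs, ys)
b0_z, b1_z = least_squares(zx, zy)

print(f"r               = {r:.4f}")
print(f"slope (raw)     = {b1_raw:.4f}")
print(f"slope (z-score) = {b1_z:.4f}")
print(f"intercept (z)   = {abs(b0_z):.4f}")
```

The raw-data slope depends on the units of x and y, while the z-score slope matches r and the z-score intercept is 0 up to rounding.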
Regression Diagnostics

Can fit linear model to
any set of data


E.g. X, Y don’t need to be
linearly related
Want to check whether linear model offers
good description of data

Can use scatterplot; but residual plot often
provides a better picture
Residual Plot

Plot residuals (y−ŷ) against x
[Figure: scatterplot of y against x, next to the residual plot of (y−ŷ) against x with a horizontal reference line at 0]

Residual plot should be evenly scattered around 0, with no particular pattern

Note: mean of residuals is always 0
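The note about the mean of the residuals can be checked with a short Python sketch. The data are hypothetical, and the printed table is only a text stand-in for a residual plot (a plotting library could be used instead).

```python
import statistics

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 2.9, 3.1, 4.8, 5.1, 6.9]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)

# Least-squares slope and intercept (equivalent to b1 = r * s_y / s_x).
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# One row per point: the residual that would be plotted against x.
for x, e in zip(xs, residuals):
    print(f"x={x:.0f}  residual={e:+.3f}")

mean_residual = statistics.mean(residuals)
print(f"|mean of residuals| = {abs(mean_residual):.10f}")
```

The mean comes out as 0 up to floating-point rounding, whatever data are used, because the least-squares line passes through the point of means.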
Residual Plot

What can go wrong? Non-linearity
[Figure: scatterplot and residual plot for data with a non-linear relationship]

Easier to see in residual plot
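A minimal sketch of this situation, assuming purely illustrative data that follow y = x² exactly: the correlation is high, so a straight line looks like a reasonable fit, yet the residuals show a clear curved pattern.

```python
import math
import statistics

# Hypothetical non-linear data: y = x squared.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [x ** 2 for x in xs]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = s_xy / math.sqrt(s_xx * s_yy)   # correlation
b1 = s_xy / s_xx                    # least-squares slope
b0 = y_bar - b1 * x_bar             # least-squares intercept

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(f"r = {r:.3f}")
for x, e in zip(xs, residuals):
    print(f"x={x:.0f}  residual={e:+.1f}")
```

Here the residuals are positive at both ends and negative in the middle (5, 0, −3, −4, −3, 0, 5), a U-shape that signals non-linearity even though r is about 0.96.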
Residual Plot

Uneven dispersion
[Figure: scatterplot and residual plot for data whose spread changes with x]

If residual spread changes with x → linear model is not equally accurate across the range of x
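Uneven dispersion can also be illustrated numerically. The Python sketch below builds hypothetical data whose scatter grows with x (using a deterministic alternating term instead of random noise, so the result is reproducible) and compares the residual spread in the lower and upper halves of the x range. The helper `rms` is an illustrative name.

```python
import math
import statistics

xs = [float(x) for x in range(1, 11)]
# Hypothetical data: y = 2x plus an alternating term whose size grows with x.
ys = [2 * x + 0.2 * x * (1 if i % 2 == 0 else -1) for i, x in enumerate(xs)]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def rms(values):
    """Root mean square: a simple measure of typical size."""
    return math.sqrt(sum(v * v for v in values) / len(values))


low = [e for x, e in zip(xs, residuals) if x <= 5]
high = [e for x, e in zip(xs, residuals) if x > 5]
spread_low, spread_high = rms(low), rms(high)

print(f"residual spread for x <= 5: {spread_low:.3f}")
print(f"residual spread for x >  5: {spread_high:.3f}")
```

The spread for larger x is more than twice that for smaller x, so predictions from this line are less accurate at the upper end of the range.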
Residual Standard Deviation

If linear regression assumptions are satisfied,
can measure prediction accuracy using
residual SD (a.k.a. error SD)
s_e = √(mean square residuals):

$$s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$$

s_e measures average distance between true and predicted values (i.e. between data & linear model)
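A minimal Python sketch of the residual SD calculation, using hypothetical data: fit the least-squares line, then take the square root of the sum of squared residuals divided by n − 2.

```python
import math
import statistics

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Residual SD: note the n - 2 in the denominator.
sum_sq_resid = sum(e ** 2 for e in residuals)
s_e = math.sqrt(sum_sq_resid / (n - 2))

print(f"sum of squared residuals = {sum_sq_resid:.4f}")
print(f"s_e = {s_e:.4f}")
```

The result is in the same units as y, which makes it easy to read as a typical prediction error.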
Coefficient of Determination


How useful is linear regression model in
describing y-variable?
Compare y-data’s variation to residual
variation from linear model
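data variation: $\sum (y - \bar{y})^2$

model variation: $\sum (\hat{y} - \bar{y})^2$

[Figure: data points with the fitted line ŷ = b0 + b1 x, illustrating data variation and model variation]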
Coefficient of Determination

Coefficient of determination (R²): proportion of y-variation accounted for by linear model

$$R^2 = \frac{\text{model var.}}{\text{data var.}} = r^2$$

R² between 0 & 1
Equal to squared coefficient of correlation (r)

Proportion of y-variation left in residuals = 1−R²

residual variation: $\sum (y - \hat{y})^2$
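The relationships on this slide can be verified numerically. The Python sketch below, on hypothetical data, computes the data, model and residual variation for a least-squares fit and checks that model var. / data var. matches r² and that the residual share is 1 − R².

```python
import math
import statistics

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
y_hats = [b0 + b1 * x for x in xs]

data_var = sum((y - y_bar) ** 2 for y in ys)                    # sum (y - y_bar)^2
model_var = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)       # sum (y_hat - y_bar)^2
resid_var = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))  # sum (y - y_hat)^2

r = s_xy / math.sqrt(s_xx * data_var)   # correlation
r_squared = model_var / data_var        # coefficient of determination

print(f"data = {data_var:.2f}, model = {model_var:.2f}, residual = {resid_var:.2f}")
print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
print(f"residual share = {resid_var / data_var:.4f}, 1 - R^2 = {1 - r_squared:.4f}")
```

For a least-squares line the model and residual variation add up to the data variation, which is why the two proportions sum to 1.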
Example

Lung Cancer vs Smoking

Variable                    Mean     SD      r
Lung Cancer Mortality (y)   109.00   26.11   .7162
Smoking (x)                 102.88   17.20

[Figure: scatterplot of Mortality (vertical axis) against Smoking (horizontal axis)]

Find R²
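Since the coefficient of determination equals the squared correlation, one way to work this example from the summary table is:

$$
\begin{aligned}
R^2 &= r^2 = 0.7162^2 \approx 0.513 \\
1 - R^2 &\approx 0.487
\end{aligned}
$$

So about 51% of the variation in lung cancer mortality is accounted for by the linear model on smoking, and about 49% is left in the residuals.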