Chapters 8 & 9
Linear Regression & Regression Wisdom
Price of Homes Based on Size (in Square Feet)
Sold in Ames between Sep. 2004 and Oct. 2005
r = 0.8718945
[Scatterplot: home price vs. size in square feet]
Statistical Modeling





Statistical Model: An equation that fits the pattern between a response variable and possible explanatory variables, accounting for deviations from the model. (Simplest case: one quantitative response variable and one quantitative explanatory variable.)
Response Variable (Y): The quantitative outcome of a study.
Explanatory Variable (X): A quantitative variable that may explain or predict the response variable.
What is the best model for: Predicting weight (Y) from height (X)?
What is the best model for: Predicting blood pressure (Y) from age (X)?
Correlation and the Line
Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598SQFT
r = 0.8718945
[Scatterplot: home price vs. square feet with the fitted line]
Regression line
- Explains how the response variable (y) changes in relation to the explanatory variable (x)
- Use the line to predict the value of y for a given value of x
Regression line

- Need a mathematical formula
- We want to predict y from x
- The predicted values are called ŷ; the observed values are called y.
Which Line is Best?
What are some ways we can determine which model, out of all the possible models, is the "best" one?
- What are some ways we can numerically rank the different models (i.e., the different lines)?
Which Model is Best?
Price = -90.2458 + 0.1598SQFT (red)
Price = -300 + 0.3SQFT (blue)
Price = 0 + 0.1SQFT (green)
[Scatterplot: home price vs. square feet with the three candidate lines]
Regression line
- "Putting a hat on it" means we have predicted something from the model
- Look at the vertical distance: y − ŷ
- This is the amount of error in the regression line
- The goal is to find the line so that these errors are minimized.
Least squares regression
- Most commonly used regression line
- Makes the sum of the squared errors as small as possible
- Based on the statistics x̄, ȳ, sx, sy, r
Regression line equation
ŷ = b0 + b1x
where
b1 = r(sy / sx)
b0 = ȳ − b1x̄
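To make the arithmetic concrete, here is a minimal Python sketch of these two formulas. The function name is ours; the example numbers are the Kobe Bryant summary statistics used later in these slides.

```python
def least_squares_line(x_bar, y_bar, s_x, s_y, r):
    """Slope and intercept of the least squares line from five summary statistics."""
    b1 = r * s_y / s_x       # slope: b1 = r * (sy / sx)
    b0 = y_bar - b1 * x_bar  # intercept: b0 = y-bar - b1 * x-bar
    return b0, b1

# Kobe Bryant's summary statistics from the example below:
b0, b1 = least_squares_line(x_bar=27.04, y_bar=35.71, s_x=7.41, s_y=12.13, r=0.7293762)
print(f"y-hat = {b0:.2f} + {b1:.2f}x")  # prints: y-hat = 3.42 + 1.19x (the slides carry 3.436 from intermediate rounding)
```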
Regression line equation


- b1 = slope of the line. For every unit increase in x, y changes by the amount of the slope.
- Interpreting b1 (slope):
  - For every one unit increase in the explanatory variable, there will be, on average, a b1 unit(s) increase/decrease in the response variable.
  - For example: For every one square foot increase in size, on average, there will be a $159.80 increase in home price.
MEMORIZE THIS!!!!!
Regression line equation


- b0 = y-intercept of the line. The value of y when x = 0.
- Interpreting b0 (y-intercept):
  - When the explanatory variable = 0, on average, the value of the response variable = b0.
  - For example: When the sq. ft. of a home is 0, the price of the home will be -$90,245.80 on average.
MEMORIZE THIS!!!!!
BE CAREFUL. The interpretation of the intercept does not always make sense. When interpreting, be sure to mention if the interpretation does not make sense.
Example – Kobe’s Shooting
I visited CNNSI's website and checked out some of Kobe Bryant's personal scoring numbers. I looked at the number of times he shot the ball and his point total for each game so far this year.
- Let's come up with the regression equation for this data.
Kobe’s Shooting
r = 0.7293762
Form: Linear
[Scatterplot: Kobe's points per game vs. shots taken per game]
Strength: Moderate to Strong
Direction: Positive
Calculating the regression line
- Remember that:
  - Our explanatory variable (x) is the number of shots
  - Our response variable (y) is the number of points
- So the five numbers needed are:
  x̄ = 27.04, sx = 7.41, ȳ = 35.71, sy = 12.13, r = 0.7293762
Calculating the Regression Line
- Find the slope:
  b1 = r(sy / sx) = (0.7293762)(12.13 / 7.41) ≈ 1.19
- Find the intercept:
  b0 = ȳ − b1x̄ = 35.71 − b1(27.04) ≈ 3.436 (using the unrounded slope)
Calculating the regression line.
- Don't forget to write the equation:
  ŷ = 3.436 + 1.19x
- DON'T FORGET TO WRITE THE EQUATION IN THE CONTEXT OF THE PROBLEM:
  predicted points = 3.436 + 1.19(number of shots)
Interpretation
- How would we interpret b1?
  - For a one shot increase from Kobe Bryant, on average we would expect him to score 1.19 more points.
- How would we interpret b0?
  - If Kobe Bryant did not take one shot, then on average we would expect him to score 3.436 points.
Prediction
- Use the regression equation to predict y from x.
  - Ex. What is the predicted number of points when Kobe shoots 30 times in a game?
    ŷ = 3.436 + 1.19(30) = 39.136
  - Ex. What is the predicted number of points when Kobe shoots 55 times in a game?
    ŷ = 3.436 + 1.19(55) = 68.886
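A quick sketch of these predictions in Python, using the fitted coefficients from the slides (the function name is just for illustration):

```python
def predict_points(shots):
    """Predicted points from the fitted line y-hat = 3.436 + 1.19x."""
    return 3.436 + 1.19 * shots

print(predict_points(30))  # about 39.136
print(predict_points(55))  # about 68.886 -- but 55 shots is far beyond a typical game (extrapolation)
```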
Plotting the regression line
- Find two points on the line:
  - Ex. x = 30, y = 39 and x = 55, y = 69
  - If you are plotting by hand it is OK to round values
- Plot these two points on the graph
- Connect the points
- This is the regression line
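The same two-point recipe works in code; a sketch with matplotlib (the axis labels are ours):

```python
import matplotlib.pyplot as plt

# Two points on the fitted line y-hat = 3.436 + 1.19x
x1, x2 = 30, 55
y1, y2 = 3.436 + 1.19 * x1, 3.436 + 1.19 * x2

plt.plot([x1, x2], [y1, y2], color="red")  # connecting the two points draws the line
plt.xlabel("shots (x)")
plt.ylabel("points (y)")
plt.title("Least squares regression line")
plt.show()
```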
Plotting the Regression Line
[Scatterplot: points vs. shots with the regression line drawn through the two points]
Properties of regression line
- r is related to the value of b1:
  - r has the same sign as b1
  - A one standard deviation change in x corresponds to an r standard deviation change in y
- The regression line always goes through the point (x̄, ȳ)
Properties of regression line
- r²: the percent of variation in y that is explained by the least squares regression of y on x
  - The higher the value of r², the more the regression line explains the changes that occur in the y variable
  - The higher the value of r², the better the regression line fits the data
Properties of regression line
- r²: 0 ≤ r² ≤ 1, since −1 ≤ r ≤ 1
- Interpretation of r²:
  - r² is the percent of variation in the response variable that can be explained by the least squares regression of the response variable on the explanatory variable.
  - For Kobe's example: 53.2% of the variability in the number of points Kobe Bryant scores in a game can be explained by the LS regression of points per game on number of shots per game.
MEMORIZE THIS!!!!
Residuals
- The amount of variation in y not taken into account by the regression line
- Formula: residual = y − ŷ
- There is a residual for each data point
- The mean of the residuals is zero
Calculating Residuals – Kobe
ŷ = 3.436 + 1.19x
predicted points = 3.436 + 1.19(number of shots)
- Find the residual for the point (46, 81)
  - First find the predicted number of points for a game with 46 shots:
    ŷ = 3.436 + 1.19(46) = 58.176
  - Now find the residual:
    residual = y − ŷ = 81 − 58.176 = 22.824
Calculating Residuals – Kobe
- Find the residual for the point (26, 35)
  ŷ = 3.436 + 1.19(26) = 34.376
  residual = y − ŷ = 35 − 34.376 = 0.624
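The same residual arithmetic as a short Python sketch (the helper name is ours):

```python
def residual(x, y):
    """Observed minus predicted, using the fitted line y-hat = 3.436 + 1.19x."""
    y_hat = 3.436 + 1.19 * x
    return y - y_hat

print(residual(46, 81))  # about 22.824
print(residual(26, 35))  # about 0.624
```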
Residual Plots
- Scatterplot of residuals:
  - Explanatory variable on the horizontal axis
  - Residuals on the vertical axis
  - Horizontal line at residual = 0
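A sketch of building such a plot with matplotlib; the data arrays here are placeholders, not the actual observations (which the slides don't list):

```python
import matplotlib.pyplot as plt

# Placeholder data standing in for real (shots, points) observations
x = [18, 22, 25, 28, 31, 35]
y = [22, 30, 34, 36, 42, 44]

b0, b1 = 3.436, 1.19  # fitted coefficients from the slides
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)     # explanatory variable on the horizontal axis
plt.axhline(0, color="gray")  # horizontal line at residual = 0
plt.xlabel("shots (x)")
plt.ylabel("residual")
plt.show()
```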
Residual Plots
[Residual plot]
Interpreting Residual Plots
- Is there a curved pattern?
  - This could mean a non-linear relationship
- Is there increasing spread about the line as x increases?
  - This could mean non-constant variance
- Is there decreasing spread about the line as x increases?
  - This could mean non-constant variance
Interpreting Residual Plots
- Points with large residuals:
  - These are probably outliers in the y direction
  - These will pull the regression line in the direction of the outlier (up or down)
- Extreme points in the x direction:
  - These are called influential points
  - They do not always show up in residual plots because the residual could be small
  - Removing the point could markedly change the regression line
Reading JMP Data
Bivariate Fit of BAC by # of Beers
[Scatterplot: BAC (0 to 0.2) vs. Beers (0 to 10)]
Reading JMP Data
Linear Fit
BAC = -0.011654 + 0.0180112 (# of Beers)
This is the regression line for the data.
The slope is 0.0180112; the y-intercept is -0.011654.
The response variable is BAC. The explanatory variable is the # of Beers.
Reading JMP Data
Summary of Fit
RSquare                       0.803536
RSquare Adj                   0.788424
Root Mean Square Error        0.020920
Mean of Response              0.076000
Observations (or Sum Wgts)    15
This gives some summary of the data.
RSquare = r² = (correlation)²
Root Mean Square Error = s
Mean of Response = ȳ
Observations = n
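These summary numbers can be reproduced with scipy's linregress; a sketch below, with made-up placeholder data since the slides don't list the 15 raw observations:

```python
from scipy import stats

# Placeholder (beers, BAC) data -- invented for illustration, not the real dataset
beers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
bac = [0.010, 0.030, 0.045, 0.070, 0.085, 0.095, 0.120, 0.140, 0.160]

fit = stats.linregress(beers, bac)
print(fit.intercept, fit.slope)  # JMP reported -0.011654 and 0.0180112 for the real data
print(fit.rvalue ** 2)           # JMP reported RSquare = 0.803536
```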
Reading JMP Data
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model       1       0.02327041      0.023270   53.1700     <.0001
Error      13       0.00568959      0.000438
C. Total   14       0.02896000
This is called the ANOVA table. It is another way to analyze the data. We aren't going to discuss this in this class.
Reading JMP Data
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   -0.011654   0.013179    -0.88     0.3926
#beers      0.0180112   0.002470     7.29     <.0001
This tells you what the y-intercept and slope are. It also gives the standard error for each of the estimates. If you were to form confidence intervals for the parameter estimates, you would need these values. We won't discuss that in this class.
Reading JMP Data
[Residual plot: residuals (−0.03 to 0.05) vs. Beers (0 to 10)]
Here is your residual plot. Check it to see if there are any
problems with linearity of the data and constant variance.
Example
[Scatterplot: Gesell score (60 to 120) vs. age at first word (10 to 40 months)]
Example
- Age at first word vs. Gesell score.
- Scatterplot: Weak negative linear relationship between the two variables. Possible outliers at (42, 57) and (17, 121).
- Regression: r = −0.64, r² = 40.96%.
  ŷ = 109.87 − 1.13x
Example
[Scatterplot: Gesell score vs. age at first word, with the fitted regression line]
Example
- Age at first word vs. Gesell score.
- Prediction:
  - When x = 17: ŷ = 109.87 − 1.13(17) = 90.66
  - When x = 42: ŷ = 109.87 − 1.13(42) = 62.41
- Residuals:
  - Point (17, 121): residual = 121 − 90.66 = 30.34
  - Point (42, 57): residual = 57 − 62.41 = −5.41
Example
[Residual plot: residuals (−10 to 30) vs. age at first word (10 to 40 months)]
Example
- Residual plot:
  - Outliers at x = 17 and x = 42
  - Small residual for x = 42
    - Could be influential
- Remove (42, 57) from the data:
  - The regression line changes markedly.
  - r = −0.33, r² = 10.89%.
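A small sketch of how one extreme-x point can drive the fit; the data below are invented for illustration (the slides don't list the raw Gesell data):

```python
from scipy import stats

x = [8, 10, 11, 12, 15, 17, 20, 42]      # 42 is extreme in the x direction
y = [100, 93, 104, 90, 85, 121, 95, 57]  # (42, 57) sits low and far to the right

with_pt = stats.linregress(x, y)
without_pt = stats.linregress(x[:-1], y[:-1])  # refit with the influential point removed

print(with_pt.slope, with_pt.rvalue)        # clearly negative, pulled by (42, 57)
print(without_pt.slope, without_pt.rvalue)  # much weaker fit once the point is gone
```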
Example
[Scatterplot: Gesell score vs. age at first word, refit with (42, 57) removed]
Outliers--What should you do?
- Make sure data points have been recorded correctly
- Collect more data
- Remove the outlier
- Examine collection techniques
- Examine outside influences
Cautions about regression
- Linear relationships only
- Not resistant
- Using averaged data:
  - Makes the relationship appear stronger
  - Taking an average removes variation
- Extrapolation:
  - Predicting y when the x value is outside the original data
Cautions about Regression
- Extrapolation:
  - Remember the data about home prices vs. the amount of sq. footage in the home.
  - The regression line we found, based on data collected from homes with 900 to 3,000 sq. ft., is
    price = −75.47 + 0.69(sq. ft.)
  - This would mean that if my home has no square footage, then I pay −$75,470.
  - If you must extrapolate, at least don't expect that your prediction will come true.
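The same caution as a two-line sketch in Python (coefficients from the slide's line, with price in thousands of dollars):

```python
def predicted_price(sqft):
    """Slide's line: price (in $1000s) = -75.47 + 0.69 * sq. ft., fit on 900-3,000 sq. ft. homes."""
    return -75.47 + 0.69 * sqft

print(predicted_price(1500))  # about 959.53 -- inside the range the line was fit on
print(predicted_price(0))     # -75.47, i.e. -$75,470 -- extrapolation gives nonsense
```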
Cautions about regression
- ASSOCIATION IS NOT CAUSATION!
  - A strong association between the explanatory and response variables does not mean that the explanatory variable causes the response variable.
Proving Causation
- Experiment:
  - Change the values of x and control for lurking variables.
- Not all problems can be solved by experiment:
  - Smoking causes lung cancer.
  - Living near power lines causes leukemia.
Proving Causation
- Lurking variable:
  - Has an important effect on the variables, but is not included in the study.
- Example:
  - Do taller people make more money? What do you think a lurking variable might be?
Proving Causation
- Proving smoking causes lung cancer:
  - The association is strong
  - The association is consistent
  - High doses are associated with stronger responses
  - The cause precedes the effect in time
  - The cause is plausible
Review
Number of Calories by Sugar Content (g) for 13 Cereals
Let's calculate the formula for this regression line:
[Scatterplot: calories (0 to 150) vs. sugar in grams (0 to 15)]
Review
- Let's review all the formulas we need:
  ŷ = b0 + b1x
  b1 = r(sy / sx)
  b0 = ȳ − b1x̄
  r = [1 / (n − 1)] Σ[(x − x̄)(y − ȳ)] / (sx sy)
  sy = √[Σ(y − ȳ)² / (n − 1)]  (and similarly for sx)
  ȳ = (1/n) Σy
Review
- Here are all the numbers you need:
  n = 13
  Σx = 94
  Σy = 1280
  Σ(x − x̄)(y − ȳ) = 1014.66
  Σ(y − ȳ)² = 6169.21
  Σ(x − x̄)² = 301.97
Review
- First, calculate sx and sy:
  sx = √[Σ(x − x̄)² / (n − 1)] = √(301.97 / 12) ≈ 5.02
  sy = √[Σ(y − ȳ)² / (n − 1)] = √(6169.21 / 12) ≈ 22.67
Review
- Second, calculate r:
  r = 1014.66 / [(13 − 1)(22.67)(5.02)] = 1014.66 / 1365.64 ≈ 0.743
- Third, calculate b1:
  b1 = (0.743)(22.67 / 5.02) ≈ 3.36
Review
- Fourth, calculate x̄ and ȳ:
  x̄ = 94 / 13 ≈ 7.23
  ȳ = 1280 / 13 ≈ 98.46
- Fifth, calculate b0 (we're almost done!!):
  b0 = 98.46 − 3.36(7.23) ≈ 74.17
Review
- Last, but definitely the most important, WRITE DOWN THE EQUATION IN THE CONTEXT OF THE PROBLEM:
  predicted calories = 74.17 + 3.36(sugar in grams)
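The whole review calculation, chained together as a Python sketch using the sums from the slides:

```python
from math import sqrt

n = 13
sum_x, sum_y = 94, 1280
sxy = 1014.66   # sum of (x - x_bar)(y - y_bar)
sxx = 301.97    # sum of (x - x_bar)^2
syy = 6169.21   # sum of (y - y_bar)^2

s_x = sqrt(sxx / (n - 1))            # about 5.02
s_y = sqrt(syy / (n - 1))            # about 22.67
r = sxy / ((n - 1) * s_x * s_y)      # about 0.743
b1 = r * s_y / s_x                   # about 3.36
x_bar, y_bar = sum_x / n, sum_y / n  # about 7.23 and 98.46
b0 = y_bar - b1 * x_bar              # about 74.17

print(f"predicted calories = {b0:.2f} + {b1:.2f}(sugar)")
```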
Review
- Interpret b1:
  - For every one gram increase in sugar, on average, the number of calories will increase by 3.36.
- Interpret r²:
  - About 55% of the variability in the number of calories in cereal can be explained by the LS regression of calories on sugar content.