Stat 112 -- Notes 4
• Chapter 3.5
• Chapter 3.7
Teachers’ Salaries and Dating
• In U.S. culture, it is usually considered
impolite to ask how much money a person
makes.
• However, suppose that you are single and
are interested in dating a particular person.
• Of course, salary isn’t the most important
factor when considering whom to date but it
certainly is nice to know (especially if it is
high!)
• In this case, the person you are interested in
happens to be a high school teacher, so you
know a high salary isn’t an issue.
• Still you would like to know how much she or
he makes, so you take an informal survey of
11 high school teachers that you know.
Distributions: Salary
[Figure: distribution plot of Salary]
Moments
Mean              50881.818
Std Dev           6491.1968
Std Err Mean      1957.1695
upper 95% Mean    55242.664
lower 95% Mean    46520.973
N                 11
Based on this data, what can you conclude?
Absent any other information, the best guess for the teacher’s salary is the
mean salary, $50,882.
But it is likely that this estimate will not be correct.
To get an idea of how far off you might be, you can calculate the
standard deviation:
\[ s = \sqrt{\frac{\sum_{i=1}^{11} (y_i - \bar{y})^2}{n-1}} = \sqrt{\frac{421437378}{10}} = 6491.82 \]
The standard deviation is the “typical” amount by which an
observation deviates from mean.
Thus, your best estimate for your potential date’s salary is $50,882
but a typical estimate will be off by about $6,500.
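The standard deviation arithmetic can be reproduced in a few lines. This is a minimal sketch that uses only the sum of squared deviations and the sample size quoted in the formula; the variable names are illustrative.

```python
import math

# Values quoted in the notes: the sum of squared deviations from the
# mean salary, and the number of teachers surveyed.
sum_sq_dev = 421437378
n = 11

# Sample standard deviation: divide by n - 1, then take the square root.
s = math.sqrt(sum_sq_dev / (n - 1))
print(f"s = {s:.2f}")
```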
• You happen to know that the person you are
interested in has been teaching for 8 years.
• How can you use this information to better
predict your potential date’s salary?
• Regression Analysis to the Rescue!
• You go back to each of the original 11
teachers you surveyed and ask them for their
years of experience.
• Simple Linear Regression Model: \(E(Y|X) = \beta_0 + \beta_1 X\);
the distribution of Y given X is normal with
mean \(\beta_0 + \beta_1 X\) and standard deviation \(\sigma\).
[Figure: Bivariate Fit of Salary By Years of Experience. Scatterplot of Salary (35000 to 65000) against Years of Experience (0 to 12.5).]
Linear Fit
Salary = 40612.135 + 1686.0674 Years of Experience
Summary of Fit
RSquare                       0.545881
RSquare Adj                   0.495423
Root Mean Square Error        4610.93
Mean of Response              50881.82
Observations (or Sum Wgts)    11
• Predicted salary of your potential date, who has been
a teacher for 8 years = estimated mean salary for
teachers with 8 years of experience = 40612.135 + 1686.0674*8 ≈
$54,100
• How far off will your estimate typically be? Root
mean square error = Estimated standard deviation of
Y|X = $4,610.93.
• Notice that the typical error of your estimate of
teacher salary using experience, $4,610.93, is less
than that of using only information on mean teacher
salary, $6,491.20.
• Regression analysis enables you to better predict
your potential date’s salary.
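The prediction and the comparison of typical errors above can be checked with a short script. This is a minimal sketch; the coefficients, RMSE, and standard deviation are the values reported in the JMP output, and the function name `predict_salary` is only an illustrative choice.

```python
# Fitted least squares line from the JMP output in the notes.
intercept = 40612.135
slope = 1686.0674  # estimated increase in mean salary per year of experience

def predict_salary(years):
    """Estimated mean salary for teachers with the given years of experience."""
    return intercept + slope * years

prediction = predict_salary(8)

# Typical prediction error with and without using experience.
sd_salary = 6491.20   # standard deviation of salary (predicting with the mean only)
rmse = 4610.93        # root mean square error of the regression
reduction = 1 - rmse / sd_salary

print(f"Predicted salary at 8 years: ${prediction:,.0f}")
print(f"Typical error shrinks by about {reduction:.0%}")
```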
R Squared
Summary of Fit
RSquare                   0.545881
RSquare Adj               0.495423
Root Mean Square Error    4610.93
• How much better predictions of your potential date’s
salary does the simple linear regression model
provide than just using the mean teacher’s salary?
• This is the question that R squared addresses.
• R squared: Number between 0 and 1 that measures
how much of the variability in the response the
regression model explains.
• R squared close to 0 means that using regression for
predicting Y|X isn’t much better than using the mean of Y;
R squared close to 1 means that regression is much
better than the mean of Y for predicting Y|X.
R Squared Formula
• \[ R^2 = \frac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}} \]
• Total sum of squares = \(\sum_{i=1}^{n} (Y_i - \bar{Y})^2\) = the
sum of squared prediction errors for using
sample mean of Y to predict Y
• Residual sum of squares = \(\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\),
where \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\) is the prediction of \(Y_i\)
from the least squares line.
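The raw salaries are not listed in these notes, so the following sketch recovers the two sums of squares from the reported summary statistics instead of from the data. It assumes the usual identities: total sum of squares = (n - 1) times the squared standard deviation of Y, and residual sum of squares = (n - 2) times the squared RMSE for a simple linear regression.

```python
n = 11
sd_salary = 6491.1968   # Std Dev of Salary from the Distribution output
rmse = 4610.93          # Root Mean Square Error from the Summary of Fit

# Recover the sums of squares from the summary statistics (assumed identities).
total_ss = (n - 1) * sd_salary ** 2      # sum of (Y_i - Ybar)^2
residual_ss = (n - 2) * rmse ** 2        # sum of (Y_i - Yhat_i)^2

# R squared is the share of the total sum of squares removed by the regression.
r_squared = (total_ss - residual_ss) / total_ss
print(f"R squared = {r_squared:.4f}")
```

The result can be compared with the RSquare entry in the Summary of Fit.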
What’s a good R squared?
• A good \(R^2\) depends on the context. In precise
laboratory work, \(R^2\) values under 90% might be
too low, but in social science contexts, when a
single variable rarely explains a great deal of
variation in response, \(R^2\) values of 50% may be
considered remarkably good.
• The best measure of whether the regression
model is providing predictions of Y|X that are
accurate enough to be useful is the root mean
square error, which tells us the typical error in
using the regression to predict Y from X.
More Information About Your
Potential Date’s Salary:
Prediction Intervals
• From the regression model, you predict that your
potential date’s salary is $54,100 and the typical error
you expect to make in your prediction is $4,611.
• Suppose you want to know an interval that will contain
your date’s salary most of the time (say 95% of the time).
• We can find such a prediction interval by using the
fact that under the simple linear regression model,
the distribution of Y|X is normal. Here, the
subpopulation of teachers with 8 years of experience
has a normal distribution with estimated mean
$54,100 and estimated standard deviation $4,611.
Prediction Interval
• A 95% prediction interval has the property
that if we repeatedly take samples \(y_1, \ldots, y_n\)
from a population with the simple
regression model, where \(x_1, \ldots, x_n\) are fixed
at their current values, and then sample \(y_p\)
with \(x = x_p\), the prediction interval will
contain \(y_p\) 95% of the time.
• Best prediction of \(y_p\): \(\hat{Y}_p = \hat{E}(Y \mid X = X_p) = b_0 + b_1 X_p\)
• \[ s_p = RMSE \sqrt{1 + \frac{1}{n} + \frac{(X_p - \bar{X})^2}{(n-1) s_X^2}}, \qquad s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2. \]
95% Prediction Interval: \(\hat{Y}_p \pm t_{.025, n-2} \, s_p\)
Comment: For large n, the 95% prediction interval is approximately
\(\hat{Y}_p \pm 2 \cdot RMSE\)
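The repeated-sampling property of the prediction interval can be checked by simulation. The sketch below is illustrative only: the "true" parameters and the eleven x values are hypothetical (the notes do not list the raw data), and the critical value 2.262 is the t value with 9 degrees of freedom.

```python
import math
import random

random.seed(112)

# Assumed "true" model for the simulation (hypothetical values chosen to
# resemble the fitted line in the notes).
beta0, beta1, sigma = 40612.0, 1686.0, 4611.0
xs = [2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10]   # hypothetical years of experience
x_p = 8                                    # x value for the new observation
t_crit = 2.262                             # t_{.025, n-2} for n = 11 (9 df)

n = len(xs)
x_bar = sum(xs) / n
sxx = sum((x - x_bar) ** 2 for x in xs)    # equals (n - 1) * s_X^2

def prediction_interval(ys):
    """Fit the least squares line, then return Yhat_p -/+ t * s_p at x_p."""
    y_bar = sum(ys) / n
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    rmse = math.sqrt(rss / (n - 2))
    s_p = rmse * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / sxx)
    y_hat_p = b0 + b1 * x_p
    return y_hat_p - t_crit * s_p, y_hat_p + t_crit * s_p

reps = 5000
hits = 0
for _ in range(reps):
    # New sample y_1..y_n at the fixed x's, plus one new y_p at x_p.
    ys = [random.gauss(beta0 + beta1 * x, sigma) for x in xs]
    y_new = random.gauss(beta0 + beta1 * x_p, sigma)
    lo, hi = prediction_interval(ys)
    hits += lo <= y_new <= hi

coverage = hits / reps
print(f"Empirical coverage: {coverage:.3f}")
```

If the model assumptions hold, the empirical coverage should land close to 0.95.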
Prediction Interval for Your Date’s
Salary
• Suppose your date has 8 years of
experience. \(\hat{Y}_p = 40612.14 + 1686.07 \cdot 8 = 54100.7\)
\[ s_p = RMSE \sqrt{1 + \frac{1}{n} + \frac{(X_p - \bar{X})^2}{(n-1) s_X^2}} = 4610.93 \sqrt{1 + \frac{1}{11} + \frac{(8 - 6.09)^2}{10 \cdot 2.844^2}} = 4610.93 \sqrt{1.1360} = 4914.51 \]
95% Prediction Interval:
\(\hat{Y}_p \pm t_{.025, n-2} \, s_p = 54100.7 \pm 2.262 \cdot 4914.51 =\)
(42984.08, 65217.32)
Your date’s salary will be in the range
(42984.08, 65217.32) most of the time.
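The prediction interval calculation can be reproduced step by step. This is a minimal sketch using the rounded inputs from the notes (mean 6.09 and standard deviation 2.844 for years of experience, critical value 2.262); the variable names are illustrative. It applies the formula for the standard error of prediction with the square root taken over the whole bracketed term.

```python
import math

# Inputs taken from the JMP output in the notes.
intercept, slope = 40612.135, 1686.0674
rmse = 4610.93
n = 11
x_bar, s_x = 6.09, 2.844     # mean and standard deviation of Years of Experience
t_crit = 2.262               # t_{.025, n-2} with n - 2 = 9 degrees of freedom

x_p = 8
y_hat_p = intercept + slope * x_p

# Standard error of prediction; the square root covers the whole bracket.
s_p = rmse * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / ((n - 1) * s_x ** 2))

lower = y_hat_p - t_crit * s_p
upper = y_hat_p + t_crit * s_p
print(f"s_p = {s_p:.2f}")
print(f"95% prediction interval: ({lower:.2f}, {upper:.2f})")

# Large-n shortcut from the notes: Yhat_p -/+ 2 * RMSE.
print(f"Approximate interval: ({y_hat_p - 2 * rmse:.2f}, {y_hat_p + 2 * rmse:.2f})")
```

With only 11 observations, the shortcut interval is noticeably narrower than the exact one, which is why the notes reserve it for large n.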
We obtain \(\bar{X}\) and \(s_X^2\) from Analyze, Distribution on the X variable.
Distributions: Years of Experience
[Figure: distribution plot of Years of Experience, axis from 0 to 12.5]
Moments
Mean              6.0909091
Std Dev           2.8444523
Std Err Mean      0.8576346
upper 95% Mean    8.0018382
lower 95% Mean    4.17998
N                 11
Prediction Intervals in JMP
• After using Fit Line, click the red triangle next to
Linear Fit and click Confid Curves Indiv.
[Figure: Salary (35000 to 65000) against Years of Experience (0 to 12.5), shown after selecting Confid Curves Indiv]
• Use the crosshair tool (under Tools) to find the
exact prediction interval for a particular x value.
Association vs. Causality
• A high \(R^2\) means that x has a strong linear
relationship with y; there is a strong
association between x and y. It does not
imply that x causes y.
• Alternative explanations for high \(R^2\):
– Reverse is true. Y causes X.
– There may be a lurking (confounding) variable
related to both x and y which is the common
cause of x and y
Salary of Presbyterian Ministers in
[Figure: Bivariate Fit of Salary of Presbyterian Ministers in MA By Price of Rum. Scatterplot with points labeled by year (1886, 1926, 1954, 1982, 1998); x-axis: Price of Rum.]
Are the Presbyterian ministers benefiting from the rum trade or
supporting it?
Example
• A community in the Philadelphia area is
interested in how crime rates affect property
values. If low crime rates increase property
values, the community may be able to cover the
costs of increased police protection by gains in
tax revenues from higher property values. Data
on the average housing price and crime rate
(per 1000 population) for communities in
Pennsylvania near Philadelphia for 1996 are
shown in housecrime.JMP.
Bivariate Fit of HousePrice By CrimeRate
[Figure: scatterplot of HousePrice against CrimeRate]
[Figure: residual plot, Residual against CrimeRate]
[Figure: Distributions, Residuals HousePrice]
Summary of Fit
RSquare                       0.184229
RSquare Adj                   0.175731
Root Mean Square Error        78861.53
Mean of Response              158464.5
Observations (or Sum Wgts)    98
Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    225233.55    16404.02     13.73      <.0001
CrimeRate    -2288.689    491.5375     -4.66      <.0001
Questions
1. Can you deduce a cause-and-effect
relationship from these data? What are
other explanations for the association
between housing prices and crime rate
other than that high crime rates cause
low housing prices?
2. Does the simple linear regression model
appear to hold?
Extrapolation
• When constructing estimates of \(E(Y \mid X_{new})\) or
predicting individual values of a dependent variable
based on \(x_{new}\), caution must be used if \(x_{new}\) is
outside the range of the observed x’s. The data
do not provide information about whether the
simple linear regression model continues to hold
outside of the range of the observed x’s.
• Example: The crime rate in Center City
Philadelphia is 366.1. Does the simple linear
regression model fit from housecrimerate.JMP
provide an accurate prediction of the average
house price in Center City?
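Plugging the Center City crime rate into the fitted line shows what extrapolation can produce. This is a minimal sketch using the coefficients from the Parameter Estimates output; the function name `predict_house_price` is illustrative. Judging by the plot axes, the observed crime rates run up to roughly 70, so 366.1 lies far outside the data.

```python
# Fitted line from the Parameter Estimates output for the house price data.
intercept = 225233.55
slope = -2288.689        # change in average house price per unit of crime rate

def predict_house_price(crime_rate):
    """Average house price predicted by the fitted line at a given crime rate."""
    return intercept + slope * crime_rate

center_city = 366.1
prediction = predict_house_price(center_city)
print(f"Predicted average house price: {prediction:,.2f}")
```

The fitted line returns a negative average house price at this crime rate, which is impossible and signals that the linear model cannot be trusted this far beyond the observed x’s.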