Class 5: Thurs., Sep. 23 - University of Pennsylvania
• Example of using regression to make
predictions and understand the likely errors
in the predictions: salaries of teachers and
experience
• Normal distribution calculations
• R squared
• Checking the assumptions of the simple
linear regression model: residual plots.
Teachers’ Salaries and Dating
• In U.S. culture, it is usually considered impolite to
ask how much money a person makes.
• However, suppose that you are single and are
interested in dating a particular person.
• Of course, salary isn’t the most important factor
when considering whom to date but it certainly is
nice to know (especially if it is high!)
• In this case, the person you are interested in
happens to be a high school teacher, so you know
a high salary isn’t an issue.
• Still you would like to know how much she or he
makes, so you take an informal survey of 11 high
school teachers that you know.
Distributions: Salary
[Figure: histogram of Salary; axis from 35000 to 60000]
Moments
Mean              50881.818
Std Dev           6491.1968
Std Err Mean      1957.1695
upper 95% Mean    55242.664
lower 95% Mean    46520.973
N                 11
Based on this data, what can you conclude?
Absent any other information, best guess for teacher’s salary is the
mean salary, $50,882.
But it is likely that this estimate will not be correct.
To get an idea of how far off you might be, you can calculate the
standard deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{11} (y_i - \bar{y})^2}{n-1}} = \sqrt{\frac{421437378}{10}} = 6491.82$$
The standard deviation is the “typical” amount by which an
observation deviates from the mean.
Thus, your best estimate for your potential date’s salary is $50,882,
but a typical estimate will be off by about $6,500.
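The Moments output also lists a standard error and 95% limits for the mean. As a rough sketch of how those lines relate to the mean, standard deviation, and N, the calculation below recomputes them from the reported values. The critical value 2.228 is an assumption taken from a standard t table (0.975 quantile, 10 degrees of freedom); the variable names are illustrative only.

```python
import math

# Summary statistics copied from the Moments output.
n = 11
mean_salary = 50881.818
std_dev = 6491.1968

# Standard error of the mean: s / sqrt(n).
std_err_mean = std_dev / math.sqrt(n)

# Assumption: 2.228 is the 0.975 quantile of the t distribution with
# n - 1 = 10 degrees of freedom, read from a standard t table.
t_crit = 2.228

lower_95 = mean_salary - t_crit * std_err_mean
upper_95 = mean_salary + t_crit * std_err_mean

print(f"Std Err Mean:   {std_err_mean:.2f}")
print(f"lower 95% Mean: {lower_95:.0f}")
print(f"upper 95% Mean: {upper_95:.0f}")
```

Because the t value is rounded to three decimals, the limits should agree with the reported ones only to within about a dollar.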
• You happen to know that the person you are
interested in has been teaching for 8 years.
• How can you use this information to better predict
your potential date’s salary?
• Regression Analysis to the Rescue!
• You go back to each of the original 11 teachers
you surveyed and ask them for their years of
experience.
• Simple Linear Regression Model: $E(Y|X) = \beta_0 + \beta_1 X$;
the distribution of Y given X is normal with
mean $\beta_0 + \beta_1 X$ and standard deviation $\sigma$.
[Figure: Bivariate Fit of Salary By Years of Experience (scatterplot; x axis: Years of Experience, 0 to 12.5; y axis: Salary, 35000 to 65000)]
Linear Fit
Salary = 40612.135 + 1686.0674 Years of Experience
Summary of Fit
RSquare                       0.545881
RSquare Adj                   0.495423
Root Mean Square Error        4610.93
Mean of Response              50881.82
Observations (or Sum Wgts)    11
• Predicted salary of your potential date, who has
been a teacher for 8 years = estimated mean
salary for teachers with 8 years of experience =
40612.135 + 1686.0674*8 ≈ $54,100
• How far off will your estimate typically be? Root
mean square error = Estimated standard deviation
of Y|X = $4,610.93.
• Notice that the typical error of your estimate of
teacher salary using experience, $4,610.93, is less
than that of using only information on mean
teacher salary, $6,491.20.
• Regression analysis enables you to better predict
your potential date’s salary.
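The prediction and the comparison of typical errors can be reproduced in a few lines. This is a minimal sketch using the coefficients and summary values reported in the JMP output; the function name `predict_salary` is hypothetical.

```python
# Coefficients copied from the Linear Fit output.
intercept = 40612.135
slope = 1686.0674


def predict_salary(years_of_experience):
    """Estimated mean salary for a given number of years of experience."""
    return intercept + slope * years_of_experience


predicted = predict_salary(8)

std_dev_salary = 6491.1968  # typical error when using only the mean salary
rmse = 4610.93              # typical error when using the regression line

print(f"Predicted salary at 8 years: {predicted:.2f}")
print(f"Reduction in typical error:  {std_dev_salary - rmse:.2f}")
```

The unrounded prediction is a little above $54,100; the slides round it for convenience.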
More Information About Your
Potential Date’s Salary
• From the regression model, you predict that your potential
date’s salary is $54,100 and the typical error you expect to
make in your prediction is $4,611.
• Suppose you want an interval that will contain your date’s
salary most of the time (say 95% of the time).
What’s the chance that your date will make more than
$60,000? What’s the chance that your date will make less
than $50,000?
• We can answer these questions by using the fact that under
the simple linear regression model, the distribution of Y|X
is normal. Here, the subpopulation of teachers with 8 years
of experience is estimated to have a normal distribution
with mean $54,100 and standard deviation $4,611.
• 95% interval: For the subpopulation of teachers with 8
years of experience, about 95% of the salaries will be within
two SDs of the mean. An interval that will contain the salary
of a randomly chosen teacher with 8 years of experience 95% of
the time is: $54,100 ± 2*$4,611 = ($44,878, $63,322).
• What’s the probability that your date will make more than
$60,000? If you don’t have any additional information
about your date other than his or her number of years of
teaching, we can assume that your date is a random draw
from the subpopulation of teachers with 8 years of
teaching.
• According to the simple linear regression model, the
subpopulation of teachers with 8 years of experience is
estimated to have a normal distribution with mean $54,100
and standard deviation $4,611.
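As a quick check on the two-SD interval, the sketch below builds the interval from the estimated mean and standard deviation and then asks how much of a normal distribution falls inside it. It uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

# Estimated distribution of salary for teachers with 8 years of experience.
mean_8_years = 54100.0
sd_8_years = 4611.0
salary_dist = NormalDist(mu=mean_8_years, sigma=sd_8_years)

# Interval of two standard deviations on either side of the mean.
lower = mean_8_years - 2 * sd_8_years
upper = mean_8_years + 2 * sd_8_years

# Probability that a draw from the normal distribution lands in the interval.
coverage = salary_dist.cdf(upper) - salary_dist.cdf(lower)

print(f"Interval: ({lower:,.0f}, {upper:,.0f})")
print(f"Probability inside interval: {coverage:.4f}")
```

The coverage comes out slightly above 95%, which is why "two SDs" is a convenient rule of thumb rather than an exact 95% cutoff.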
Properties of the Normal
Distribution (Section 1.3)
• Suppose a variable Y has a normal distribution
with mean $\mu$ and standard deviation $\sigma$. Then
$Z = \frac{Y - \mu}{\sigma}$ follows a standard normal distribution.
• Then the probability that Y is greater than a
number c equals
$$P(Y > c) = P\left(\frac{Y - \mu}{\sigma} > \frac{c - \mu}{\sigma}\right) = P\left(Z > \frac{c - \mu}{\sigma}\right),$$
where Z has a standard normal distribution with
mean 0 and SD 1.
The probabilities for a standard normal
distribution can be found in Table A.
Review Section 1.3 on using the normal tables.
• Probability that a teacher with 8 years of
experience has salary > $60,000:
$$P(Y > 60{,}000) = P\left(\frac{Y - 54{,}100}{4{,}611} > \frac{60{,}000 - 54{,}100}{4{,}611}\right) = P(Z > 1.28) = 1 - P(Z < 1.28) = 1 - 0.8997 = 0.1003$$
• Probability that a teacher with 8 years of
experience has salary < $50,000:
$$P(Y < 50{,}000) = P\left(\frac{Y - 54{,}100}{4{,}611} < \frac{50{,}000 - 54{,}100}{4{,}611}\right) = P(Z < -0.89) = 0.1867$$
• Probability that a teacher with 8 years of
experience has salary between $52,000 and
$56,000:
$$P(52{,}000 < Y < 56{,}000) = P\left(\frac{52{,}000 - 54{,}100}{4{,}611} < \frac{Y - 54{,}100}{4{,}611} < \frac{56{,}000 - 54{,}100}{4{,}611}\right) = P(-0.46 < Z < 0.41) = P(Z < 0.41) - P(Z < -0.46) = 0.6591 - 0.3228 = 0.3363$$
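The three normal probabilities above can also be computed directly instead of being looked up in Table A. The following is a sketch using `statistics.NormalDist` from the Python standard library, with the estimated mean and standard deviation for teachers with 8 years of experience; the variable names are illustrative.

```python
from statistics import NormalDist

# Estimated distribution of salary for teachers with 8 years of experience.
salary_dist = NormalDist(mu=54100, sigma=4611)

# Standardized cutoff for $60,000, which the slides round to 1.28.
z_60k = (60000 - 54100) / 4611

p_above_60k = 1 - salary_dist.cdf(60000)
p_below_50k = salary_dist.cdf(50000)
p_between = salary_dist.cdf(56000) - salary_dist.cdf(52000)

print(f"z for $60,000:            {z_60k:.2f}")
print(f"P(Y > 60,000):            {p_above_60k:.4f}")
print(f"P(Y < 50,000):            {p_below_50k:.4f}")
print(f"P(52,000 < Y < 56,000):   {p_between:.4f}")
```

The results can differ from the table-based answers in the third or fourth decimal place, because the table calculation rounds each z value to two decimals before the lookup.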
R Squared
Summary of Fit
RSquare                   0.545881
RSquare Adj               0.495423
Root Mean Square Error    4610.93
• How much better predictions of your potential
date’s salary does the simple linear regression
model provide than just using the mean teacher’s
salary?
• This is the question that R squared addresses.
• R squared: Number between 0 and 1 that measures
how much of the variability in the response the
regression model explains.
• R squared close to 0 means that using regression
for predicting Y|X isn’t much better than using the
mean of Y. R squared close to 1 means that regression
is much better than the mean of Y for predicting
Y|X.
R Squared Formula
• $$R^2 = \frac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}$$
• Total sum of squares = $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ = the
sum of squared prediction errors for using the
sample mean of Y to predict Y
• Residual sum of squares = $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$,
where $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$ is the prediction of $Y_i$
from the least squares line.
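The formula can be checked against the JMP output without the raw data. The sketch below recovers both sums of squares from the reported summary values, assuming the usual definitions: Std Dev squared equals the total sum of squares divided by n - 1, and Root Mean Square Error squared equals the residual sum of squares divided by n - 2. Those two definitions are assumptions about how the reported numbers were computed, not something the slides state.

```python
# Summary values copied from the JMP output.
n = 11
std_dev = 6491.1968  # Std Dev of Salary
rmse = 4610.93       # Root Mean Square Error of the linear fit

# Assumed definitions: s^2 = TSS / (n - 1) and RMSE^2 = RSS / (n - 2).
total_ss = (n - 1) * std_dev ** 2
residual_ss = (n - 2) * rmse ** 2

r_squared = (total_ss - residual_ss) / total_ss

# Adjusted R squared compares the two variance estimates directly.
r_squared_adj = 1 - (residual_ss / (n - 2)) / (total_ss / (n - 1))

print(f"RSquare:     {r_squared:.4f}")
print(f"RSquare Adj: {r_squared_adj:.4f}")
```

Under these assumptions both values should agree with the Summary of Fit table up to rounding in the inputs.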
What’s a good R squared?
• As with correlation, a good R squared depends on the
context. In precise laboratory work, R squared values
under 90% might be too low, but in social science
contexts, where a single variable rarely explains a
great deal of the variation in the response, R squared
values of 50% may be considered remarkably good.
• The best measure of whether the regression model
is providing predictions of Y|X that are accurate
enough to be useful is the root mean square error,
which tells us the typical error in using the
regression to predict Y from X.
Checking the model
• The simple linear regression model is a great
tool, but its answers will only be useful if it is the
right model for the data. We need to check the
assumptions before using the model.
• Assumptions of the simple linear regression
model:
1. Linearity: The mean of Y|X is a straight line.
2. Constant variance: The standard deviation of
Y|X is constant.
3. Normality: The distribution of Y|X is normal.
4. Independence: The observations are
independent.
Checking that the mean of Y|X is
a straight line
1. Scatterplot: Look at whether the mean of
Y given X appears to increase or decrease
in a straight line.
[Figure: Bivariate Fit of Salary By Years of Experience (scatterplot; x axis: Years of Experience, y axis: Salary)]
[Figure: Bivariate Fit of Heart Disease Mortality By Wine Consumption (scatterplot; x axis: Wine Consumption, y axis: Heart Disease Mortality)]
Residual Plot
• Residuals: Prediction error of using
regression to predict $Y_i$ for observation i:
$res_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
• Residual plot: Plot with residuals on the y
axis and the explanatory variable (or some
other variable) on the x axis.
[Figure: Residual plot, Residual vs. Years of Experience]
[Figure: Residual plot, Residual vs. Wine Consumption]
• Residual Plot in JMP: After doing Fit Line, click
red triangle next to Linear Fit and then click Plot
Residuals.
• What should the residual plot look like if the
simple linear regression model holds? Under the
simple linear regression model, the residuals
$res_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$ should have approximately
a normal distribution with mean zero and a
standard deviation which is the same for all X.
• Simple linear regression model: Residuals should
appear as a “swarm” of randomly scattered points
about their mean (which is always zero).
• A pattern in the residual plot in which, for a certain
range of X, the residuals tend to be greater than
zero or tend to be less than zero indicates that the
mean of Y|X is not a straight line.
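To make the last point concrete, the sketch below fits a least squares line to a small, made-up data set whose mean is curved rather than straight, and then prints the sign of each residual. The data are hypothetical (they are not the mileage data from the slides), and the helper names `fit_line` and `residuals` are illustrative.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys):
    """Residuals Y_i minus the fitted value from the least squares line."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical curved data: y rises and then falls as x increases.
xs = list(range(10, 111, 10))
ys = [30 - 0.01 * (x - 60) ** 2 for x in xs]

res = residuals(xs, ys)
for x, r in zip(xs, res):
    sign = "+" if r > 0 else "-"
    print(f"x = {x:3d}  residual = {r:7.2f}  {sign}")
```

In this made-up example the residuals are negative at the low and high ends of X and positive in the middle, which is the kind of pattern that signals a non-linear mean. The residuals still average to zero, as they always do for a least squares line with an intercept.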
[Figure: Bivariate Fit of Mileage By Speed, with Linear Fit: Mileage = 23.266776 - 0.0012701 Speed, and residual plot (Residual vs. Speed)]
[Figure: Data Simulated From A Simple Linear Regression Model (Idealreg.JMP): Bivariate Fit of Y By X, with residual plot (Residual vs. X)]
Summary
• Normal distribution can be used to calculate
probability that Y takes on certain values given X
• R squared: measure of how much regression
improves on ignoring X when predicting Y.
• Assumptions of simple linear regression model
must be checked in order for model to be used.
Residual plots can be used to check the linearity
assumption.
• Tuesday’s class: Section 2.4 (more on checking
assumptions, outliers and influential points,
lurking variables).