Stat 112: Notes 2
• Today’s class: Section 3.3.
– Full description of simple linear regression
model.
– Checking the assumptions of the simple linear
regression model.
– Inferences for simple linear regression model.
Wages and Education
• A random sample of 100 men (ages 18-70) was
surveyed about their weekly wages in 1988 and their
education (part of the 1988 March U.S. Current
Population Survey) (in file wagedatasubset.JMP)
• How much more on average do men with one extra year
of education make?
• If a man has a high school diploma but no further
education, what’s the best prediction of his earnings?
• Regression addresses these two questions
[Figure: Bivariate Fit of wage By educ. Scatterplot with X = Education (educ, 5 to 20) and Y = Weekly Wage (wage, 0 to 2500).]
Simple Linear Regression Model
[Figure: Bivariate Fit of wage By educ with the least squares line superimposed.]

Linear Fit
wage = -89.74965 + 51.225264*educ

Summary of Fit
RSquare                   0.139941
RSquare Adj               0.131165
Root Mean Square Error    331.48

The mean of weekly wages is estimated to increase by $b_1 = 51.23$ dollars for each extra year of education.
The typical size of the error from using a man's education to predict his wages is about RMSE = 331.48 dollars.
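As a cross-check on the JMP output, here is a minimal Python sketch that computes the same least squares quantities; it assumes the data from wagedatasubset.JMP have been exported to a CSV file (the file and column names below are assumptions):

```python
# Minimal sketch reproducing the Linear Fit and Summary of Fit above.
# Assumes wagedatasubset.JMP has been exported to "wagedatasubset.csv"
# with columns "educ" and "wage" (file and column names are assumptions).
import numpy as np
import pandas as pd

data = pd.read_csv("wagedatasubset.csv")
x, y = data["educ"].to_numpy(), data["wage"].to_numpy()
n = len(y)

# Least squares slope and intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# R-squared and root mean square error (JMP divides SSE by n - 2).
resid = y - (b0 + b1 * x)
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)
rmse = np.sqrt(np.sum(resid**2) / (n - 2))

print(f"wage = {b0:.5f} + {b1:.6f}*educ")
print(f"RSquare = {r2:.6f}, RMSE = {rmse:.2f}")
```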
Sample vs. Population
• We can view the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ as a sample from a population.
• Our goal is to learn about the relationship
between X and Y in the population:
– We don’t care about the particular 100 men sampled
but about the population of US men ages 18-70.
– From Notes 1, we don't care about the relationship between tracks counted and the density of deer for the particular sample, but about the relationship in the population of all tracks; this enables us to predict the future density of deer from the number of tracks counted.
Simple Linear Regression Model
The simple linear regression model:
$$E(Y_i \mid X = X_i) = \beta_0 + \beta_1 X_i, \qquad Y_i = \beta_0 + \beta_1 X_i + e_i, \qquad e_i \sim N(0, \sigma_e^2)$$
The $e_i$ are called disturbances and represent the deviation of $Y_i$ from its mean given $X_i$. The disturbances are estimated by the residuals $\hat{e}_i = Y_i - (b_0 + b_1 X_i)$.
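To make the model concrete, here is a short simulation sketch that generates data satisfying the model and fits the least squares line; the parameter values are purely illustrative (loosely echoing the wage fit), not estimates from the data:

```python
# Sketch: simulate data from Y_i = beta0 + beta1*X_i + e_i, e_i ~ N(0, sigma_e^2).
# The parameter values below are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma_e, n = -90.0, 51.0, 331.0, 100

x = rng.uniform(5, 20, size=n)        # explanatory variable values
e = rng.normal(0, sigma_e, size=n)    # unobserved disturbances
y = beta0 + beta1 * x + e             # responses

# Least squares estimates and residuals (the residuals estimate the e_i).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e_hat = y - (b0 + b1 * x)

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")  # close to beta0 and beta1
```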
Assumptions of the Simple Linear
Regression Model
For each value of the explanatory variable X=x, there is a
subpopulation of outcomes (responses) Y for units with X=x.
Assumptions of the simple linear regression model:
1. Linearity: The means of the subpopulations fall on a
straight line function of the explanatory variable.
2. Constant variance: The subpopulation standard deviations are all equal (to $\sigma_e$).
3. Normality: The subpopulations are all normally distributed.
4. Independence: The selection of an outcome from any of the
subpopulations is independent of the selection of any other
outcomes.
Checking the Assumptions
Simple Linear Regression Model for Population:
$Y_i = \beta_0 + \beta_1 x_i + e_i$.
Before making any inferences using the simple linear
regression model, we need to check the assumptions:
Based on the data $(X_1, Y_1), \ldots, (X_n, Y_n)$:
1. We estimate $\beta_0$ and $\beta_1$ by the least squares estimates $b_0$ and $b_1$.
2. We estimate the disturbances $e_i$ by the residuals $\hat{e}_i = Y_i - \hat{E}(Y \mid X_i) = Y_i - (b_0 + b_1 X_i)$.
3. We check whether the residuals approximately satisfy:
(1) Linearity: $E(\hat{e}_i) \approx 0$ over the whole range of $X_i$.
(2) Constant variance: $\mathrm{Var}(\hat{e}_i)$ is constant over the whole range of $X_i$.
(3) Normality: the $\hat{e}_i$ are approximately normally distributed.
(4) Independence: the $\hat{e}_i$ are independent (only a worry for time series data).
Residual Plot
A useful tool for checking the assumptions is the
residual plot.
Residual for observation i: $\hat{e}_i = y_i - \hat{E}(y_i \mid x_i) = y_i - (b_0 + b_1 x_i)$.
The residual plot is a plot of the residuals $\hat{e}_i$ versus $x_i$. It is constructed in JMP by fitting the least squares line, clicking the red triangle next to Linear Fit, and clicking Plot Residuals.
[Figure: Residual plot for the wage data. X = educ (5 to 20), Y = Residual (-500 to 1500).]
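Outside JMP, a residual plot can be sketched as follows, again assuming a CSV export of wagedatasubset.JMP (file and column names are assumptions):

```python
# Sketch: plot the residuals e_hat_i versus x_i for the least squares fit.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("wagedatasubset.csv")   # assumed CSV export
x, y = data["educ"].to_numpy(), data["wage"].to_numpy()

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

plt.scatter(x, resid)
plt.axhline(0, linestyle="--")   # residuals should center around zero
plt.xlabel("educ")
plt.ylabel("Residual")
plt.show()
```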
Checking Linearity Assumption
To check whether the linearity assumption holds (i.e., the model for the mean is correct), check whether $E(\hat{e}_i) \approx 0$ over each range of $X_i$.
[Figure: Residual plot for the wage data. X = educ (5 to 20), Y = Residual (-500 to 1500).]
The linearity assumption appears reasonable, but it appears that very high education and low education individuals earn more than expected (most of their residuals are positive). (We will consider a nonlinear model for these data in Chapter 5; for now we'll assume linearity is okay.)
Violation of Linearity
For a sample of McDonald’s restaurants
Y=Revenue of Restaurant
X=Mean Age of Children in Neighborhood of
Restaurant
[Figure: Bivariate Fit of Revenue By Age, a scatterplot with the least squares line (X = Age, 2.5 to 15.0; Y = Revenue, 800 to 1300), alongside the corresponding residual plot (residuals -200 to 300).]
The mean of the residuals is negative for small and large ages and positive for middle ages; linearity appears to be violated (we will see what to do when linearity is violated in Chapter 5).
Checking Constant Variance
To check that the constant variance assumption holds,
check that there is no pattern in the spread of the residuals
as X varies.
[Figure: Residual plot for the wage data. X = educ (5 to 20), Y = Residual (-500 to 1500).]
Constant variance appears reasonable.
Checking Normality
For checking normality, we can look at whether the overall
distribution of the residuals looks approximately normal by
making a histogram of the residuals. Save the residuals by clicking the red triangle next to Linear Fit (after using Fit Line) and then clicking Save Residuals. Then click Analyze,
Distribution and put the saved residuals column into Y,
Columns. The histogram should be approximately bell
shaped if the normality assumption holds.
[Figure: Histogram of the saved residuals (Residuals wage), ranging from about -500 to 1500.]
The residuals from the wage data have an approximately bell-shaped histogram, although there is some indication of skewness to the right. The normality assumption seems
roughly reasonable. We will look at more formal tools for
assessing normality in Chapter 6.
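An informal histogram check along these lines can also be sketched outside JMP (same assumed CSV export as before):

```python
# Sketch: histogram of the residuals as an informal normality check.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("wagedatasubset.csv")   # assumed CSV export
x, y = data["educ"].to_numpy(), data["wage"].to_numpy()
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

plt.hist(resid, bins=15)   # should look roughly bell shaped under normality
plt.xlabel("Residuals wage")
plt.ylabel("Count")
plt.show()
```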
Checking Assumptions
• It is important to check the assumptions of
a regression model because the
inferences depend on the assumptions
approximately holding. The assumptions
don’t need to hold exactly but only
approximately.
• We will study more about checking
assumptions and how to deal with
violations of the assumptions in Chapters
5 and 6.
Inferences
Simple Linear Regression Model for Population:
$$E(Y_i \mid X = X_i) = \beta_0 + \beta_1 X_i, \qquad Y_i = \beta_0 + \beta_1 X_i + e_i, \qquad e_i \sim N(0, \sigma_e^2)$$
Data: $(X_1, Y_1), \ldots, (X_n, Y_n)$.
The least squares estimates $b_0$ and $b_1$ will typically not be exactly equal to the true $\beta_0$ and $\beta_1$.
Inferences: Draw conclusions about $\beta_0$ and $\beta_1$ based on the data $(X_1, Y_1), \ldots, (X_n, Y_n)$.
1. Point estimates: Best estimates of $\beta_0$ and $\beta_1$.
2. Confidence intervals: Ranges of plausible values of $\beta_0$ and $\beta_1$.
3. Hypothesis tests: Test whether it is plausible that $\beta_0$ and $\beta_1$ equal certain values.
Sampling Distribution of $b_0$, $b_1$
• The sampling distribution of $b_0, b_1$ describes the probability distribution of the estimates over repeated samples $(x_1, y_1), \ldots, (x_n, y_n)$ from the simple linear regression model.
• Understanding the sampling distribution is
the key to drawing inferences from the
sample to the population.
Sampling distribution in wage data
• To see how the least squares estimates can differ over
different samples from the population, we consider the
“population” to be all 25,632 men surveyed in the March
1988 Current Population Survey in wagedata1988.JMP
and the sample to be random samples of size 100 like
the one in wagedatasubset.JMP.
“Population”:
[Figure: Bivariate Fit of wage By educ for the full population of 25,632 men (X = educ, 0 to 18; Y = wage, 0 to 18000), with the least squares line.]

Linear Fit
wage = -19.06983 + 50.414381*educ
$\beta_0 = -19.07$, $\beta_1 = 50.41$
Samples of wage data
• To take samples in JMP, click the Tables menu, then click Subset
and then click the circle next to Random Sample Size and set the
sample size. JMP will create a new data subset which is a random
sample of the original data set.
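The same random-subset step can be sketched in pandas, assuming a hypothetical CSV export wagedata1988.csv of the full survey:

```python
# Sketch of JMP's Tables > Subset > Random Sample Size step in pandas.
import pandas as pd

population = pd.read_csv("wagedata1988.csv")       # assumed export: all 25,632 men
sample = population.sample(n=100, random_state=1)  # one random sample of size 100
print(sample[["educ", "wage"]].head())
```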
Sample 1:
[Figure: Bivariate Fit of wage By educ for sample 1 (educ 2 to 20, wage 0 to 3000), with the least squares line.]
Linear Fit
wage = -288.6577 + 71.530586*educ
$b_0 = -288.66$, $b_1 = 71.53$

Sample 2:
[Figure: Bivariate Fit of wage By educ for sample 2 (educ 5 to 20, wage 0 to 2500), with the least squares line.]
Linear Fit
wage = 188.82961 + 38.453459*educ
$b_0 = 188.83$, $b_1 = 38.45$
Sampling distributions
• Only the sample, not the population, is usually available, so we need to understand the sampling distribution.
• Sampling distribution of $b_1$:
– $E(b_1) = \beta_1$
– $\mathrm{Var}(b_1) = \dfrac{\sigma_e^2}{(n-1)s_x^2}$, where $s_x^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
– The sampling distribution of $b_1$ is normally distributed.
– Even if the normality assumption fails, the sampling distribution of $b_1$ is still approximately normal if n > 30.
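A simulation sketch can verify these formulas; the parameter values below are illustrative (loosely matching the population fit on the earlier slide), not taken from the actual CPS data:

```python
# Sketch: simulate the sampling distribution of b1 and compare it with
# E(b1) = beta1 and Var(b1) = sigma_e^2 / ((n-1) s_x^2).
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma_e, n = -19.07, 50.41, 330.0, 100  # illustrative values

x = rng.uniform(0, 18, size=n)   # education values, held fixed across samples
sx2 = np.var(x, ddof=1)          # sample variance s_x^2

b1_draws = []
for _ in range(10_000):          # repeated samples from the model
    y = beta0 + beta1 * x + rng.normal(0, sigma_e, size=n)
    num = np.sum((x - x.mean()) * (y - y.mean()))
    b1_draws.append(num / np.sum((x - x.mean()) ** 2))

print("mean of b1:", np.mean(b1_draws), " theory:", beta1)
print("var of b1: ", np.var(b1_draws), " theory:", sigma_e**2 / ((n - 1) * sx2))
```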
Properties of $b_0$ and $b_1$ as Estimators of $\beta_0$ and $\beta_1$
• Unbiased Estimators: Mean of the sampling
distribution is equal to the population parameter
being estimated.
• Consistent Estimators: As the sample size n
increases, the probability that the estimator will
become as close as you specify to the true
parameter converges to 1.
• Minimum Variance Estimator: The variance of the estimator $b_1$ is smaller than the variance of any other linear unbiased estimator of $\beta_1$, say $b_1^*$.
Confidence Intervals
• Point estimate: $b_1$
• Confidence interval: range of plausible values for the true slope $\beta_1$
• $(1-\alpha)100\%$ confidence interval: $b_1 \pm t_{\alpha/2,\, n-2}\, s_{b_1}$, where $s_{b_1} = s_e \sqrt{\dfrac{1}{(n-1)s_x^2}}$ is an estimate of the standard deviation of $b_1$ ($s_e = \mathrm{RMSE}$).
• Typically we use a 95% CI.
• A 95% CI is approximately $b_1 \pm 2 s_{b_1}$.
95% CIs for a parameter are usually approximately
point estimate $\pm$ 2 * standard error(point estimate),
where the standard error of the point estimate is an estimate of the standard deviation of the point estimate.
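A quick sketch of this computation, using the slope estimate and standard error from the JMP Parameter Estimates output on the next slide:

```python
# Sketch: 95% CI for beta1 as b1 +/- t_{alpha/2, n-2} * s_b1,
# with b1 and s_b1 taken from the JMP Parameter Estimates table.
from scipy import stats

b1, s_b1, n = 51.225264, 12.82813, 100
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)   # t quantile with n-2 df

lower, upper = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(f"exact 95% CI:  ({lower:.2f}, {upper:.2f})")              # about (25.77, 76.68)
print(f"approx 95% CI: ({b1 - 2*s_b1:.2f}, {b1 + 2*s_b1:.2f})")  # (25.57, 76.88)
```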
Computing Confidence Intervals with JMP
In the Fit Line output in JMP, the information for computing the confidence interval for $\beta_1$ is given under Parameter Estimates.

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   -89.74965    173.4267    -0.52     0.6060
educ        51.225264    12.82813    3.99      0.0001

The Std Error of the slope for educ is $s_{b_1}$.
Approximate 95% confidence interval for $\beta_1$: $b_1 \pm 2 s_{b_1} = 51.225 \pm 2 \times 12.828 = (25.57, 76.88)$.
The exact 95% confidence interval can be computed by moving the mouse to the Parameter Estimates, right clicking, clicking Columns, and then clicking Lower 95% and Upper 95%.

Parameter Estimates
Term        Lower 95%    Upper 95%
Intercept   -433.9092    254.40995
educ        25.768251    76.682276

Exact 95% confidence interval for $\beta_1$: (25.77, 76.68).
Interpretation: The increase in mean weekly wages for one extra year of education is likely to be between 25.77 and 76.68 dollars, based on the sample in wagedatasubset.JMP.
Summary
• We have described the assumptions of the
simple linear regression model and how to
check them.
• We have come up with a method of describing
the uncertainty in our estimates of the slope and
the intercept via confidence intervals.
• Note: These confidence intervals are only
accurate if the assumptions of the simple linear
regression model are approximately correct.
• Next class: Hypothesis tests.