Download Thu Oct 30 - Wharton Statistics

Lecture 16 – Thurs, Oct. 30
• Inference for Regression (Sections 7.3-7.4):
– Hypothesis Tests and Confidence Intervals for
Intercept and Slope
– Confidence Intervals for mean response
– Prediction Intervals
• Next time: Robustness of least squares
inferences, graphical tools for model
assessment (8.1-8.3)
Regression
• Goal of regression: Estimate the mean of the response Y
for each subpopulation X=x, $\mu\{Y \mid X\}$
• Example: Y= neuron activity index, X=years
playing stringed instrument
• Simple linear regression model: $\mu\{Y \mid X\} = \beta_0 + \beta_1 X$
• Estimate $\beta_0$ and $\beta_1$ by least squares – choose $\hat{\beta}_0, \hat{\beta}_1$
to minimize the sum of squared residuals
(prediction errors)
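As a rough illustration of the least squares step, the sketch below uses the standard closed-form solution for simple linear regression. The function name `least_squares` and the data are hypothetical and chosen only for illustration; they are not the neuron activity data.

```python
def least_squares(x, y):
    """Return (b0, b1), the intercept and slope minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the sample means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx            # estimated slope
    b0 = y_bar - b1 * x_bar   # estimated intercept
    return b0, b1


# Hypothetical illustrative data (an assumption, not from the lecture).
x = [1, 2, 3, 4, 5]
y = [3, 5, 4, 8, 10]
b0, b1 = least_squares(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

The fitted line always passes through the point of sample means, which is why the intercept is computed from the slope.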
Ideal Model
• Assumptions of ideal simple linear regression
model
– There is a normally distributed subpopulation of
responses for each value of the explanatory variable
– The means of the subpopulations fall on a straight-line
function of the explanatory variable.
– The subpopulation standard deviations are all equal (to $\sigma$)
– The selection of an observation from any of the
subpopulations is independent of the selection of any
other observation.
The standard deviation $\sigma$
• $\sigma$ is the standard deviation in each
subpopulation.
• $\sigma$ measures the accuracy of predictions from the
regression.
• If the simple linear regression model holds, then
approximately
– 68% of the observations will fall within $\sigma$ of the
regression line
– 95% of the observations will fall within $2\sigma$ of the
regression line
Estimating $\sigma$
• Residuals provide the basis for an estimate of $\sigma$:
$\hat{\sigma} = \sqrt{\dfrac{\text{sum of all squared residuals}}{\text{degrees of freedom}}}$
• Degrees of freedom for simple linear regression = n-2
• If the simple linear regression model holds, then
approximately
– 68% of the observations will fall within $\hat{\sigma}$ of the least
squares line
– 95% of the observations will fall within $2\hat{\sigma}$ of the least
squares line
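The estimate of $\sigma$ can be sketched in a few lines. The function name `sigma_hat` and the data and coefficients below are hypothetical, used only to show the arithmetic; in JMP output this quantity is the one reported as Root Mean Square Error.

```python
import math


def sigma_hat(x, y, b0, b1):
    """Estimate sigma as sqrt(sum of squared residuals / (n - 2))."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    degrees_of_freedom = len(x) - 2  # two estimated parameters: intercept and slope
    return math.sqrt(sum(r ** 2 for r in residuals) / degrees_of_freedom)


# Hypothetical data with its least squares coefficients (assumed for illustration).
x = [1, 2, 3, 4, 5]
y = [3, 5, 4, 8, 10]
s = sigma_hat(x, y, b0=0.9, b1=1.7)
print(f"sigma-hat = {s:.4f}")
```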
Inference for Simple Linear Regression
• Inference is based on the ideal simple linear
regression model holding.
• Inference is based on taking repeated random
samples $(y_1, \ldots, y_n)$ from the same subpopulations
$(x_1, \ldots, x_n)$ as in the observed data.
• Types of inference:
– Hypothesis tests for intercept and slope
– Confidence intervals for intercept and slope
– Confidence interval for mean of Y at X=X0
– Prediction interval for future Y for which X=X0
Hypothesis tests for $\beta_0$ and $\beta_1$
• Hypothesis test of $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$
– Based on t-test statistic,
$|t| = \dfrac{|\text{Estimate}|}{SE(\text{Estimate})} = \dfrac{|\hat{\beta}_1|}{SE(\hat{\beta}_1)}$
– The p-value has its usual interpretation: the probability under the
null hypothesis that |t| would be at least as large as its
observed value. A small p-value is evidence against the null
hypothesis.
• Hypothesis test for $H_0: \beta_0 = 0$ vs. $H_a: \beta_0 \neq 0$ is based
on an analogous test statistic.
• Test statistics and p-values can be found on JMP
output under Parameter Estimates, obtained by
using Fit Line after Fit Y by X.
JMP output for example
[Figure: Bivariate Fit of Neuron activity index By Years playing; vertical axis: Neuron activity index, horizontal axis: Years playing]
Linear Fit
Neuron activity index = 7.9715909 + 1.0268308 Years playing
Summary of Fit
RSquare                      0.866986
RSquare Adj                  0.855902
Root Mean Square Error       3.025101
Mean of Response             15.89286
Observations (or Sum Wgts)   14
Parameter Estimates
Term            Estimate    Std Error   t Ratio   Prob>|t|
Intercept       7.9715909   1.206598    6.61      <.0001
Years playing   1.0268308   0.116105    8.84      <.0001
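The t ratios in the JMP output can be reproduced by dividing each estimate by its standard error. The sketch below copies the estimates and standard errors from the Parameter Estimates output; the dictionary layout and variable names are illustrative choices. The critical value 2.179 is $t_{12}(.975)$ for the n - 2 = 12 degrees of freedom in this example, so comparing |t| with it is the same as testing at the 5% level.

```python
# (estimate, standard error) pairs copied from the JMP Parameter Estimates output.
estimates = {
    "Intercept": (7.9715909, 1.206598),
    "Years playing": (1.0268308, 0.116105),
}
T_CRIT = 2.179  # t_{12}(.975): n = 14 observations, n - 2 = 12 degrees of freedom

# t ratio = estimate / SE(estimate)
t_ratios = {term: est / se for term, (est, se) in estimates.items()}

for term, t in t_ratios.items():
    reject = abs(t) > T_CRIT
    print(f"{term}: t = {t:.2f}, reject H0 at the 5% level: {reject}")
```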
Confidence Intervals for $\beta_0$ and $\beta_1$
• Confidence intervals provide a range of plausible
values for $\beta_0$ and $\beta_1$
• 95% Confidence Intervals:
$\hat{\beta}_0 \pm t_{n-2}(.975)\,SE(\hat{\beta}_0)$
$\hat{\beta}_1 \pm t_{n-2}(.975)\,SE(\hat{\beta}_1)$
• Finding CIs in JMP: Can find $\hat{\beta}_0, \hat{\beta}_1, SE(\hat{\beta}_0), SE(\hat{\beta}_1)$
under Parameter Estimates after fitting line. Can
find $t_{n-2}(.975)$ in Table A.2. $t_{n-2}(.975) \approx 2$
• For brain activity study (n = 14, so $t_{12}(.975) = 2.179$), CIs:
$\beta_0$: $7.972 \pm 2.179 \times 1.207 = (5.34, 10.60)$
$\beta_1$: $1.027 \pm 2.179 \times 0.116 = (0.77, 1.28)$
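The interval arithmetic for the brain activity study can be checked with a short sketch. The helper name `t_interval` is hypothetical; the estimates and standard errors are the JMP values, and 2.179 is the $t_{12}(.975)$ multiplier used in the lecture.

```python
def t_interval(estimate, se, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * se."""
    margin = t_crit * se
    return estimate - margin, estimate + margin


T_CRIT = 2.179  # t_{12}(.975)

ci_intercept = t_interval(7.9715909, 1.206598, T_CRIT)
ci_slope = t_interval(1.0268308, 0.116105, T_CRIT)

print("95% CI for intercept: ({:.2f}, {:.2f})".format(*ci_intercept))
print("95% CI for slope:     ({:.2f}, {:.2f})".format(*ci_slope))
```

Because the slope interval excludes 0, it agrees with the small p-value for the test of a zero slope.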
Confidence Intervals for Mean of
Y at X=X0
• What is a plausible range of values for $\mu\{Y \mid X_0\}$?
• 95% CI for $\mu\{Y \mid X_0\}$: $\hat{\mu}\{Y \mid X_0\} \pm t_{n-2}(.975)\,SE[\hat{\mu}\{Y \mid X_0\}]$
• $\hat{\mu}\{Y \mid X_0\} = \hat{\beta}_0 + \hat{\beta}_1 X_0$,
$SE[\hat{\mu}\{Y \mid X_0\}] = \hat{\sigma}\sqrt{\dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{(n-1)s_X^2}}$
• Note about formula
– Precision in estimating $\mu\{Y \mid X\}$ is not constant for all
values of X. Precision decreases as X0 gets farther
away from sample average of X’s
• JMP implementation: Use Confid Curves Fit
command under red triangle next to Linear Fit
after using Fit Y by X, fit line. Use the crosshair
tool to find the exact values of the confidence
interval endpoints for a given X0.
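As a minimal sketch of the standard error formula, the function below fits the line, estimates $\sigma$, and forms the interval for the mean at a chosen X0. The function name `mean_response_ci` and the data are hypothetical; the multiplier 3.182 is $t_3(.975)$, which matches the n - 2 = 3 degrees of freedom of this toy data set only.

```python
import math


def mean_response_ci(x, y, x0, t_crit):
    """Return (estimate, standard error, (lower, upper)) for the mean of Y at x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * s_X^2
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(sse / (n - 2))
    estimate = b0 + b1 * x0
    se = sigma_hat * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    return estimate, se, (estimate - t_crit * se, estimate + t_crit * se)


# Hypothetical illustrative data (an assumption, not from the lecture).
x = [1, 2, 3, 4, 5]
y = [3, 5, 4, 8, 10]
for x0 in (3, 5):
    est, se, (lower, upper) = mean_response_ci(x, y, x0, t_crit=3.182)
    print(f"X0={x0}: estimate={est:.2f}, SE={se:.3f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

The standard error is smallest at X0 equal to the sample mean of X (here 3) and grows as X0 moves away from it, which is the point of the note about the formula.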
Prediction Intervals
• What are likely values for a future value Y0 at some
specified value of X (=X0)?
• The best single prediction of a future response at X0 is the
estimated mean response:
$\mathrm{Pred}\{Y \mid X_0\} = \hat{\mu}\{Y \mid X_0\} = \hat{\beta}_0 + \hat{\beta}_1 X_0$
• A prediction interval is an interval of likely values along
with a measure of the likelihood that the interval will contain
the response.
• 95% prediction interval for X0: If repeated samples $(Y_1, \ldots, Y_n)$
are obtained from the subpopulations $(x_1, \ldots, x_n)$ and a
prediction interval is formed, the prediction interval will
contain the value of Y0 for a future observation from the
subpopulation X0 95% of the time.
Prediction Intervals Cont.
• Prediction interval must account for two sources
of uncertainty:
– Uncertainty about the location of the subpopulation
mean $\mu\{Y \mid X_0\}$
– Uncertainty about where the future value will be in
relation to its mean
$Y - \mathrm{Pred}\{Y \mid X_0\} = Y - \hat{\mu}\{Y \mid X_0\}$
$= [Y - \mu\{Y \mid X_0\}] + [\mu\{Y \mid X_0\} - \hat{\mu}\{Y \mid X_0\}]$
• Prediction Error = Random Sampling Error +
Estimation Error
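One way to see how the two error sources combine: assuming the future response is independent of the sample used to fit the line (the independence assumption of the ideal model), the variances of the two pieces add. The symbol $\mathrm{SD}[\cdot]$ below is notation introduced here for the true standard deviation of the estimated mean; it is not from the lecture.

```latex
\begin{aligned}
\operatorname{Var}\bigl(Y - \hat{\mu}\{Y \mid X_0\}\bigr)
  &= \operatorname{Var}\bigl(Y - \mu\{Y \mid X_0\}\bigr)
   + \operatorname{Var}\bigl(\mu\{Y \mid X_0\} - \hat{\mu}\{Y \mid X_0\}\bigr) \\
  &= \sigma^2 + \mathrm{SD}\bigl[\hat{\mu}\{Y \mid X_0\}\bigr]^2
\end{aligned}
```

Replacing $\sigma$ by $\hat{\sigma}$ and the standard deviation by $SE[\hat{\mu}\{Y \mid X_0\}]$ gives the estimated standard error of prediction, $\sqrt{\hat{\sigma}^2 + SE[\hat{\mu}\{Y \mid X_0\}]^2}$.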
Prediction Interval Formula
• 95% prediction interval at X0
$\hat{\mu}\{Y \mid X_0\} \pm t_{n-2}(.975)\sqrt{\hat{\sigma}^2 + SE[\hat{\mu}\{Y \mid X_0\}]^2}$
• Compare to 95% CI for mean at X0:
$\hat{\mu}\{Y \mid X_0\} \pm t_{n-2}(.975)\,SE[\hat{\mu}\{Y \mid X_0\}]$
– Prediction interval is wider due to random sampling
error in future response
– As sample size n becomes large, margin of error of CI
for mean goes to zero but margin of error of PI doesn’t.
• JMP implementation: Use Confid Curves Indiv command
under red triangle next to Linear Fit after using Fit Y by X,
fit line. Use the crosshair tool to find the exact values of
the prediction interval endpoints for a given X0.
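The contrast between the two margins of error can be sketched numerically. The function name `margins` and the value $\hat{\sigma} = 3.0$ are hypothetical, and the sketch makes two simplifying assumptions: X0 equals the sample mean of X, so the standard error of the estimated mean reduces to $\hat{\sigma}/\sqrt{n}$, and the multiplier is taken as 2, the rough value of $t_{n-2}(.975)$.

```python
import math


def margins(sigma_hat, n, t_crit=2.0):
    """Return (CI margin, PI margin) at X0 equal to the sample mean of X."""
    se_mean = sigma_hat / math.sqrt(n)                  # SE of estimated mean when X0 = X-bar
    se_pred = math.sqrt(sigma_hat ** 2 + se_mean ** 2)  # adds random sampling error
    return t_crit * se_mean, t_crit * se_pred


# sigma_hat = 3.0 is an assumed value held fixed while n grows.
for n in (14, 100, 10_000):
    ci_margin, pi_margin = margins(sigma_hat=3.0, n=n)
    print(f"n={n}: CI margin={ci_margin:.3f}, PI margin={pi_margin:.3f}")
```

As n grows, the CI margin shrinks toward zero while the PI margin levels off near $2\hat{\sigma}$, because the spread of a single future response around its mean does not depend on the sample size.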
Example
• A building maintenance company is planning to submit a
bid on a contract to clean 40 corporate offices scattered
throughout an office complex. The costs incurred by the
maintenance company are proportional to the number of
crews needed for this task. Currently the company has 11
crews. Will 11 crews be enough?
• Recent data are available for the number of rooms that
were cleaned by varying numbers of crews. The data are in
cleaning.jmp.
• Assuming a simple linear regression model holds, which is
more relevant for answering the question of interest – a
confidence interval for the mean number of rooms cleaned
by 11 crews or a prediction interval for the number of
rooms cleaned on a particular day by 11 crews?
Correlation
• Section 7.5.4
• Correlation is a measure of the degree of linear
association between two variables X and Y. For
each unit in population, both X and Y are
measured.
• Population correlation:
$\rho = \mathrm{Mean}[(X - \mu_X)(Y - \mu_Y) / (\sigma_X \sigma_Y)]$
• Correlation is between –1 and 1. Correlation of 0
indicates no linear association. Correlations near
+1 indicate strong positive linear association;
correlations near –1 indicate strong negative linear
association.
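A sample version of the correlation can be sketched by replacing the population means and standard deviations with their sample counterparts. The function name `correlation` and all three data sets are hypothetical, chosen to hit the extreme values and zero.

```python
import math


def correlation(x, y):
    """Sample correlation coefficient of paired observations x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data sets (assumptions for illustration only).
r_positive = correlation([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])   # y = 2x + 1
r_negative = correlation([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])    # y = 6 - x
r_curved = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])    # y = x^2

print(r_positive, r_negative, r_curved)
```

The third data set follows an exact relationship, yet its correlation is zero, because the relationship has no linear component.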
Correlation and Regression
• Features of correlation
– Dimension-free. Units of X and Y don’t matter.
– Symmetric in X and Y. There is no “response” and
“explanatory” variable.
• Correlation only measures degree of linear association. It
is possible for there to be an exact relationship between X
and Y and yet sample correlation coefficient is zero.
• Correlation in JMP: Click multivariate and put variables in
Y, columns.
• Connection to regression
– Test of slope $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$ is identical to
test of $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$. Test of correlation
coefficient only makes sense if the pairs (X,Y) are
randomly sampled from population.
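The equivalence of the two tests can be checked on the neuron activity output. The sketch relies on two standard facts that the slides do not state: in simple linear regression RSquare equals the squared sample correlation, and the correlation test statistic is $t = r\sqrt{n-2}/\sqrt{1-r^2}$. The numbers are copied from the JMP output; the variable names are illustrative.

```python
import math

r_square = 0.866986   # RSquare from the JMP Summary of Fit
n = 14                # Observations
slope, se_slope = 1.0268308, 0.116105  # from the JMP Parameter Estimates

# The slope estimate is positive, so the sample correlation is the positive root.
r = math.sqrt(r_square)

t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
t_from_slope = slope / se_slope

print(f"t from correlation: {t_from_r:.3f}")
print(f"t from slope:       {t_from_slope:.3f}")
```

Both routes give the same t statistic up to rounding in the reported output, so the two tests give the same p-value.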
Correlation in JMP