Thu Oct 30 - Wharton Statistics
Lecture 16 – Thurs, Oct. 30
• Inference for Regression (Sections 7.3-7.4):
  – Hypothesis Tests and Confidence Intervals for Intercept and Slope
  – Confidence Intervals for mean response
  – Prediction Intervals
• Next time: Robustness of least squares inferences, graphical tools for model assessment (8.1-8.3)

Regression
• Goal of regression: Estimate the mean response of Y for subpopulations X = x, $\mu\{Y \mid X\}$
• Example: Y = neuron activity index, X = years playing stringed instrument
• Simple linear regression model: $\mu\{Y \mid X\} = \beta_0 + \beta_1 X$
• Estimate $\beta_0$ and $\beta_1$ by least squares: choose $\hat\beta_0, \hat\beta_1$ to minimize the sum of squared residuals (prediction errors)

Ideal Model
• Assumptions of ideal simple linear regression model
  – There is a normally distributed subpopulation of responses for each value of the explanatory variable.
  – The means of the subpopulations fall on a straight-line function of the explanatory variable.
  – The subpopulation standard deviations are all equal (to $\sigma$).
  – The selection of an observation from any of the subpopulations is independent of the selection of any other observation.

The standard deviation $\sigma$
• $\sigma$ is the standard deviation in each subpopulation.
• $\sigma$ measures the accuracy of predictions from the regression.
• If the simple linear regression model holds, then approximately
  – 68% of the observations will fall within $\sigma$ of the regression line
  – 95% of the observations will fall within $2\sigma$ of the regression line

Estimating $\sigma$
• Residuals provide the basis for an estimate of $\sigma$: $\hat\sigma = \sqrt{\dfrac{\text{sum of all squared residuals}}{\text{degrees of freedom}}}$
• Degrees of freedom for simple linear regression = $n-2$
• If the simple linear regression model holds, then approximately
  – 68% of the observations will fall within $\hat\sigma$ of the least squares line
  – 95% of the observations will fall within $2\hat\sigma$ of the least squares line

Inference for Simple Linear Regression
• Inference based on the ideal simple linear regression model holding.
• Inference based on taking repeated random samples $(y_1, \ldots, y_n)$ from the same subpopulations $(x_1, \ldots, x_n)$ as in the observed data.
• Types of inference:
  – Hypothesis tests for intercept and slope
  – Confidence intervals for intercept and slope
  – Confidence interval for mean of Y at $X = X_0$
  – Prediction interval for future Y for which $X = X_0$

Hypothesis tests for $\beta_0$ and $\beta_1$
• Hypothesis test of $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$
  – Based on t-test statistic, $|t| = \dfrac{|\text{Estimate}|}{SE(\text{Estimate})} = \dfrac{|\hat\beta_1|}{SE(\hat\beta_1)}$
  – The p-value has the usual interpretation: the probability under the null hypothesis that $|t|$ would be at least as large as its observed value. A small p-value is evidence against the null hypothesis.
• Hypothesis test of $H_0: \beta_0 = 0$ vs. $H_a: \beta_0 \neq 0$ is based on an analogous test statistic.
• Test statistics and p-values can be found on JMP output under parameter estimates, obtained by using fit line after fit Y by X.

JMP output for example
[Figure: Bivariate Fit of Neuron activity index By Years playing. Scatterplot with linear fit; y-axis: Neuron activity index (0 to 30), x-axis: Years playing (0 to 20).]

Linear Fit
Neuron activity index = 7.9715909 + 1.0268308 Years playing

Summary of Fit
RSquare                       0.866986
RSquare Adj                   0.855902
Root Mean Square Error        3.025101
Mean of Response              15.89286
Observations (or Sum Wgts)    14

Parameter Estimates
Term            Estimate    Std Error   t Ratio   Prob>|t|
Intercept       7.9715909   1.206598    6.61      <.0001
Years playing   1.0268308   0.116105    8.84      <.0001

Confidence Intervals for $\beta_0$ and $\beta_1$
• Confidence intervals provide a range of plausible values for $\beta_0$ and $\beta_1$
• 95% Confidence Intervals:
  $\hat\beta_0 \pm t_{n-2}(.975)\,SE(\hat\beta_0)$
  $\hat\beta_1 \pm t_{n-2}(.975)\,SE(\hat\beta_1)$
• Finding CIs in JMP: Can find $\hat\beta_0$, $\hat\beta_1$, $SE(\hat\beta_0)$, $SE(\hat\beta_1)$ under parameter estimates after fitting line. Can find $t_{n-2}(.975)$ in Table A.2.
  $t_{n-2}(.975) \approx 2$
• For brain activity study, CIs:
  $\hat\beta_0$: $7.972 \pm 2.179 \times 1.207 = (5.34, 10.60)$
  $\hat\beta_1$: $1.027 \pm 2.179 \times 0.116 = (0.77, 1.28)$

Confidence Intervals for Mean of Y at $X = X_0$
• What is a plausible range of values for $\mu\{Y \mid X_0\}$?
• 95% CI for $\mu\{Y \mid X_0\}$: $\hat\mu\{Y \mid X_0\} \pm t_{n-2}(.975)\,SE[\hat\mu\{Y \mid X_0\}]$
• $\hat\mu\{Y \mid X_0\} = \hat\beta_0 + \hat\beta_1 X_0$, $SE[\hat\mu\{Y \mid X_0\}] = \hat\sigma \sqrt{\dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{(n-1)\,s_X^2}}$
• Note about formula
  – Precision in estimating $\mu\{Y \mid X\}$ is not constant for all values of X. Precision decreases as $X_0$ gets farther away from the sample average of the X's.
• JMP implementation: Use Confid Curves fit command under red triangle next to Linear Fit after using Fit Y by X, fit line. Use the crosshair tool to find the exact values of the confidence interval endpoints for a given $X_0$.

Prediction Intervals
• What are likely values for a future value $Y_0$ at some specified value of X ($= X_0$)?
• The best single prediction of a future response at $X_0$ is the estimated mean response: $\text{Pred}\{Y \mid X_0\} = \hat\mu\{Y \mid X_0\} = \hat\beta_0 + \hat\beta_1 X_0$
• A prediction interval is an interval of likely values along with a measure of the likelihood that the interval will contain the response.
• 95% prediction interval for $X_0$: If repeated samples $(Y_1, \ldots, Y_n)$ are obtained from the subpopulations $(x_1, \ldots, x_n)$ and a prediction interval is formed, the prediction interval will contain the value of $Y_0$ for a future observation from the subpopulation $X_0$ 95% of the time.

Prediction Intervals Cont.
• Prediction interval must account for two sources of uncertainty:
  – Uncertainty about the location of the subpopulation mean $\mu\{Y \mid X_0\}$
  – Uncertainty about where the future value will be in relation to its mean
• $Y - \text{Pred}\{Y \mid X_0\} = Y - \hat\mu\{Y \mid X_0\} = [Y - \mu\{Y \mid X_0\}] + [\mu\{Y \mid X_0\} - \hat\mu\{Y \mid X_0\}]$
• Prediction Error = Random Sampling Error + Estimation Error

Prediction Interval Formula
• 95% prediction interval at $X_0$: $\hat\mu\{Y \mid X_0\} \pm t_{n-2}(.975)\sqrt{\hat\sigma^2 + SE[\hat\mu\{Y \mid X_0\}]^2}$
• Compare to 95% CI for mean at $X_0$: $\hat\mu\{Y \mid X_0\} \pm t_{n-2}(.975)\,SE[\hat\mu\{Y \mid X_0\}]$
  – Prediction interval is wider due to random sampling error in future response
  – As sample size n becomes large, margin of error of CI for mean goes to zero but margin of error of PI doesn't.
• JMP implementation: Use Confid Curves Indiv command under red triangle next to Linear Fit after using Fit Y by X, fit line. Use the crosshair tool to find the exact values of the prediction interval endpoints for a given $X_0$.

Example
• A building maintenance company is planning to submit a bid on a contract to clean 40 corporate offices scattered throughout an office complex. The costs incurred by the maintenance company are proportional to the number of crews needed for this task. Currently the company has 11 crews. Will 11 crews be enough?
• Recent data are available for the number of rooms that were cleaned by varying numbers of crews. The data are in cleaning.jmp.
• Assuming a simple linear regression model holds, which is more relevant for answering the question of interest: a confidence interval for the mean number of rooms cleaned by 11 crews, or a prediction interval for the number of rooms cleaned on a particular day by 11 crews?

Correlation
• Section 7.5.4
• Correlation is a measure of the degree of linear association between two variables X and Y. For each unit in the population, both X and Y are measured.
• Population correlation = $\text{Mean}[\{X - \mu_X\}\{Y - \mu_Y\} / (\sigma_X \sigma_Y)]$
• Correlation is between –1 and 1. Correlation of 0 indicates no linear association.
  Correlations near +1 indicate strong positive linear association; correlations near –1 indicate strong negative linear association.

Correlation and Regression
• Features of correlation
  – Dimension-free. Units of X and Y don't matter.
  – Symmetric in X and Y. There is no "response" and "explanatory" variable.
• Correlation only measures degree of linear association. It is possible for there to be an exact relationship between X and Y and yet for the sample correlation coefficient to be zero.
• Correlation in JMP: Click multivariate and put variables in Y, columns.
• Connection to regression
  – Test of slope $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$ is identical to test of $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$, where $\rho$ is the population correlation.
  – Test of correlation coefficient only makes sense if the pairs (X, Y) are randomly sampled from the population.

Correlation in JMP
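
The slides use JMP, so the following Python sketch is not part of the original lecture. It checks two of the claims above numerically: that the t-test for the slope and the t-test for the correlation coefficient give the same statistic and p-value, and that an exact nonlinear relationship can have zero sample correlation. The data are synthetic and made up for illustration (they are not the neuron activity data), and the variable names are assumptions of this sketch.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only (NOT the neuron activity data).
rng = np.random.default_rng(0)
x = rng.uniform(0, 20, size=14)
y = 8.0 + 1.0 * x + rng.normal(0, 3, size=14)
n = len(x)

# Least squares fit; linregress reports the two-sided p-value for H0: slope = 0.
fit = stats.linregress(x, y)
t_slope = fit.slope / fit.stderr

# Sample correlation r and its t statistic on n - 2 degrees of freedom.
r = np.corrcoef(x, y)[0, 1]
t_corr = r * np.sqrt((n - 2) / (1 - r**2))
p_corr = 2 * stats.t.sf(abs(t_corr), df=n - 2)

print(f"t (slope) = {t_slope:.4f}, t (correlation) = {t_corr:.4f}")
print(f"p (slope) = {fit.pvalue:.3g}, p (correlation) = {p_corr:.3g}")

# An exact but nonlinear relationship: y = x^2 on x values symmetric about 0.
x2 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y2 = x2**2
r2 = np.corrcoef(x2, y2)[0, 1]
print(f"sample correlation for y = x^2: {abs(r2):.4f}")
```

The two t statistics agree because $\hat\beta_1 / SE(\hat\beta_1)$ simplifies algebraically to $r\sqrt{(n-2)/(1-r^2)}$. In the second example Y is completely determined by X, yet the positive and negative X values cancel in the cross-product sum, so the sample correlation is zero.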