Chapter 2

Recall the example:

Example 1: Suppose we want to know whether students' GPAs are related to how much they study. A student project group surveyed a random sample of 80 students from their university and asked them their GPA and how many hours a week they study.

    /* Scatterplot of GPA against study hours, with a fitted regression line */
    symbol1 value=dot color=blue interpol=r;
    proc gplot data=mydata.studygpa;
      plot gpa*study;
    run; quit;

    /* Summary statistics for both variables */
    proc univariate data=mydata.studygpa;
      var gpa study;
    run; quit;

    /* Simple linear regression; save fitted values and residuals */
    proc reg data=mydata.studygpa;
      model gpa=study;
      output out=mydata.regfit p=yhat r=resid;
    run; quit;

We used the applet available at http://www.rossmanchance.com/applets/RegSim/RegCoeff.html

We are testing the null and alternative hypotheses:

$$H_0: \beta_1 = 0 \qquad \text{versus} \qquad H_a: \beta_1 \neq 0$$

Running the applet 1000 times from a population where there is no relationship between X and Y finds that obtaining a slope as large as the observed slope is very rare (approximately 3 out of 1000 times).

Hit reset, run the simulation again, and look at the histogram for the estimates of the slope $b_1$. For my simulation, this is centered very close to 0 and has a standard deviation of .028. Notice this is very close to what SAS calculated as the standard error of the slope (.02771). How does SAS calculate this number?

The textbook denotes this standard error of the slope as $s\{b_1\}$, and the formula is

$$s\{b_1\} = \sqrt{\frac{MSE}{\sum_{i=1}^{n}(X_i-\bar{X})^2}} = \sqrt{\frac{\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2/(n-2)}{\sum_{i=1}^{n}(X_i-\bar{X})^2}} = \sqrt{\frac{\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2}{(n-2)\sum_{i=1}^{n}(X_i-\bar{X})^2}} = \sqrt{\frac{16.07787}{78(268.149219)}} = .02773$$

The test statistic is calculated as:

$$t = \frac{b_1}{s\{b_1\}} = \frac{.08938}{.02771} = 3.23$$

This generates a p-value of .0018.

Most of the time we compare the p-value to 0.05. The 0.05 value is called the "level of significance." Sometimes we want to be more confident in our conclusion that there is a relationship between Y and X, so we use a level of significance (denoted $\alpha$) of 0.01 instead of .05. Some folks might use $\alpha = .1$, but this is not generally accepted in the scientific literature.
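The standard error, test statistic, and p-value above can be checked by hand in a short DATA step. This is only a sketch: the numbers are copied from the values quoted in these notes rather than read from mydata.studygpa, and the variable names (sse, sxx, se_hand, se_sas) are made up for illustration. The results should land close to the quoted .02773, 3.23, and .0018.

```sas
data _null_;
  /* Values copied from the notes (assumed, not recomputed from the data) */
  sse    = 16.07787;     /* sum of squared residuals            */
  sxx    = 268.149219;   /* sum of (X_i - Xbar)**2              */
  n      = 80;           /* number of students surveyed         */
  b1     = 0.08938;      /* estimated slope                     */
  se_sas = 0.02771;      /* standard error reported by PROC REG */

  /* Standard error of the slope from the textbook formula */
  mse     = sse / (n - 2);
  se_hand = sqrt(mse / sxx);

  /* Test statistic for H0: slope = 0 and its two-sided p-value */
  t = b1 / se_sas;
  p = 2 * (1 - probt(abs(t), n - 2));

  /* Write the results to the SAS log */
  put se_hand= t= p=;
run;
```

PROBT returns the cumulative probability of the t distribution, so doubling the upper-tail area gives the two-sided p-value.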
It is common practice in journals to give the estimate of the slope ($b_1 = .08938$), the test statistic ($t = 3.23$), and the p-value (.0018) when reporting simple linear regression results.

We can also use $s\{b_1\}$ to estimate with 95% confidence the true slope between study hours and GPA. We use the formula

$$b_1 \pm t^*_{n-2}\, s\{b_1\} = .08938 \pm 1.991(.02771) = (.03421,\ .14455)$$

Applet available at http://www.stat.tamu.edu/~west/applets/tdemo.html

    /* CLB adds confidence limits for the regression coefficients */
    proc reg data=mydata.studygpa;
      model gpa=study / clb;
      output out=mydata.regfit p=yhat r=resid;
    run; quit;

Interpretation:

Another way to do the hypothesis test is to determine whether 0 is in the confidence interval or not.

Rules:

We can calculate other confidence intervals by adding the alpha= option to the model statement.

    /* ALPHA=.01 requests 99% confidence limits */
    proc reg data=mydata.studygpa;
      model gpa=study / alpha=.01 clb;
      output out=mydata.regfit p=yhat r=resid;
    run; quit;

Interpretation:

The default is alpha=.05, which generates 95% confidence intervals. When alpha=.1, SAS generates 90% confidence intervals, and when alpha=.01, SAS generates 99% confidence intervals.

Sometimes we want to test a hypothesis for a slope other than 0. We may have some prior belief about what the relationship between X and Y should be. Suppose we believed, prior to collecting any data, that adding an extra hour of study time per week should improve a GPA by .2 points. This is how we test that.

First we modify our null and alternative hypotheses:

$$H_0: \beta_1 = .2 \qquad \text{versus} \qquad H_a: \beta_1 \neq .2$$

We calculate the test statistic slightly differently, subtracting the hypothesized slope $\beta_1$ from the estimate:

$$t = \frac{b_1 - \beta_1}{s\{b_1\}}, \qquad F = t^2$$

$$t = \frac{.08938 - .2}{.02771} = -3.9921, \qquad F = (-3.9921)^2 = 15.94$$

The SAS code to do this test is:

    /* TEST statement: F test of H0: slope on study = .2 */
    proc reg data=mydata.studygpa;
      model gpa=study;
      test study=.2;
      output out=mydata.regfit p=yhat r=resid;
    run; quit;

The p-value is 0.0001. We reject the null hypothesis and conclude:

SKIP Sections 2.2, 2.3, 2.4, 2.5, 2.6 of your text.
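The confidence interval for the slope and the test of a slope of .2 can also be reproduced by hand. The DATA step below is a sketch under stated assumptions: the slope and standard error are copied from the notes rather than taken from a fitted model, and the variable names (tstar, lower, upper, t02, f02, p02) are hypothetical. It uses TINV for the critical value and PROBT for the p-value.

```sas
data _null_;
  /* Values copied from the notes (assumed, not recomputed from the data) */
  b1    = 0.08938;   /* estimated slope              */
  se_b1 = 0.02771;   /* standard error of the slope  */
  df    = 78;        /* n - 2 with n = 80            */
  alpha = 0.05;      /* use 0.01 for a 99% interval  */

  /* Confidence interval: b1 +/- t* x s{b1} */
  tstar = tinv(1 - alpha/2, df);
  lower = b1 - tstar * se_b1;
  upper = b1 + tstar * se_b1;

  /* Test of H0: slope = 0.2 against a two-sided alternative */
  t02 = (b1 - 0.2) / se_b1;
  f02 = t02**2;
  p02 = 2 * (1 - probt(abs(t02), df));

  /* Write the results to the SAS log */
  put tstar= lower= upper=;
  put t02= f02= p02=;
run;
```

Changing alpha here mirrors the alpha= option on the model statement: a smaller alpha gives a larger critical value and a wider interval.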
Prediction Bands in Regression

Not only can we predict Y, we can also estimate the mean value of Y for a given X. The point estimate is the same $\hat{Y}$, but we also give a 95% confidence interval for the mean of Y given X. Here we use the estimate of the standard deviation of the residuals, $\hat{\sigma}$. For our study this is the Root MSE = .45373. The formula is a bit ugly, but for a given point $X_h$, we have $\hat{Y}_h = b_0 + b_1 X_h$ and the confidence interval has the form

$$\hat{Y}_h \pm t^*_{n-2}\, \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(X_h-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}}$$

So if we let $X_h = 2$ hours of study, we can calculate the 95% confidence interval for the mean GPA for all students who study 2 hours per week using the above formula.

Interpretation:

We can also make a confidence interval for the value of $Y_h$ for a single observation for a given $X_h$.

$$\hat{Y}_h \pm t^*_{n-2}\, \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(X_h-\bar{X})^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}}$$

So now I can be 95% sure that a student's GPA will be in the interval generated by the above formula when someone says he or she studies 2 hours per week.

Interpretation:

We can have SAS calculate these things for us.

    /* LCLM/UCLM: limits for the mean of Y; LCL/UCL: limits for a single Y */
    proc reg data=mydata.studygpa;
      model gpa=study;
      output out=mydata.regfit p=yhat r=resid
             lclm=lowmean uclm=upmean lcl=lowpred ucl=uppred;
    run; quit;

    /* One SYMBOL statement per overlaid plot, in plot order */
    symbol1 color=black value=dot  interpol=none;    /* observed points  */
    symbol2 color=black value=none interpol=spline;  /* fitted line      */
    symbol3 color=green value=none interpol=spline;  /* lower mean limit */
    symbol4 color=green value=none interpol=spline;  /* upper mean limit */
    symbol5 color=blue  value=none interpol=spline;  /* lower pred limit */
    symbol6 color=blue  value=none interpol=spline;  /* upper pred limit */

    /* Sort by X so the interpolated curves are drawn left to right */
    proc sort data=mydata.regfit;
      by study;
    run; quit;

    proc gplot data=mydata.regfit;
      plot gpa*study yhat*study lowmean*study upmean*study
           lowpred*study uppred*study / overlay;
      label gpa='Grade Point Average';
      label study='Hours of Study';
      title '95% Prediction Bands for GPA from Hours of Study';
    run; quit;

Practice Homework: p. 89 2.1a, 2.2, 2.3, 2.4 (use ch1gpa2.sas7bdat), 2.7a and 2.7b (use ch1plastic.sas7bdat).
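For the intervals at $X_h = 2$ hours of study discussed above, one common way to get SAS to do the arithmetic is to append a row with study=2 and a missing GPA. PROC REG leaves that row out of the fit but still computes its predicted value and limits. The sketch below assumes gpa and study are numeric and that mydata.studygpa has no other rows with a missing GPA. The data set names work.newpoint, work.studyplus, and work.fit2 are made up for illustration.

```sas
/* Hypothetical extra row: X_h = 2 hours, response left missing */
data work.newpoint;
  study = 2;
  gpa   = .;
run;

/* Stack the extra row under the original data */
data work.studyplus;
  set mydata.studygpa work.newpoint;
run;

/* Fit as before; the row with missing gpa is scored but not used in the fit */
proc reg data=work.studyplus;
  model gpa=study;
  output out=work.fit2 p=yhat
         lclm=lowmean uclm=upmean lcl=lowpred ucl=uppred;
run; quit;

/* Show only the appended row: limits for the mean and for one student */
proc print data=work.fit2;
  where gpa = .;
  var study yhat lowmean upmean lowpred uppred;
run;
```

The interval (lowpred, uppred) should come out wider than (lowmean, upmean), because the prediction formula carries the extra 1 under the square root.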