Chapter 2
Recall the example:
Example 1: Suppose we want to know whether students’ GPAs are related to how much they
study. A student project group surveyed a random sample of 80 students from their university
and asked them their GPA and how many hours a week they study.
symbol1 value=dot color=blue interpol=R;
proc gplot data=mydata.studygpa;
plot gpa*study;
run;
quit;
proc univariate data=mydata.studygpa;
var gpa study;
run;
quit;
proc reg data=mydata.studygpa;
model gpa=study;
output out=mydata.regfit p=yhat r=resid;
run;
quit;
We used the applet available at http://www.rossmanchance.com/applets/RegSim/RegCoeff.html
We are testing the null and alternative hypotheses:
$H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$
Running the applet 1000 times from a population where there is no relationship between X and
Y finds that obtaining a slope as large as the observed slope ($b_1 = .08938$) is very rare
(approximately 3 out of 1000 times).
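The idea behind the applet can be sketched in a few lines of Python. This is only an illustration, not the applet's actual code: the population used here (study hours uniform on 0 to 6.4, GPA normal with mean 3.0 and standard deviation 0.45, drawn independently of each other) is an assumption chosen to be roughly on the scale of the study data, and the count below uses the absolute value of the slope.

```python
import random

def slope(xs, ys):
    """Least-squares slope b1 = Sxy / Sxx."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

random.seed(1)            # fixed seed so the run is repeatable
n, reps = 80, 1000        # sample size and number of simulated samples
observed_b1 = 0.08938     # slope reported by SAS in the notes

slopes = []
for _ in range(reps):
    # Hypothetical population: X and Y are independent, so the true slope is 0.
    xs = [random.uniform(0, 6.4) for _ in range(n)]    # assumed study hours
    ys = [random.gauss(3.0, 0.45) for _ in range(n)]   # assumed GPA values
    slopes.append(slope(xs, ys))

mean_b1 = sum(slopes) / reps
sd_b1 = (sum((b - mean_b1) ** 2 for b in slopes) / (reps - 1)) ** 0.5
extreme = sum(abs(b) >= observed_b1 for b in slopes)

print(f"mean of simulated slopes: {mean_b1:.4f}")
print(f"SD of simulated slopes:   {sd_b1:.4f}")
print(f"runs with |b1| >= {observed_b1}: {extreme} out of {reps}")
```

Under these assumptions the simulated slopes should pile up around 0, and only a handful of runs should reach the observed slope.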
Hit reset and run the simulation again and look at the histogram for the estimates of the slope $b_1$.
For my simulation, this is centered very close to 0 and has a standard deviation of .028. Notice
this is very close to what SAS calculated as the standard error of the slope (.02771). How does
SAS calculate this number?
The textbook denotes this standard error of the slope as $s\{b_1\}$, and the formula is
$$s\{b_1\} = \sqrt{\frac{MSE}{\sum_{i=1}^{n}(X_i - \bar{X})^2}} = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 / (n-2)}{\sum_{i=1}^{n}(X_i - \bar{X})^2}} = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{(n-2)\sum_{i=1}^{n}(X_i - \bar{X})^2}} = \sqrt{\frac{16.07787}{78(268.149219)}} = .02773$$
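As a quick check of the arithmetic, the same calculation can be done in Python. The three input numbers are the ones quoted in the notes; the variable names are only illustrative.

```python
import math

sse = 16.07787      # sum of squared residuals, from the notes
sxx = 268.149219    # sum of (X_i - Xbar)^2, from the notes
n = 80              # number of students in the sample

mse = sse / (n - 2)             # mean squared error
se_b1 = math.sqrt(mse / sxx)    # standard error of the slope, s{b1}

print(f"MSE = {mse:.5f}")
print(f"s{{b1}} = {se_b1:.5f}")
```

The result should agree with the hand calculation to about five decimal places; the small gap from the .02771 that SAS prints presumably comes from rounding in the inputs.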
The test statistic is calculated as:
$$t = \frac{b_1}{s\{b_1\}} = \frac{.08938}{.02771} = 3.23$$

This generates a p-value of .0018.
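For readers who want to see where a p-value like this comes from, here is a minimal Python sketch. It assumes the p-value is two-sided with n - 2 = 78 degrees of freedom, and it integrates the t density numerically with Simpson's rule. This is not how SAS computes it; the helper names `t_pdf` and `two_sided_p` are made up for this example.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the area under the density on [-|t|, |t|]."""
    a = abs(t)
    h = a / steps
    # Simpson's rule on [0, |t|]; steps must be even.
    total = t_pdf(0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3          # area between 0 and |t|
    return 1 - 2 * area           # the density is symmetric about 0

b1, se_b1, n = 0.08938, 0.02771, 80   # values from the SAS output in the notes
t_stat = b1 / se_b1
p_value = two_sided_p(t_stat, n - 2)

print(f"t = {t_stat:.2f}")
print(f"two-sided p-value = {p_value:.4f}")
```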
Most of the time we compare the p-value to 0.05. The 0.05 value is called the “level of
significance.” Sometimes we want to be more confident in our conclusion that there is a
relationship between Y and X, so we use a level of significance (denoted $\alpha$) of 0.01
instead of .05. Some folks might use $\alpha = .1$, but this is not generally accepted in the scientific
literature.
It is common practice in journals to give the estimate of the slope ($b_1 = .08938$), the test statistic
($t = 3.23$) and the p-value (.0018) when reporting Simple Linear Regression results.
We can also use $s\{b_1\}$ to estimate with 95% confidence the true slope between study hours and
GPA. We use the formula
$$b_1 \pm t^*_{n-2}\, s\{b_1\} = .08938 \pm 1.991(.02771) = (.03421, .14455)$$
Applet available at http://www.stat.tamu.edu/~west/applets/tdemo.html
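The interval arithmetic is easy to verify in Python. The numbers below are the ones from the formula above; $t^* = 1.991$ is taken from the notes rather than computed.

```python
b1 = 0.08938       # estimated slope
se_b1 = 0.02771    # standard error of the slope, s{b1}
t_star = 1.991     # t* for 95% confidence with n - 2 = 78 df, as used in the notes

margin = t_star * se_b1                   # margin of error
lower, upper = b1 - margin, b1 + margin   # endpoints of the interval

print(f"margin of error = {margin:.5f}")
print(f"95% CI for the slope: ({lower:.5f}, {upper:.5f})")
```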
proc reg data=mydata.studygpa;
model gpa=study /clb;
output out=mydata.regfit p=yhat r=resid;
run;
quit;
Interpretation:
Another way to do the hypothesis test is to determine if 0 is in the confidence interval or not.
Rules:
We can calculate other confidence intervals by adding the alpha= option to the model
statement.
proc reg data=mydata.studygpa;
model gpa=study / alpha=.01 clb;
output out=mydata.regfit p=yhat r=resid;
run;
quit;
Interpretation:
The default is alpha=.05, which generates 95% confidence intervals. When alpha=.1, SAS
generates 90% confidence intervals, and when alpha=.01, SAS generates 99% confidence
intervals.
Sometimes, we want to test a hypothesis for a slope other than 0. We may have some prior
belief on what the relationship between X and Y should be. Suppose we believed prior to
collecting any data that adding an extra hour of study time per week should improve a GPA by .2
points. This is how we test that. First we modify our null and alternative hypotheses:
$H_0: \beta_1 = .2$ versus $H_a: \beta_1 \neq .2$
We calculate the test statistic slightly differently:
$$t = \frac{b_1 - \beta_1}{s\{b_1\}}, \quad F = t^2$$
$$t = \frac{.08938 - .2}{.02771} = -3.9921, \quad F = (-3.9921)^2 = 15.94$$
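Here is the same calculation in Python as a sanity check. The name `beta1_null` for the hypothesized slope is only illustrative.

```python
b1 = 0.08938        # estimated slope
se_b1 = 0.02771     # standard error of the slope, s{b1}
beta1_null = 0.2    # hypothesized slope under the null hypothesis

t_stat = (b1 - beta1_null) / se_b1   # negative: the estimate is below .2
f_stat = t_stat ** 2                 # F statistic is the square of t

print(f"t = {t_stat:.4f}")
print(f"F = {f_stat:.2f}")
```

The sign of t is negative because the estimated slope falls below the hypothesized value; squaring it removes the sign, which is why F cannot tell you the direction of the difference.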
The SAS Code to do this test is:
proc reg data=mydata.studygpa;
model gpa=study ;
test study=.2;
output out=mydata.regfit p=yhat r=resid;
run;
quit;
The p-value is 0.0001.
We reject the null hypothesis and conclude:
Skip Sections 2.2, 2.3, 2.4, 2.5, and 2.6 of your text.
Prediction Bands in Regression
Not only can we predict Y, we can also estimate the mean value of Y for a given X. The estimate
is again $\hat{Y}$, but now we give a 95% confidence interval for the mean of Y given X. Here we use
the estimate of the standard deviation of the residuals, $\hat{\sigma}$. For our study this is the
Root MSE = .45373. The formula is a bit ugly, but for a given point $X_h$, we have
$$\hat{Y}_h = b_0 + b_1 X_h$$
the confidence interval has the form
$$\hat{Y}_h \pm t^*_{n-2}\, \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$$
So if we let $X_h = 2$ hours of study, we can calculate the 95% confidence interval for the mean
GPA for all students who study 2 hours per week using the above formula.
Interpretation:
We can also make an interval for the value of $Y_h$ for a single observation at a given
$X_h$ (a prediction interval):
$$\hat{Y}_h \pm t^*_{n-2}\, \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$$
So, now I can be 95% sure that a student’s GPA will be in the interval generated by the above
formula when someone says he or she studies 2 hours per week.
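The two interval formulas differ only by the extra 1 under the square root, which a short Python sketch makes concrete. The values of n, the sum of squares for X, the Root MSE, $t^*$, and $b_1$ come from the notes. The intercept `b0` and the mean study time `x_bar` are NOT given in the notes, so the values below are placeholders for illustration only, and the printed intervals are not the real answers for this data set.

```python
import math

def interval_half_widths(x_h, n, x_bar, sxx, root_mse, t_star):
    """Return (half-width of the CI for the mean, half-width of the prediction interval)."""
    inner = 1 / n + (x_h - x_bar) ** 2 / sxx                # term shared by both formulas
    mean_hw = t_star * root_mse * math.sqrt(inner)          # CI for the mean of Y at x_h
    pred_hw = t_star * root_mse * math.sqrt(1 + inner)      # interval for a single new Y at x_h
    return mean_hw, pred_hw

# Values taken from the notes
n, sxx, root_mse, t_star, b1 = 80, 268.149219, 0.45373, 1.991, 0.08938

# Placeholder values, NOT from the notes (hypothetical intercept and mean of X)
b0, x_bar = 2.9, 3.5

x_h = 2                        # hours of study per week
y_hat = b0 + b1 * x_h          # point estimate at x_h
mean_hw, pred_hw = interval_half_widths(x_h, n, x_bar, sxx, root_mse, t_star)

print(f"CI for the mean GPA:      ({y_hat - mean_hw:.3f}, {y_hat + mean_hw:.3f})")
print(f"Interval for one student: ({y_hat - pred_hw:.3f}, {y_hat + pred_hw:.3f})")
```

Whatever values are plugged in, the interval for a single student is always wider than the interval for the mean, because an individual GPA varies around the mean by roughly the Root MSE.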
Interpretation:
We can have SAS calculate these things for us.
proc reg data=mydata.studygpa;
model gpa=study;
output out=mydata.regfit p=yhat r=resid lclm=lowmean
uclm=upmean lcl=lowpred ucl=uppred;
run;
symbol1 color=black value=dot interpol=none;
symbol2 color=black interpol=spline value=none;
symbol3 color=green interpol=spline value=none;
symbol4 color=green interpol=spline value=none;
symbol5 color=blue interpol=spline value=none;
symbol6 color=blue interpol=spline value=none;
proc sort data=mydata.regfit;
by study;
run;
quit;
proc gplot data=mydata.regfit;
plot gpa*study yhat*study lowmean*study upmean*study
lowpred*study uppred*study/overlay;
label gpa='Grade Point Average';
label study='Hours of Study';
title '95% Prediction Bands for GPA from Hours of Study';
run;
quit;
Practice Homework:
p. 89 2.1a, 2.2, 2.3, 2.4 (use ch1gpa2.sas7bdat), 2.7a and 2.7b (use ch1plastic.sas7bdat).