More Statistics tutorial at www.dumblittledoctor.com
Lecture notes on Regression & SAS example demonstration
Regression & Correlation (p. 215)
• When two variables are measured on a single experimental unit, the resulting data are called bivariate data.
• You can describe each variable individually, and you can also explore the relationship between the two variables.
Simple Linear Regression & Correlation (p. 214)
For quantitative variables one could employ methods of regression analysis. Regression analysis is an area of statistics that is concerned with finding a model that describes the relationship that may exist between variables and determining the validity of such a relationship.
Examples
• Do housing prices vary according to distance to a major freeway?
• Does respiration rate vary with altitude?
• Is snowfall related to elevation, and if so, what kind of relationship is there between these two variables?
• Speaking of snow, let's consider wind chill.
Example 10.1 (p. 214)
Suppose we are interested in determining the wind chill temperature. For those of us from regions where the winters are extremely cold (like North Dakota), we know that this temperature is dependent upon variables such as the wind velocity (speed and direction), the absolute temperature, relative humidity, etc.

Is the wind chill temp important?
Dependent (response) variable: wind chill temperature
Independent (regressor/predictor) variables: temp, wind velocity, relative humidity
• California: +85°
• Minneapolis: -23°; wind chill temp: -78°
What do you say to that????? Pretty cold if you ask me!
Regression analysis allows us
• to represent the relationship between the variables.
• to examine how the variable of interest (wind chill), often called the dependent or response variable, is affected by one or more control or independent variables (wind speed, actual temperature, relative humidity).
Correlation analysis will be used as a measure of the strength of the given relationship.
Note the following concepts:
• Quantitative variables may be classified according to types.

To study the relationship between variables, one could use the following as guides:
• Start by preparing a graph (scatterplot).
• Examine the graph for an overall pattern and deviations from that pattern (check for outliers, etc.).

Regression analysis provides us with
• a simplified view of the relationship between variables,
• a way of fitting a model with our data, and
• a means for evaluating the importance of the variables included in the model and the correctness of the model.
Response variable: a variable whose changes are of interest to an experimenter.
Explanatory variable: a variable that explains or causes changes in a response variable.
NOTE: We will generally denote the explanatory variable by x and the response variable by y.
Scatterplots
• Plot the explanatory (independent) variable on the horizontal axis & the response variable on the vertical axis.
• Look for pattern: form, direction & strength of relationship.
• Add numerical descriptive measures for additional information and support.
Association
Positive association: large values of one variable correspond to large values of the other.
Negative association: large values of one variable correspond to small values of the other.
Scatterplot of Diving Reflex
EXAMPLE 10.3 (p. 215): Physicians have used the so-called diving reflex to reduce abnormally rapid heartbeats in humans by submerging the patient's face in cold water. (The reflex, triggered by cold water temperatures, is an involuntary neural response that shuts off circulation to the skin, muscles, and internal organs, and diverts extra oxygen-carrying blood to the heart, lungs, and brain.) A research physician conducted an experiment to investigate the effects of various cold water temperatures on the pulse rates of 10 children, with the following results: (See Lecture Notes)
Correlation (p. 220)
If two variables are related in such a way that the value of one is indicative of the value of the other, we say the variables are correlated.
The correlation coefficient, ρ, is a measure of the strength of the linear relationship between two variables.
[Scatterplot of the diving data: the data look reasonably linear, with redpr decreasing as temp increases.]
See formulas on this page.
SOME NOTES (p. 221)
• The closer r is to ±1, the stronger the linear relationship.
• The closer r is to 0, the weaker the linear relationship.
• If r = ±1, the relationship is perfectly linear (all the points lie exactly on the line).
• r > 0 → as x increases, y increases (positive association).
• r < 0 → as x increases, y decreases (negative association).
• r = 0 → no linear association.
Your Task
• Read general guidelines.

PROC CORR (p. 223)
• Produces a correlation matrix, which lists the Pearson correlation coefficients between all sets of included variables.
• Produces descriptive statistics and the p-value for testing the population correlation coefficient ρ = 0 for each set of variables.

GENERAL FORM
proc corr data = datasetname options;
  by variables;
  var variables;
  with variables;
  partial variables;
See Lecture Notes for options.
EXAMPLE 10.11 (p. 224)
Refer to the previous example on diving reflex. Use SAS to find the correlation between reduction in pulse rate and cold water temperature.

SAS (p. 223)
proc corr;
  var list-of-variables;
NOTE: If you do not specify a list of variables, SAS will report the correlation between all pairs of variables.

We write the following SAS code:

options nocenter nodate ps=55 ls=70 nonumber;

/* Set up temporary SAS dataset named diving */
data diving;
input temp redpr @@;
datalines;
68 2 65 5 70 1 62 10 60 9
55 13 58 10 65 3 69 4 63 6
;
/* Use proc corr to obtain correlation
   noprob    suppress printing of p-value for testing rho = 0
   nosimple  suppress printing of desc stat */
proc corr noprob nosimple;
var temp redpr;
run;
quit;
Example 10.11 (p. 224) output:

            temp      redpr
temp     1.00000   -0.94135
redpr   -0.94135    1.00000

NOTE: The correlation matrix is symmetric, with 1's along the main diagonal and the correlation along the other diagonal.
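The matrix above can be reproduced outside SAS. Below is a minimal Python cross-check (the variable names are mine, not from the notes) that computes Pearson's r for the diving-reflex data using the lecture's Sxx, Syy, Sxy quantities:

```python
import math

# Diving-reflex data from Examples 10.3 / 10.11
temp  = [68, 65, 70, 62, 60, 55, 58, 65, 69, 63]  # cold water temperature
redpr = [2, 5, 1, 10, 9, 13, 10, 3, 4, 6]         # reduction in pulse rate

n = len(temp)
mean_x = sum(temp) / n
mean_y = sum(redpr) / n

# Sums of squares and cross-products, as in the lecture's formulas
Sxx = sum((x - mean_x) ** 2 for x in temp)
Syy = sum((y - mean_y) ** 2 for y in redpr)
Sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temp, redpr))

r = Sxy / math.sqrt(Sxx * Syy)
print(round(r, 5))  # matches PROC CORR's -0.94135 to 5 decimals
```

This is the same Pearson formula PROC CORR applies; only the bookkeeping is done by hand here.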
NOTE
Corr(X,X) = 1 (e.g., Corr(temp,temp) = 1)
Corr(X,Y) = Corr(Y,X)

Value & Interpretation
r = -0.94135 → strong inverse linear relationship between reduction in pulse rate and cold water temperatures.

SIMPLE LINEAR REGRESSION
GOAL: Find the equation of the line that best describes the linear relationship between the dependent variable and a single independent variable.
Simple ↔ single independent variable
Linear ↔ equation of a line
Deterministic Model:
    y = β0 + β1x
• Requires that all points lie exactly on the line
• Perfect linear relationship
• Linear in the parameters

Probabilistic Model:
    y = β0 + β1x + ε
• Does NOT require that all points lie exactly on the line
• Allows for some error/deviation from the line
Method of Least Squares
β0 and β1 are unknown parameters and need to be estimated. We want to estimate them so that the errors are minimized.

Estimate of slope:        b = β̂1 = Sxy/Sxx
Estimate of y-intercept:  a = β̂0 = ȳ − β̂1x̄

For a particular value of x:
Vertical distance = (observed value of y) − (predicted value of y obtained from the estimated regression equation)

    ε ~ N(0, σε²)
ε represents random, independent error.

We want to estimate the slope and y-intercept in such a way that

    min SSE = min Σ(i=1 to n) εi² = min Σ(i=1 to n) (yi − ŷi)²

giving

    ŷ = β̂0 + β̂1x = a + bx

the estimated regression equation (least squares regression equation).
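The least-squares formulas above can be checked numerically. Here is a short Python sketch (my own variable names) applying b = Sxy/Sxx and a = ȳ − b·x̄ to the diving-reflex data, along with the SSE shortcut Syy − b·Sxy that appears on the last page of the notes:

```python
# Least-squares estimates for the diving-reflex data (Example 10.3)
temp  = [68, 65, 70, 62, 60, 55, 58, 65, 69, 63]
redpr = [2, 5, 1, 10, 9, 13, 10, 3, 4, 6]

n = len(temp)
xbar = sum(temp) / n
ybar = sum(redpr) / n

Sxx = sum((x - xbar) ** 2 for x in temp)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(temp, redpr))

b = Sxy / Sxx          # estimated slope, beta1-hat
a = ybar - b * xbar    # estimated intercept, beta0-hat

# SSE via the shortcut: Syy - b*Sxy
Syy = sum((y - ybar) ** 2 for y in redpr)
SSE = Syy - b * Sxy

print(round(a, 5), round(b, 5), round(SSE, 5))
```

The printed values agree with the PROC REG output shown later (intercept 55.29417, slope -0.77156, error sum of squares 16.40653).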
PROC REG in SAS (p. 230)
GENERAL FORMAT:
proc reg data = dataset options;
  by variables;
  model dependentvariable = independentvariables / options;
  plot yvariable*xvariable symbol / options;
  output out = newdataset keywords = names;
**See Lecture Notes for options

EXAMPLE 10.15 (p. 232)
Refer to the previous example on diving reflex. Use SAS to find the estimated regression equation relating reduction in pulse rate and cold water temperature. We add the following SAS code to our existing code, just before the run statement:

proc reg;
model redpr = temp;

REMEMBER: model dependent = independent;

The REG Procedure
Model: MODEL1
Dependent Variable: redpr

Analysis of Variance

Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              1         127.69347      127.69347      62.26    <.0001
Error              8          16.40653        2.05082
Corrected Total    9         144.10000

Root MSE           1.43207    R-Square    0.8861
Dependent Mean     6.30000    Adj R-Sq    0.8719
Coeff Var         22.73122

Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    1              55.29417           6.22552       8.88      <.0001
temp         1              -0.77156           0.09778      -7.89      <.0001
Suppose x = 61. Then
    ŷ = β̂0 + β̂1x = 55.29417 − 0.77156x
    ŷ = 55.29417 − 0.77156(61) ≈ 8.23

Suppose x = 150. Would you use this equation? NO
Suppose x = 34. Would you use this equation? NO
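The plug-in prediction and the extrapolation warning can be sketched together. The helper below is hypothetical (my addition, not part of the notes): it simply refuses to predict outside the observed temperature range of the data, which is the point of the two "NO" answers above.

```python
# Fitted diving-reflex equation from the PROC REG output
B0, B1 = 55.29417, -0.77156

# Observed temp range in the diving data (55 to 70)
X_MIN, X_MAX = 55, 70

def predict_redpr(x):
    """Predict reduction in pulse rate; refuse to extrapolate."""
    if not (X_MIN <= x <= X_MAX):
        raise ValueError(f"x = {x} is outside the observed range [{X_MIN}, {X_MAX}]")
    return B0 + B1 * x

print(round(predict_redpr(61), 3))  # about 8.229
# predict_redpr(150) and predict_redpr(34) raise ValueError:
# the equation was fit only for temps between 55 and 70
```

The guard encodes the lesson that the equation is not universally valid; the cutoff values are an assumption based on the data range.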
THE LESSON: BE CAREFUL! This equation is NOT universally valid.

Evaluating the Regression Equation (p. 236)
Once we have the regression, we need to evaluate its effectiveness:
• Correlation
• Coefficient of Determination
• Test slope
• Validate assumptions

Coefficient of Determination, R² (p. 236)
• 0 ≤ R² ≤ 1
• The closer R² gets to 1, the better the fit we have.
• R² represents the proportion of variability in the dependent variable, y, that can be accounted for by the variability in the independent variable, x.
• In SLR, R² = (corr coeff)².
• R² is the reduction in SSE obtained by using the regression equation to predict y as opposed to just using the sample mean.
The coefficient of determination from the ANOVA table:

R² = (regression sum of squares)/(total sum of squares)
   = (model sum of squares)/(total sum of squares)
   = SSR/TSS = SSM/TSS

From the REG output above: Model SS = 127.69347 and Corrected Total SS = 144.10000, giving R-Square = 0.8861.
R² = 0.8861 → 88.61% of the variability in reduction in pulse rate can be accounted for by the variability in cold water temperature.
OR: One can get an 88.61% reduction in the SSE by using the model to predict the dependent variable instead of just using the sample mean to predict the dependent variable.
NOTE: This means that approximately 11.39% of the sample variability in reduction in pulse rate cannot be accounted for by the current model.
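Both readings of R² above (proportion of variability explained, and relative reduction in SSE versus predicting with the sample mean) can be verified from the ANOVA sums of squares. A Python sketch using the values from the PROC REG output:

```python
# Sums of squares from the Analysis of Variance table
SSR = 127.69347   # model (regression) sum of squares
SSE = 16.40653    # error sum of squares
TSS = SSR + SSE   # 144.10000, the corrected total

# Reading 1: proportion of variability accounted for by the model
r_squared = SSR / TSS

# Reading 2: relative reduction in SSE vs. predicting every y by ybar
# (predicting with the sample mean gives SSE equal to TSS)
reduction = (TSS - SSE) / TSS

print(round(r_squared, 4), round(reduction, 4))  # both 0.8861
```

The two quantities are algebraically identical because TSS = SSR + SSE, which is why the slides treat them as two interpretations of the same number.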
CI & Tests of Hypothesis
What if the slope = 0? You would have a horizontal line, so knowing x would not help predict y, and our regression equation would not be useful! We can perform a test of hypothesis (the usual t-test) to determine whether the slope is 0.

EXAMPLE 10.20 (p. 237)
Refer to the diving reflex example. Test whether the slope is significantly different from 0.
EXAMPLE 10.20 Soln
1. H0: β1 = 0
2. Ha: β1 ≠ 0
From the Parameter Estimates in the REG output above: for temp, t Value = -7.89 with Pr > |t| < .0001.
EXAMPLE 10.20 Soln (cont.)
3. p-value < 0.0001
4. RR: Reject H0 if p-value < α = 0.05
5. Since p-value < 0.0001 < 0.05 → reject H0 → conclude the slope is significantly different from 0.
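The t statistic in the Parameter Estimates table can be reproduced by hand: t = β̂1 / se(β̂1), where se(β̂1) = √(MSE/Sxx). A Python sketch with the diving-reflex numbers (MSE from the ANOVA table; Sxx computed earlier from the data):

```python
import math

# From the PROC REG output and the diving data
b1  = -0.77156   # estimated slope
MSE = 2.05082    # mean squared error, 8 df
Sxx = 214.5      # sum of squares of temp about its mean

se_b1 = math.sqrt(MSE / Sxx)   # standard error of the slope
t     = b1 / se_b1

print(round(se_b1, 5), round(t, 2))  # 0.09778 and -7.89, as in the SAS output
```

With 8 degrees of freedom, a t statistic this far from 0 gives the Pr > |t| < .0001 reported by SAS, which is why step 5 rejects H0.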
Confidence Intervals
    β̂1 ± t(α/2, n−2) · √(s²/Sxx)
where β̂1 is the point estimate, t(α/2, n−2) is the distribution point, and √(s²/Sxx) is the standard deviation of the point estimate.

Soft Drink Example (Handout)
A soft drink vendor, set up near a beach for the summer (clearly summer has not yet arrived in Riverside), was interested in examining the relationship between sales of soft drinks, y (in gallons per day), and the maximum temperature of the day, x. See Handout for data. Write a SAS program to read in and print out the data.
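For the diving-reflex slope, the interval formula above works out as follows. This sketch assumes the tabled critical value t(0.025, 8) = 2.306, which comes from a standard t table rather than from the notes:

```python
b1     = -0.77156   # slope estimate from the SAS output
se     = 0.09778    # its standard error from the SAS output
t_crit = 2.306      # t(0.025, 8): alpha = 0.05, n - 2 = 8 df (t table value)

lower = b1 - t_crit * se
upper = b1 + t_crit * se

print(round(lower, 4), round(upper, 4))
# the interval lies entirely below 0, consistent with rejecting H0: slope = 0
```

That the 95% interval excludes 0 tells the same story as the t-test in Example 10.20.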
options ls=78 nocenter nodate ps=55 nonumber;

/* Add titles */
title1 'Statistics 157 Extra SLR Example';
title2 'Winter 2008';
title3 'Linda M. Penas';
title4 'Question 1';

/* Create temporary SAS dataset and enter data */
data e1q1;
input x y @@;
datalines;
90 7.3 95 8.5 101 10.1 95 9.3
87 6.7 97 9.2 102 10.2 88 6.7
88 7.1 99 9.9 101 9.9 83 10.2
;

/* Print the data as a check */
proc print;
run;
Correlation Coeff for Example
Find and interpret the correlation between sales of soft drinks and maximum temp of the day. Add the following lines of code:

/* Use proc corr to generate correlation information
   nosimple  suppress printing of desc. statistics
   noprob    suppress printing of p-value for testing rho=0 */
proc corr nosimple noprob;
var x y;
run;

Correlation Output

The CORR Procedure
2 Variables: x y

Pearson Correlation Coefficients, N = 12

            x         y
x     1.00000   0.62180
y     0.62180   1.00000

r = 0.62180 → moderate positive linear relationship between max temp and soft drink sales.
Regression
Find the estimated regression equation ŷ = β̂0 + β̂1x.

/* Use proc reg to generate regression information
   model dependent = independent */
proc reg;
model y = x;
run;

Regression Output

Parameter Estimates

Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    1              -4.19781           5.17157      -0.81      0.4359
x            1               0.13808           0.05500       2.51      0.0309

    ŷ = −4.19781 + 0.13808x
Coefficient of Determination
Find and interpret the coefficient of determination.

Root MSE          1.17396    R-Square    0.3866
Dependent Mean    8.75833    Adj R-Sq    0.3253

R² = 0.3866 → only 38.66% of the variability in sales can be accounted for by the variability in max temperature. Bad model!

Intro to Residual Analysis (p. 243)
For each xi:
residuals = ei = observed errors = yi − ŷi,  i = 1, 2, ..., n,
where
yi = observed value (in the data)
ŷi = corresponding predicted or fitted value (calculated from the equation).
For a given value of x,
Residual = difference between what we observe in the data and what is predicted by the regression equation
         = amount the regression equation has not been able to explain
         = observed errors if the model is correct

We can examine the residuals through the use of various plots. Abnormalities would be indicated if
• The plot shows a fan shape (indicates violation of the common variance assumption).
• The plot shows a definite linear trend (indicates the need for a linear term in the model).
• The plot shows a quadratic shape (indicates the need for quadratic or cross-product terms in the model).

[Residual plot annotation: quadratic term needed]

NOTE: It is often easier to examine the standardized or studentized residuals. We can interpret them similarly to z-scores:
2 < |std residual| < 3  →  suspect outlier
|std residual| > 3      →  extreme outlier
(Outlier = doesn't seem to fit with the rest of the data; seems out of place.)
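The residuals and studentized residuals described above can be computed by hand for the diving-reflex fit. A sketch follows; the leverage formula hᵢ = 1/n + (xᵢ − x̄)²/Sxx used for studentizing is the standard SLR one, not spelled out in the notes:

```python
import math

temp  = [68, 65, 70, 62, 60, 55, 58, 65, 69, 63]
redpr = [2, 5, 1, 10, 9, 13, 10, 3, 4, 6]

n = len(temp)
xbar = sum(temp) / n
ybar = sum(redpr) / n
Sxx  = sum((x - xbar) ** 2 for x in temp)
Sxy  = sum((x - xbar) * (y - ybar) for x, y in zip(temp, redpr))

b = Sxy / Sxx
a = ybar - b * xbar

# Ordinary residuals e_i = y_i - yhat_i
resid = [y - (a + b * x) for x, y in zip(temp, redpr)]

# Studentized residuals: e_i / (s * sqrt(1 - h_i)), with leverage
# h_i = 1/n + (x_i - xbar)^2 / Sxx and s^2 = SSE / (n - 2)
SSE = sum(e ** 2 for e in resid)
s   = math.sqrt(SSE / (n - 2))
student = [e / (s * math.sqrt(1 - (1 / n + (x - xbar) ** 2 / Sxx)))
           for e, x in zip(resid, temp)]

# Flag suspect (2 < |r| < 3) or extreme (|r| > 3) observations
flags = [i for i, r in enumerate(student) if abs(r) > 2]
print(flags)  # empty list: no flagged observations for the diving data
```

Here s reproduces the Root MSE (1.43207) from the REG output, and none of the ten studentized residuals exceeds 2 in absolute value.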
Examine to see if there are any suspect or extreme outliers:

Obs     x        y       Fit   SE Fit   Residual   St Resid
 1    1.0    50.00     30.06    12.63      19.94       1.44
 2    2.0   110.00    101.03     8.03       8.97       0.53
 3    2.0    90.00    101.03     8.03     -11.03      -0.65
 4    3.0   150.00    163.86     6.45     -13.86      -0.79
 5    3.0   140.00    163.86     6.45     -23.86      -1.36
 6    3.0   180.00    163.86     6.45      16.14       0.92
 7    4.0   190.00    218.54     7.15     -28.54      -1.65
 8    6.0   310.00    303.47     8.47       6.53       0.39
 9    6.0   330.00    303.47     8.47      26.53       1.59
10    7.0   340.00    333.73     8.16       6.27       0.37
11    8.0   360.00    355.84     7.84       4.16       0.25
12   10.0   380.00    375.62    12.54       4.38       0.32
13   10.0   360.00    375.62    12.54     -15.62      -1.12

[Residual plot: looks random]

CONCLUSION
The plot shows no apparent pattern. Since 0 < |std res| < 2 for every observation → no suspect or extreme outliers either.
To get residuals and residual plots in SAS (EXAMPLE 10.26: Diving Reflex Example; the Soft Drink Example uses the same statements):

/* P = predicted values
   R = residuals
   Student = studentized residuals (act like z-scores)
   output out = datasetname */
proc reg;
model y = x / P R;
output out = a P = pred R = Resid Student = stdres;
run;

[Residual plot annotation: fanning out indicates non-constant variance]

Residual Plot
Generate a residual plot of student (studentized) residuals versus predicted values:

proc plot vpercent = 70 hpercent = 70;
plot stdres*pred;
run;
Residual Info

The REG Procedure
Model: MODEL1
Dependent Variable: y

Output Statistics
       Dep Var   Predicted      Std Error                Std Error    Student
Obs          y       Value   Mean Predict    Residual     Residual   Residual
  1     7.3000      8.2290        0.3991      -0.9290        1.104     -0.841
  2     8.5000      8.9194        0.3449      -0.4194        1.122     -0.374
  3    10.1000      9.7479        0.5198       0.3521        1.053      0.335
  4     9.3000      8.9194        0.3449       0.3806        1.122      0.339
  5     6.7000      7.8148        0.5060      -1.1148        1.059     -1.052
  6     9.2000      9.1956        0.3810     0.004426        1.110    0.00399
  7    10.2000      9.8860        0.5626       0.3140        1.030      0.305
  8     6.7000      7.9529        0.4667      -1.2529        1.077     -1.163
  9     7.1000      7.9529        0.4667      -0.8529        1.077     -0.792
 10     9.9000      9.4717        0.4423       0.4283        1.087      0.394
 11     9.9000      9.7479        0.5198       0.1521        1.053      0.145
 12    10.2000      7.2625        0.6854       2.9375        0.953      3.082

NOTE: Observation 12, the point (83, 10.2), has studentized residual 3.082 > 3 → extreme outlier.

Residual Plot
Generate a residual plot of student (studentized) residuals versus predicted values:

proc plot vpercent = 70 hpercent = 70;
plot stdres*pred;
run;
PART 2
Generate new information with the outlier (83, 10.2) removed:

title4 'Question 2';

data e1q2;
input x y @@;
datalines;
90 7.3 95 8.5 101 10.1 95 9.3
87 6.7 97 9.2 102 10.2 88 6.7
88 7.1 99 9.9 101 9.9
;
proc print;

proc corr nosimple noprob;
var x y;

/* Make sure you use different names for your
   residuals so you do not overwrite the old ones */
proc reg;
model y = x / P R;
output out = b P = pred1 R = resid1 Student = stdres1;

proc plot vpercent = 70 hpercent = 70;
plot stdres1*pred1;
run;
New Output

Output Statistics
       Dep Var   Predicted      Std Error                Std Error    Student
Obs          y       Value   Mean Predict    Residual     Residual   Residual
  1     7.3000      7.4515        0.1114      -0.1515        0.254     -0.597
  2     8.5000      8.6716        0.0835      -0.1716        0.264     -0.650
  3    10.1000     10.1358        0.1262      -0.0358        0.247     -0.145
  4     9.3000      8.6716        0.0835       0.6284        0.264      2.380
  5     6.7000      6.7194        0.1459      -0.0194        0.235    -0.0823
  6     9.2000      9.1597        0.0899       0.0403        0.262      0.154
  7    10.2000     10.3799        0.1380      -0.1799        0.240     -0.749
  8     6.7000      6.9634        0.1336      -0.2634        0.243     -1.086
  9     7.1000      6.9634        0.1336       0.1366        0.243      0.563
 10     9.9000      9.6478        0.1052       0.2522        0.256      0.985
 11     9.9000     10.1358        0.1262      -0.2358        0.247     -0.957
Normality of Residuals (add-on)
One should continue to remove the potential outliers and generate new models, residuals, etc., until reaching the final information on pages 6-7.

Normality test:

proc univariate normal;
ods select TestsForNormality;
var stdres;
run;
Example
The UNIVARIATE Procedure
Variable: stdres (Studentized Residual)

Tests for Normality
Test                  --Statistic---    -----p Value------
Shapiro-Wilk          W      0.897717   Pr < W      0.2068
Kolmogorov-Smirnov    D      0.250982   Pr > D      0.0739
Cramer-von Mises      W-Sq   0.106686   Pr > W-Sq   0.0818
Anderson-Darling      A-Sq   0.577399   Pr > A-Sq   0.0989

Normality Test
1. H0: errors are normally distributed
2. Ha: errors are not normally distributed
3. TS: p-value = 0.2068 (Shapiro-Wilk)
4. RR: Reject H0 if p-value < α = 0.05
5. Since p-value = 0.2068 is not < α = 0.05 → do not reject H0 → ok to assume the errors are normally distributed.
Some Relationships
Sxy ≤ 0 → β̂1 ≤ 0, r ≤ 0
Sxy ≥ 0 → β̂1 ≥ 0, r ≥ 0
Sxy = 0 → β̂1 = 0, r = 0

SOME MORE INFO
Total sum of squares = TSS = Syy
  = SSE (sum of squares of the error)
  + SSR (sum of squares due to the regression model)
TSS is constant for a given set of data. SSE and SSR vary depending on the model: change the model and SSE and SSR may/will change (but their sum is always constant = TSS).
TSS = Syy = Σ(i=1 to n) (yi − ȳ)²
SSE = Σ(i=1 to n) (yi − ŷi)² = Syy − β̂1·Sxy
SSR = TSS − SSE
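The decomposition above can be verified numerically on the diving-reflex data. A Python sketch checking both TSS = SSE + SSR and the shortcut SSE = Syy − β̂1·Sxy:

```python
temp  = [68, 65, 70, 62, 60, 55, 58, 65, 69, 63]
redpr = [2, 5, 1, 10, 9, 13, 10, 3, 4, 6]

n = len(temp)
xbar = sum(temp) / n
ybar = sum(redpr) / n

Sxx = sum((x - xbar) ** 2 for x in temp)
Syy = sum((y - ybar) ** 2 for y in redpr)   # this is TSS
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(temp, redpr))

b = Sxy / Sxx
a = ybar - b * xbar

# SSE two ways: directly from the residuals, and via the shortcut
SSE_direct   = sum((y - (a + b * x)) ** 2 for x, y in zip(temp, redpr))
SSE_shortcut = Syy - b * Sxy
SSR = Syy - SSE_direct

print(round(Syy, 2), round(SSE_direct, 5), round(SSR, 5))
```

The printed values match the ANOVA table from PROC REG (144.10, 16.40653, 127.69347), confirming that the decomposition holds for this data set.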