EPIB 698D Lecture 6
Raul Cruz-Cano
Spring 2013
DETERMINING NORMALITY (OR LACK THEREOF)
• One of the first steps in test selection should be
investigating the distribution of the data.
• PROC UNIVARIATE can be implemented to determine
whether or not your data are normal.
– If the population from which the data are obtained is normal,
the mean and median should be equal or close to equal.
– The skewness coefficient, which is a measure of symmetry,
should be near zero. Positive values for the skewness coefficient
indicate that the data are right skewed, and negative values
indicate that the data are left skewed.
– The kurtosis coefficient, which measures the peakedness and tail
weight of the distribution, should also be near zero.
– Positive values for the kurtosis coefficient indicate that the
distribution of the data is steeper (more peaked) than a normal distribution,
and negative values for kurtosis indicate that the distribution of
the data is flatter than a normal distribution.
DETERMINING NORMALITY (OR LACK THEREOF)
• The NORMAL option in PROC UNIVARIATE produces a table with tests for
normality.
– Shapiro-Wilk Statistic, EDF Goodness-of-Fit Tests, Kolmogorov D Statistic,
Anderson-Darling Statistic, Cramér-von Mises Statistic
– In general, if the p-values are less than 0.05, then the data should be
considered non-normally distributed.
– However, it is important to remember that these tests are heavily dependent
on sample size.
– Strikingly non-normal data may have a p-value greater than 0.05 due to a
small sample size. Therefore, graphical representations of the data should
always be examined.
• The PLOTS option in PROC UNIVARIATE creates low-resolution stem-and-leaf,
box, and normal probability plots.
– The stem-and-leaf plot is used to visualize the overall distribution of the data
and the box plot is a graphical representation of the 5-number summary.
– The normal probability plot is designed to investigate whether a variable is
normally distributed. If the data are normal, then the plot should display a
straight diagonal line. Different departures from the straight diagonal line
indicate different types of departures from normality.
DETERMINING NORMALITY
• The HISTOGRAM statement in PROC UNIVARIATE will produce high-resolution
histograms.
• PROC UNIVARIATE is an invaluable tool in visualizing and summarizing data
in order to gain an understanding of the underlying populations from
which the data are obtained. To produce these results, the following code
can be used.
PROC UNIVARIATE data=datafile normal plots;
Histogram;
Var variable1 variable2 ... variablen;
Run;
• The determination of the normality of the data should result from
evaluation of the graphical output in conjunction with the numerical
output.
• In addition, the user might wish to look at subsets of the data; for
example, a CLASS statement might be used to stratify by gender.
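A minimal sketch of such a stratified check (the data set and variable names below are placeholders, not from the examples in this lecture):
PROC UNIVARIATE data=datafile normal plots;
class gender;
var variable1;
histogram variable1 / normal;
RUN;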
Normality Test: Box Plot
One-sample t-test?
DATA relieftime;
INPUT relief;
DATALINES;
90
93
93
99
98
100
103
104
99
102
;
PROC UNIVARIATE DATA = relieftime normal plot;
VAR relief;
histogram relief / midpoints = 80 to 120 by 5 normal;
RUN;
When used in conjunction with the NORMAL option, the histogram will have a
line indicating the shape of a normal distribution with the same mean and
variance as the sample.
Tests for Normality
• The histogram shows most observations
falling at the peak of the normal curve.
• The box-plot shows that the mean falls on the
median (*--+--*), indicating no skewed data.
• The formal tests of normality in the output are
non-significant, indicating these data come
from a normal distribution.
• We can assume the data are normally
distributed and proceed with the one-sample
t-test.
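A minimal sketch of the one-sample test itself (the hypothesized mean of 100 is only an assumed example value, not part of the original exercise):
proc ttest data=relieftime h0=100; /* h0= sets the hypothesized mean; 100 is an assumed value */
var relief;
run;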
Normality Test: Box Plot
Paired t-test?
DATA study;
INPUT before after;
DATALINES;
90 95
87 92
100 104
80 89
95 101
90 105
;
PROC UNIVARIATE DATA = study normal plot;
VAR before after;
histogram before after / normal;
RUN;
Tests for Normality
• There are so few data points that the
histograms are difficult to interpret.
• The box-plots for before and after both show
the mean very close to the median, suggesting
the data are not skewed.
• The tests of normality for before and after
have p-values > alpha, indicating we do not
reject the assumption of normality.
• We can proceed with the matched pairs t-test.
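A minimal sketch of the matched pairs test on the study data:
proc ttest data=study;
paired before*after; /* tests whether the mean of (before - after) is zero */
run;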
Tests for Normality
Two-sample t-test?
DATA response;
INPUT group $ time;
DATALINES;
c 80
c 93
c 83
c 89
c 98
t 100
t 103
t 104
t 99
t 102
;
PROC UNIVARIATE DATA = response normal plot;
class group;
var time;
histogram time / midpoints = 80 to 120 by 5 normal;
RUN;
A few notes:
• The code has specified that the univariate
procedure be performed on the variable time,
but that it is done by the class “group.” This
way you will have separate summary statistics,
plots and histograms for the treatment and
control groups.
Tests for Normality
• The tests for normality for both the
treatment and control groups are non-significant
(p-value > alpha), indicating we can assume they
come from a normal distribution.
• Because each group only has 5 subjects,
the histograms are difficult to interpret,
but there is no indication of non-normality.
• Proceed with the two-sample t-test.
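A minimal sketch of the two-sample test on the response data:
proc ttest data=response;
class group; /* c = control, t = treatment */
var time;
run;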
Histograms for control and treatment groups
[Comparative histograms of time for the control (c) and treatment (t) groups; vertical axis: Percent (0 to 100), horizontal axis: time (80 to 120).]
Another Example
• A semiconductor manufacturer produces printed circuit boards that
are sampled to determine the thickness of their copper plating.
• The following statements create a data set named Trans, which
contains the plating thicknesses (Thick) of 100 boards:
data Trans;
input Thick @@;
label Thick = 'Plating Thickness (mils)';
datalines;
3.468 3.428 3.509 3.516 3.461 3.492 3.478 3.556 3.482 3.512 3.490 3.467 3.498
3.519 3.504 3.469 3.497 3.495 3.518 3.523 3.458 3.478 3.443 3.500 3.449
3.525 3.461 3.489 3.514 3.470 3.561 3.506 3.444 3.479 3.524 3.531 3.501
3.495 3.443 3.458 3.481 3.497 3.461 3.513 3.528 3.496 3.533 3.450 3.516
3.476 3.512 3.550 3.441 3.541 3.569 3.531 3.468 3.564 3.522 3.520 3.505
3.523 3.475 3.470 3.457 3.536 3.528 3.477 3.536 3.491 3.510 3.461 3.431
3.502 3.491 3.506 3.439 3.513 3.496 3.539 3.469 3.481 3.515 3.535 3.460
3.575 3.488 3.515 3.484 3.482 3.517 3.483 3.467 3.467 3.502 3.471 3.516
3.474 3.500 3.466
;
run;
Example
title 'Analysis of Plating Thickness';
proc univariate data=Trans;
histogram Thick / normal(percents=20 40 60 80 midpercents) name='MyPlot';
run;
Q-Q Plots
• The following properties of Q-Q plots and probability plots
make them useful diagnostics of how well a specified
theoretical distribution fits a set of measurements:
– If the quantiles of the theoretical and data distributions agree,
the plotted points fall on or near the line y = x.
– If the theoretical and data distributions differ only in their
location or scale, the points on the plot fall on or near the line
y = ax + b. The slope a and intercept b are visual estimates of the scale and
location parameters of the theoretical distribution.
• Q-Q plots are more convenient than probability plots for
graphical estimation of the location and scale parameters
because the x-axis of a Q-Q plot is scaled linearly.
• On the other hand, probability plots are more convenient
for estimating percentiles or probabilities.
Q-Q plots Example
• Data set Measures, which contains the
measurements of the diameters of 50 steel rods
in the variable Diameter:
data Measures;
input Diameter @@;
label Diameter = 'Diameter (mm)';
datalines;
5.501 5.251 5.404 5.366 5.445 5.576 5.607 5.200 5.977 5.177
5.332 5.399 5.661 5.512 5.252 5.404 5.739 5.525 5.160 5.410
5.823 5.376 5.202 5.470 5.410 5.394 5.146 5.244 5.309 5.480
5.388 5.399 5.360 5.368 5.394 5.248 5.409 5.304 6.239 5.781
5.247 5.907 5.208 5.143 5.304 5.603 5.164 5.209 5.475 5.223
;
run;
Q-Q plots Example
symbol v=plus;
title 'Normal Q-Q Plot for Diameters';
proc univariate data=Measures noprint;
qqplot Diameter / normal square vaxis=axis1;
axis1 label=(a=90 r=0);
run;
Probability Plots
• The PROBPLOT statement creates a probability plot,
which compares ordered variable values with the
percentiles of a specified theoretical distribution. If the
data distribution matches the theoretical distribution,
the points on the plot form a linear pattern.
Consequently, you can use a probability plot to
determine how well a theoretical distribution models a
set of measurements.
• Probability plots are similar to Q-Q plots, which you
can create with the QQPLOT statement. Probability
plots are preferable for graphical estimation of
percentiles, whereas Q-Q plots are preferable for
graphical estimation of distribution parameters.
Probability Plot Example
proc univariate data=Measures;
probplot Length1 Length2 / normal(mu=10 sigma=0.3) square ctext=blue;
run;
(Note that this example assumes a data set containing the variables Length1 and
Length2; with the Measures data above, the analogous request would be
PROBPLOT Diameter.)
You can check against other distributions: lognormal, gamma, beta, etc.
Collinearity
• When a regressor is nearly a linear combination of other regressors
in the model, the affected estimates are unstable and have high
standard errors.
• This problem is called collinearity or multicollinearity.
• It is a good idea to find out which variables are nearly collinear with
which other variables.
• Consequences of high multicollinearity:
– Increased standard error of estimates of the β’s (decreased reliability).
– Often confusing and misleading results
• The approach in PROC REG follows that of Belsley, Kuh, and Welsch
(1980). PROC REG provides several methods for detecting
collinearity with the COLLIN, COLLINOINT, TOL, and VIF options.
Collinearity
• The COLLIN option in the MODEL statement requests that a
collinearity analysis be performed.
• Belsley, Kuh, and Welsch (1980) suggest that, when this
number (the condition index) is around 10, weak dependencies might be starting
to affect the regression estimates. When this number is
larger than 100, the estimates might have a fair amount of
numerical error (although the statistical standard error
almost always is much greater than the numerical error).
• For each variable, PROC REG produces the proportion of
the variance of the estimate accounted for by each
principal component. A collinearity problem occurs when a
component associated with a high condition index
contributes strongly (variance proportion greater than
about 0.5) to the variance of two or more variables.
Collinearity
• The VIF option in the MODEL statement provides
the variance inflation factors (VIF). These factors
measure the inflation in the variances of the
parameter estimates due to collinearities that
exist among the regressor (independent)
variables. There are no formal criteria for
deciding if a VIF is large enough to affect the
predicted values.
• The TOL option requests the tolerance values for
the parameter estimates
Example Collinearity
Aerobic fitness (measured by the ability to consume oxygen) is fit to some simple
exercise tests. The goal is to develop an equation to predict fitness based on the
exercise tests rather than on expensive and cumbersome oxygen consumption
measurements
data fitness;
input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
datalines;
44 89.47 44.609 11.37 62 178 182 40 75.07 45.313 10.07 62 185 185 44 85.84 54.297 8.65 45 156 168
42 68.15 59.571 8.17 40 166 172 38 89.02 49.874 9.22 55 178 180 47 77.45 44.811 11.63 58 176
176 40 75.98 45.681 11.95 70 176 180 43 81.19 49.091 10.85 64 162 170 44 81.42 39.442 13.08 63
174 176 38 81.87 60.055 8.63 48 170 186 44 73.03 50.541 10.13 45 168 168 45 87.66 37.388 14.03
56 186 192 45 66.45 44.754 11.12 51 176 176 47 79.15 47.273 10.60 47 162 164 54 83.12 51.855
10.33 50 166 170 49 81.42 49.156 8.95 44 180 185 51 69.63 40.836 10.95 57 168 172 51 77.91
46.672 10.00 48 162 168 48 91.63 46.774 10.25 48 162 164 49 73.37 50.388 10.08 67 168 168 57
73.37 39.407 12.63 58 174 176 54 79.38 46.080 11.17 62 156 165 52 76.32 45.441 9.63 48 164 166
50 70.87 54.625 8.92 48 146 155 51 67.25 45.118 11.08 48 172 172 54 91.63 39.203 12.88 44 168
172 51 73.71 45.790 10.47 59 186 188 57 59.08 50.545 9.93 49 148 155 49 76.32 48.673 9.40 56
186 188 48 61.24 47.920 11.50 52 170 176 52 82.78 47.467 10.50 53 170 172
;
Run;
Example Collinearity
proc reg data=fitness;
model Oxygen=RunTime Age Weight RunPulse MaxPulse RestPulse / tol vif collin;
run;
In the output, look for:
• dangerously high VIFs,
• rows of the collinearity diagnostics with a large condition number (CN), and
• the predictors that have a large proportion of variance in each such row.
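As a cross-check on the reported diagnostics, the VIF for any single regressor equals 1/(1 - R-square), where R-square comes from regressing that variable on the remaining regressors (and the tolerance is 1 - R-square). A minimal sketch for RunPulse; any other regressor could be substituted:
proc reg data=fitness;
/* R-square of this model gives VIF(RunPulse) = 1/(1 - R-square) and TOL(RunPulse) = 1 - R-square */
model RunPulse = RunTime Age Weight RestPulse MaxPulse;
run;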
Solution
• Check correlations, then redefine variables
– Remove or average redundant ones
• Variable selection, model re-specification
– Use forward or backward selection in the regression
analysis
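A minimal sketch of automated selection on the fitness data (backward elimination shown; SELECTION=FORWARD or SELECTION=STEPWISE follow the same pattern):
proc reg data=fitness;
model Oxygen = RunTime Age Weight RunPulse MaxPulse RestPulse / selection=backward;
run;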
Let’s revisit the examples from
previous classes…
data blood;
INFILE 'F:\blood.txt';
INPUT subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
run;
data blood1; set blood;
if gender='Female' then sex=1; else sex=0;
if bloodtype='A' then typeA=1; else typeA=0;
if bloodtype='B' then typeB=1; else typeB=0;
if bloodtype='AB' then typeAB=1; else typeAB=0;
if age_group='Old' then Age_old=1; else Age_old=0;
run;
Check normality and collinearity of RBC, WBC and cholesterol
PROC UNIVARIATE DATA = blood1 normal plot;
class gender;
var RBC;
histogram RBC / normal;
qqplot RBC / normal square ;
RUN;
proc reg DATA = blood1;
model cholesterol =RBC WBC / vif collin;
run;
title 'Paired Comparison';
data pressure;
input SBPbefore SBPafter @@;
diff_BP=SBPafter-SBPbefore ;
datalines;
120 128 124 131 130 131 118 127
140 132 128 125 140 141 135 137
126 118 130 132 126 129 127 135
;
run;
data paired;
input lossa lossj;
diff=lossa-lossj;
datalines ;
+4 -8
+3 -10
0 -12
-3 -16
-4 -18
-5 -20
-11 -21
-14 -24
-15 -26
-300 -30
;
run;
Check normality and collinearity of SBPbefore & SBPafter and lossa & lossj
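A minimal sketch of the normality check for the blood-pressure pairs (for a paired test, the distribution of the difference diff_BP is what matters most; the same pattern applies to lossa, lossj, and diff in the paired data set):
PROC UNIVARIATE DATA = pressure normal plot;
var SBPbefore SBPafter diff_BP;
histogram SBPbefore SBPafter diff_BP / normal;
RUN;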
data nonparametric;
input loss diet $;
datalines;
+4 atkins
+3 atkins
0 atkins
-3 atkins
-4 atkins
-5 atkins
-11 atkins
-14 atkins
-15 atkins
-300 atkins
-8 jenny
-10 jenny
-12 jenny
-16 jenny
-18 jenny
-20 jenny
-21 jenny
-24 jenny
-26 jenny
-30 jenny
;
run;
proc ttest data = "c:\hsb2";
class female;
var write;
run;
Check normality of write and loss
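A minimal sketch for loss by diet (write lives in the external hsb2 data set and can be checked the same way):
PROC UNIVARIATE DATA = nonparametric normal plot;
class diet;
var loss;
histogram loss / normal;
RUN;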
data Clover;
input Strain $ Nitrogen @@;
datalines;
3DOK1 19.4 3DOK1 32.6 3DOK1 27.0 3DOK1 32.1 3DOK1 33.0
3DOK5 17.7 3DOK5 24.8 3DOK5 27.9 3DOK5 25.2 3DOK5 24.3
3DOK4 17.0 3DOK4 19.4 3DOK4 9.1 3DOK4 11.9 3DOK4 15.8
3DOK7 20.7 3DOK7 21.0 3DOK7 20.5 3DOK7 18.8 3DOK7 18.6
3DOK13 14.3 3DOK13 14.4 3DOK13 11.8 3DOK13 11.6 3DOK13 14.2
COMPOS 17.3 COMPOS 19.4 COMPOS 19.1 COMPOS 16.9 COMPOS 20.8
;
run;
Check normality of Nitrogen
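A minimal sketch, stratified by Strain:
PROC UNIVARIATE DATA = Clover normal plot;
class Strain;
var Nitrogen;
histogram Nitrogen / normal;
RUN;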