SAS PROC UNIVARIATE
PROC UNIVARIATE
This procedure computes statistics about the distribution of a numeric
variable. The default output includes the moments (among them the skewness
and kurtosis), tests for location, quantiles, and extreme observations.
Tests for normality such as the Shapiro-Wilk test, as well as a stem and
leaf plot, a box plot, and a normal probability plot, can be requested with
keywords in the PROC statement (Section 1). These plots are character plots.
High resolution plots of the histogram, the normal probability plot, and the
quantile-quantile plot (q-q plot) are discussed in Sections 2 and 3.
The general format of this procedure is:
PROC UNIVARIATE DATA=dataset <options>;
VAR variable1 variable2 etc.;
RUN;
If multiple variables are listed in the VAR statement, then the statistics and
plots are computed for each variable listed. If no VAR statement is used,
the statistics are computed on all the numeric variables in the data set.
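As a minimal sketch of the two cases just described, the following steps use the SASHELP.CLASS sample data set. That data set and its variables height and weight are an assumption for illustration (they ship with most SAS installations) and are not part of the examples in this document.

```sas
/* Sketch only: SASHELP.CLASS is assumed to be available. */

/* Two variables listed: a full set of tables is produced for each one. */
PROC UNIVARIATE DATA=sashelp.class;
  VAR height weight;
RUN;

/* No VAR statement: every numeric variable in the data set is analyzed. */
PROC UNIVARIATE DATA=sashelp.class;
RUN;
```

Omitting the VAR statement on a wide data set can produce a great deal of output, so listing the variables of interest is usually the safer choice.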
Example 1
This looks at the output of PROC UNIVARIATE when no additional
information is requested. Using the country data, we’ll look at the
population in 1992.
DATA country;
INFILE '/u2/example/Country.dat' FIRSTOBS=2 DLM='09'x;
INPUT cont $ country $ pop92 urban gdp lifeexpm lifeexpf
birthrat deathrat;
RUN;
PROC UNIVARIATE DATA=country;
VAR pop92;
RUN;
The output for this code follows:
Last Updated: September 3, 2009
The SAS System                          10:14 Tuesday, June 2, 2009   1

The UNIVARIATE Procedure
Variable: pop92

Moments
N                        122    Sum Weights              122
Mean              40.7485574    Sum Observations    4971.324
Std Deviation     134.819843    Variance            18176.39
Skewness           7.0532748    Kurtosis           53.137302
Uncorrected SS    2401917.47    Corrected SS      2199343.19
Coeff Variation   330.857953    Std Error Mean     12.206015
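Several entries in the Moments table are simple functions of the others, which gives a quick consistency check on the output. The following is a worked sketch using the values printed above; the symbols n (number of observations), s (standard deviation) and \bar{x} (mean) are notation introduced here, not SAS output labels.

```latex
\begin{align*}
\text{Variance}        &= s^2 = 134.819843^2 \approx 18176.39 \\
\text{Std Error Mean}  &= \frac{s}{\sqrt{n}} = \frac{134.819843}{\sqrt{122}} \approx 12.206 \\
\text{Coeff Variation} &= 100 \cdot \frac{s}{\bar{x}} = 100 \cdot \frac{134.819843}{40.7485574} \approx 330.86 \\
\text{Corrected SS}    &= (n-1)\, s^2 = 121 \times 18176.39 \approx 2199343.19
\end{align*}
```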
Basic Statistical Measures
    Location                    Variability
Mean     40.74856     Std Deviation          134.81984
Median    9.90000     Variance                   18176
Mode       .          Range                       1169
                      Interquartile Range     22.91100
Tests for Location: Mu0=0
Test           -Statistic-    -----p Value------
Student's t    t    3.3384    Pr > |t|    0.0011
Sign           M        61    Pr >= |M|   <.0001
Signed Rank    S    3751.5    Pr >= |S|   <.0001
Quantiles (Definition 5)
Quantile       Estimate
100% Max       1169.619
99%             886.362
95%             121.644
90%              59.640
75% Q3           27.351
50% Median        9.900
25% Q1            4.440
10%               2.376
5%                1.574
1%                1.082
0% Min            0.739
Extreme Observations
-----Lowest----     ------Highest-----
Value      Obs         Value       Obs
0.739       61       158.000        57
1.082        6       195.000        74
1.106       19       256.561        67
1.285       47       886.362        84
1.300       27      1169.619        68
 End of Example 1
1. Keywords in the PROC Statement
These keywords are put in the PROC statement.
ALPHA=
This sets the significance level α for the (1-α)100% confidence
limits. The default value is 0.05. This parameter only needs to
be used if α differs from 0.05.
CIPCTLDF
This requests a distribution free confidence interval for the
quantiles.
MU0=
This is set equal to the value μ0 used in the null hypothesis
tested in the output section labeled Tests for Location. The
default is μ0 = 0. To test the null hypothesis μ0 = 10, use MU0=10.
NORMAL
requests statistics to test if the distribution of the data is normal.
One of these tests is the Shapiro-Wilk test.
PLOTS
requests the plotting of the stem and leaf plot, box plot and a
normal probability plot. These plots are character plots and are
not high resolution graphics. The high resolution normal
probability plot is discussed in Section 2.
Example 2
The following is an example of using PROC UNIVARIATE on the jackknife
residuals. In order to further analyze any of the residuals, they must be in a
SAS data set. The OUTPUT statement in PROC REG was used to create the
data set rescount used in this procedure.
The code to read the data, perform the regression and put the residuals into
a data set follows:
DATA country;
INFILE '/u2/example/Country.dat' FIRSTOBS=2 DLM='09'x;
INPUT cont $ country $ pop92 urban gdp lifeexpm lifeexpf
birthrat deathrat;
RUN;
PROC REG DATA=country;
MODEL lifeexpf = birthrat;
OUTPUT OUT=rescount RSTUDENT=jackknife R=resid;
RUN;
QUIT;
To analyze the distribution of the jackknife residual, we use PROC
UNIVARIATE as below.
PROC UNIVARIATE DATA=rescount PLOTS NORMAL;
VAR jackknife;
RUN;
The following is the partial output which shows what is created by the
NORMAL keyword.
Tests for Normality
Test                  --Statistic---     -----p Value------
Shapiro-Wilk          W      0.977718    Pr < W       0.0422
Kolmogorov-Smirnov    D      0.094688    Pr > D      <0.0100
Cramer-von Mises      W-Sq   0.202382    Pr > W-Sq   <0.0050
Anderson-Darling      A-Sq   1.037651    Pr > A-Sq    0.0096
Below are the character plots created by the keyword PLOTS. SAS replaces
the stem and leaf plot with a horizontal bar chart if the number of values
in the data is sufficiently large.
Stem Leaf                              #  Boxplot
   3 4                                 1     0
   2
   2
   1 56777888                          8     |
   1 22222344                          8     |
   0 555566667777778999               18  +-----+
   0 0001111111122333333344444        25  |     |
  -0 4433333222221111111111000000     28  *--+--*
  -0 9987766655555                    13  +-----+
  -1 433111000                         9     |
  -1 997655                            6     |
  -2 310                               3     0
  -2 5                                 1     0
  -3 2                                 1     0
     ----+----+----+----+----+---
Normal Probability Plot
[Character plot not reproduced. The jackknife residual is on the vertical
axis, running from -3.25 to 3.25, and the normal quantiles are on the
horizontal axis, running from -2 to +2. Asterisks (*) mark the data values
and plus signs (+) mark the reference line.]
 End of Example 2
Example 3
In this example the 99% confidence intervals for the quantiles are requested,
together with the test of location for μ0 = 9. The data for this example are
the country data, and the variable to be analyzed is the population in 1992.
DATA country;
INFILE '/u2/example/Country.dat' FIRSTOBS=2 DLM='09'x;
INPUT cont $ country $ pop92 urban gdp lifeexpm lifeexpf
birthrat deathrat;
RUN;
PROC UNIVARIATE DATA=country MU0=9 CIPCTLDF ALPHA=0.01;
VAR pop92;
RUN;
Setting MU0=9 changes the Tests for Location to test the null hypothesis
μ0 = 9.
Tests for Location: Mu0=9
Test           -Statistic-     -----p Value------
Student's t    t  2.601058     Pr > |t|    0.0105
Sign           M         3     Pr >= |M|   0.6510
Signed Rank    S    1321.5     Pr >= |S|   0.0006
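The Student's t statistic in this table can be checked by hand from the Moments table of Example 1. The following is a worked sketch; \bar{x} (mean), s (standard deviation) and n (number of observations) are notation introduced here, and the standard error 12.206015 is the Std Error Mean reported in Example 1.

```latex
\begin{align*}
t &= \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \\
\mu_0 = 0:\quad t &= \frac{40.7485574 - 0}{12.206015} \approx 3.3384 \\
\mu_0 = 9:\quad t &= \frac{40.7485574 - 9}{12.206015} \approx 2.6011
\end{align*}
```

Only the numerator changes with MU0=, which is why the t statistic falls from 3.3384 in Example 1 to 2.601058 here while the standard error stays the same.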
The CIPCTLDF and ALPHA=0.01 keywords change the Quantiles portion of the
UNIVARIATE output to add the confidence limits and order statistics.
Quantiles (Definition 5)
                           99% Confidence Limits     -------Order Statistics-------
Quantile       Estimate      Distribution Free       LCL Rank   UCL Rank   Coverage
100% Max       1169.619
99%             886.362     256.561    1169.619        120        122        58.26
95%             121.644      59.640    1169.619        110        122        98.99
90%              59.640      44.149     256.561        102        120        99.02
75% Q3           27.351      17.631      56.386         80        105        99.04
50% Median        9.900       7.515      16.095         47         76        99.16
25% Q1            4.440       2.792       6.828         18         43        99.04
10%               2.376       1.106       3.187          3         21        99.02
5%                1.574       0.739       2.376          1         13        98.99
1%                1.082       0.739       1.106          1          3        58.26
0% Min            0.739
 End of Example 3
2. Histogram
A histogram of the variable can also be requested using the statement:
HISTOGRAM;
This statement requests a histogram of the variables listed in the VAR
statement. The histogram is a high resolution plot and is displayed in the
SAS Graph window.
To have a normal distribution curve fitted on the histogram for comparison,
the keyword NORMAL can be put after a slash. The curve uses the mean and
standard deviation estimated from the data.
HISTOGRAM/NORMAL;
For other distributions, see the SAS Help and Documentation from the SAS
Help Menu or the SAS online documentation
(http://support.sas.com/91doc/docMainpage.jsp).
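As one illustration of fitting a different distribution, the sketch below overlays a lognormal curve on the histogram of pop92, which the Example 1 output shows to be strongly right-skewed. It assumes the country data set from Example 1 already exists. The choice of the lognormal distribution is an assumption made here for illustration, not a recommendation from this document.

```sas
/* Sketch only: assumes the country data set built in Example 1. */
PROC UNIVARIATE DATA=country;
  VAR pop92;
  /* LOGNORMAL requests a fitted lognormal density curve;    */
  /* its parameters are estimated from the data by default.  */
  HISTOGRAM pop92 / LOGNORMAL;
RUN;
```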
Example 4
Using the data set rescount created in Example 2 above, the following code
requests the analysis of the variable jackknife and a histogram.
PROC UNIVARIATE DATA=rescount;
VAR jackknife;
HISTOGRAM / NORMAL;
RUN;
The histogram with the normal density curve is displayed in Figure 1.
Figure 1 Histogram of Jackknife Residuals
 End of Example 4
3. Normal Probability Plot and Quantile-Quantile Plot
Two other plots which are used to check whether data are normally distributed
are the normal probability plot and the quantile-quantile plot (q-q plot).
The normal probability plot displays the ordered data against the percentiles
of a normal distribution. The variable to be tested is plotted on the y-axis
and the percentiles are plotted on the x-axis. To request this plot, the
following statement is placed in the PROC UNIVARIATE step.
PROBPLOT/NORMAL(MU=EST SIGMA=EST);
This will create a normal probability plot with a reference line.
Another plot used in checking the normality of the residuals is the q-q plot.
As with the normal probability plot, SAS plots the residual on the y-axis and
the normal quantiles on the x-axis. The q-q plot is requested using the
following statement.
QQPLOT/NORMAL(MU=EST SIGMA=EST);
This statement requests a q-q plot along with a reference line.
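MU=EST and SIGMA=EST tell SAS to draw the reference line from the sample mean and standard deviation. As a minimal sketch of the alternative, the step below supplies fixed values instead, so that the points are compared against a standard normal distribution. It assumes the rescount data set from Example 2 already exists. Comparing jackknife residuals to MU=0 and SIGMA=1 is an assumption made here for illustration.

```sas
/* Sketch only: assumes the rescount data set created in Example 2. */
PROC UNIVARIATE DATA=rescount;
  VAR jackknife;
  /* Reference line for a normal distribution with mean 0 and   */
  /* standard deviation 1, rather than values estimated from    */
  /* the data.                                                   */
  QQPLOT jackknife / NORMAL(MU=0 SIGMA=1);
RUN;
```

With fixed parameters, systematic departures of the points from the line can reflect a shift in location or scale as well as non-normal shape, so the two versions of the plot answer slightly different questions.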
Example 5
Continuing from Example 2, the following PROC UNIVARIATE step performs
the analysis of the jackknife residual and requests the normal probability
plot and the q-q plot.
PROC UNIVARIATE DATA=rescount;
VAR jackknife;
PROBPLOT/NORMAL(MU=EST SIGMA=EST);
QQPLOT/NORMAL(MU=EST SIGMA=EST);
RUN;
The next 2 figures are the plots generated by this request.
Figure 2 Normal Probability Plot of Jackknife Residuals
Figure 3 Quantile-Quantile Plot of Jackknife Residuals
 End of Example 5