Exploring Data With Base SAS® Software
Thomas J. Winn Jr., Texas State Comptroller's Office, Austin, Texas
Abstract

Statistical methods are tools which are used to summarize and analyze
data. Exploratory data analysis is the application of graphical and
statistical techniques to discover the structure of data. The goal of
exploratory data analysis is to characterize the data and to reveal
fundamental relationships among them. It is quick, dynamic, and highly
interactive. Furthermore, exploratory data analysis is not just for use
by professional statisticians; the methods also are used by scientists,
engineers and many other types of researchers. This paper explains how
to produce and to interpret scatter plots, histograms, stem-and-leaf
plots, box-and-whisker plots, and various descriptive statistics, using
Base SAS® software.

Introduction

Originally, the SAS System was a combination of programs for performing
statistical analysis on data. Since then, the SAS System has grown in
ways which have made it useful to non-statisticians, as well as having
increased its value to its earliest audience of users. Most SAS users,
including many without substantial expertise in statistics, have an
occasional need to utilize some of the statistical and graphical
capabilities of the SAS System. They want to examine their data, but
without becoming involved in advanced statistical procedures. The
present paper is addressed to this group of users.

Exploratory data analysis is the interactive use of statistical and
graphical procedures to uncover the composition of data; that is, to
identify their general characteristics and relationships. Data
exploration is a process in which raw data become comprehensible
information through a sequence of activities, each of which must be
adapted according to the outcomes of the preceding steps. It is noted
that SAS software now includes JMP®, SAS/INSIGHT®, and SAS/LAB®, which
were expressly designed to implement exploratory techniques. However,
many SAS users do not have access to these components. This paper is
intended to present some of the basic tools for data exploration using
elements of Base SAS software, whether they are used in a
noninteractive mode (such as batch) or in an interactive mode (such as
SAS Display Manager).

Types of Data

To begin with, there are two basic levels of data: qualitative, and
quantitative. This is not the same thing as the difference between
character and numeric variables in SAS. Qualitative types of
measurement may be either character or numeric; the essential idea is
that they involve some type of mutually exclusive, categorical
classification, which may or may not possess an inherent order. Numbers
may be used qualitatively as coded values for names, but arithmetic
with the values would be meaningless. Examples of qualitative types of
measurement include taxonomic names (such as values for sex, race,
region, political preference, etc.), or ordinal scales (ordered
categories such as: always - sometimes - never, first - second -
third - fourth - fifth, etc.). Quantitative types of measurement use
numbers as cardinal magnitudes.

With qualitative types of data, the data analysis methods are mostly
limited to frequency tables, bar charts, and pie charts. However,
quantitative types of measurement permit the use of a greater variety
of tools.

Overview of Descriptive Statistics

The goal of data exploration is to comprehend the distribution of the
values of the variables which comprise the data, and to identify some
of the important ways in which the variables appear to be related. Data
summarization leans heavily on statistical measures which pertain to
central tendency, dispersion, and shape of the data distribution, as
well as on a few graphical techniques for data visualization.

Central tendency refers to a typical value from the distribution. Three
commonly-used measures of central tendency are the arithmetic mean (or
average value), the median (the middle-most value), and the mode (the
value which occurs most frequently). In the case of qualitative data,
the mode is commonly used as the central tendency measure.

Dispersion refers to the spread of the data values, usually with
respect to a particular measure of central tendency. Some measures of
dispersion are the range, the variance, the standard deviation, the
coefficient of variation, and the interquartile range. Range is the
difference between the smallest and the largest data values. The
standard deviation and the variance both indicate the variability (or
the amount of concentration) of the data values with respect to the
mean. The coefficient of variation is 100 * (standard deviation / mean).
The interquartile range is the distance between the particular data
value below which the bottom one-fourth of the data values are found
(first quartile, Q1), and the particular data value above which the top
one-fourth of the data values are located (third quartile, Q3). There
is no measure of dispersion for non-ordinal, qualitative types of data.

In addition to central tendency and dispersion, other properties which
are useful in describing the shape of a distribution are
skewness, kurtosis, and the presence of outliers, gaps and multiple
peaks. Skewness is a measure of the symmetry (or lack thereof) of the
distribution. In a perfectly symmetrical distribution, the mean, the
median, and the mode coincide. A distribution is said to be skewed
whenever the data values are clustered more at one end than at the
other, so that its plot seems to lean unevenly towards one side. A
skewness measure would be zero when the distribution is symmetric; it
would be positive when more data points are clustered at the lower end
than at the upper end (the mean and the median are greater than the
mode); and it would be negative when more data points are clustered at
the upper end than at the lower end (the mean and the median are less
than the mode). Kurtosis is a measure of the flatness of the
distribution. A very large kurtosis number would mean that some of the
data values are much farther away from the mean than most of the other
data values; when this happens, the distribution is said to have a
"heavy tail". Outliers are data values which are far away from the rest
of the data.

Preliminary Examination of Data

Data analysis begins with a cursory review of the raw data. Do the
values of the variables correspond to quantitative or qualitative types
of measurements? Do the data values conform to reasonable expectations?
Do any of the values contain obvious typographical errors, or appear to
be out-of-range? Does it seem as though certain observations may be
missing? Resolve any apparent data errors before proceeding to the next
step.

After reading the raw data into a SAS data file, carefully examine the
SAS log. The SAS System will identify many data errors which may have
escaped the prior notice of the data analyst. The notes and error
messages generated by the SAS System upon the creation of a SAS data
set are very instructive.

It is important to document the newly-created data set, and to begin
examining the elemental properties of the data. It is a good idea to
use a PROC CONTENTS step (or, alternatively, a CONTENTS statement in a
PROC DATASETS step) together with a PROC PRINT (or PROC FSVIEW) step,
whenever data are introduced to the SAS System. If the data are so
numerous as to make it impracticable to review a complete listing of
the data, then use the RANUNI function to select a random sample of
observations from the data, and run the PRINT procedure on the sample.

PROC CONTENTS DATA=data-set-name;
PROC PRINT DATA=data-set-name;
   WHERE RANUNI(0) <= 0.01;
   TITLE "1% SAMPLE FROM DATA SET";

Review the information generated by the CONTENTS and the PRINT (or
FSVIEW) procedures. Do the data attributes (variable name, type,
length, informat, format, label) agree with what had been anticipated?
Do the numeric variables take on only a limited number of distinct
values, or do they have a very large number of values? Are there any
aberrant values; that is, do the data deviate unreasonably from the
typical pattern? If there are any data errors, substitute corrected
values for them.

Now, run PROC MEANS to calculate simple descriptive statistics for
numeric variables in a SAS data set. If no particular statistics are
specified as options on the PROC MEANS statement then, for each numeric
variable, the variable name, number of observations, mean, standard
deviation, minimum value, and maximum value will be reported.

PROC MEANS DATA=data-set-name;
   VAR variables-list;

If desired, PROC MEANS also will report the variance, the coefficient
of variation, the range, the skewness, and the kurtosis (and more, if
desired). If observations can be grouped together using certain
variables, then a CLASS statement can be used to obtain summary
statistics across each classification grouping (without sorting the
data!).

PROC MEANS DATA=data-set-name
      N MIN MAX RANGE MEAN VAR
      STD CV SKEWNESS KURTOSIS;
   VAR variables-list;
   CLASS class-variables-list;

Run PROC FREQ to obtain a one-way frequency table of counts and
percentages. This report is particularly helpful for analyzing
qualitative types of data.

PROC FREQ DATA=data-set-name;
   TABLES variables-list;

Also, run PROC CHART to produce a visual summary of the data. Printer
graphics may not be presentation quality, but they do not require much
time or special equipment, and their results can be very powerful. PROC
CHART can be used for displaying both qualitative and quantitative
types of data.

PROC CHART DATA=data-set-name;
   VBAR variables-list / option;
or
PROC CHART DATA=data-set-name;
   HBAR variables-list / option;

In using PROC CHART, the data analyst may want to take control of the
horizontal axis, to ensure that gaps in numeric values are noted, and
of the vertical axis, to facilitate comparisons between similar graphs.

PROC CHART DATA=data-set-name;
   VBAR variables-list / MIDPOINTS=xx TO yy
        BY zz AXIS=uu vv;
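
As a language-neutral illustration of what MIDPOINTS=xx TO yy BY zz asks
for, the following minimal Python sketch (not SAS, and not a description
of PROC CHART's internal algorithm) tallies values into equal-width
intervals centered on the requested midpoints. The data values, the
function name tally_by_midpoints, and the rule that each value goes to
its nearest midpoint (with out-of-range values folded into the end
intervals) are assumptions made for this example. Empty intervals are
kept, so that a gap in the data remains visible.

```python
import math

def tally_by_midpoints(values, first, last, step):
    """Count values in equal-width intervals centered on first, first+step, ..., last."""
    n_bins = int(round((last - first) / step)) + 1
    midpoints = [first + i * step for i in range(n_bins)]
    counts = [0] * n_bins
    for v in values:
        # Assumed rule: nearest midpoint; out-of-range values go to an end interval.
        i = math.floor((v - first) / step + 0.5)
        i = min(max(i, 0), n_bins - 1)
        counts[i] += 1
    return list(zip(midpoints, counts))

# Hypothetical data with a gap in the middle of the range.
values = [41, 42, 43, 44, 46, 55, 56, 57, 58, 59]
histogram = tally_by_midpoints(values, first=40, last=60, step=5)

# Print a crude horizontal bar chart, one row per midpoint (empty rows included).
for midpoint, count in histogram:
    print(f"{midpoint:>4} | {'*' * count}")
```

Fixing the midpoints in advance, rather than letting them follow the
data, is what keeps the empty interval on the chart and makes charts of
different variables comparable.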
A histogram is a particular bar chart in which the range of data values
is divided into intervals of equal length, and in which bars are used
to represent the frequency of the observations in each interval. The
VBAR syntax shown above will produce a histogram. In the VBAR
statement, an alternative to the MIDPOINTS= option would be to use the
LEVELS= option. In either case, the number of intervals should be
chosen so as to display just enough detail as will be meaningful to the
data analyst, without being overwhelming.

If observations can be grouped together using certain variables, then
it also will be useful to picture the data using a block chart.

PROC CHART DATA=data-set-name;
   BLOCK variable / GROUP=class-variable;

Data Exploration Using PROC UNIVARIATE

The most useful exploratory procedure is PROC UNIVARIATE. This
comprehensive procedure can be used to generate descriptive statistics,
a frequency table, a list of extreme values, some interesting plots,
and a comparison of the cumulative frequency distribution with a normal
distribution. To produce box-and-whisker plots and stem-and-leaf
displays, invoke PROC UNIVARIATE using the PLOT option.

PROC UNIVARIATE DATA=data-set-name
      FREQ PLOT;
   VAR variables-list;

A stem-and-leaf display is one way to convey the shape of the
distribution, as well as the value of each observation of the variable.
A stem-and-leaf display is similar to a horizontal bar chart, except
that instead of using bars, the next digit of the number after the
"stem" is used. To interpret a stem-and-leaf display, follow the
instructions printed beneath the display.

Box-and-whisker plots (also referred to as "boxplots" and "schematic
plots") present a visual representation of some of the more important
summary statistics. The top and bottom of the box describe the
interquartile range [the difference between the 25th (Q1) and the 75th
(Q3) percentiles] of the distribution. The horizontal line inside the
box represents the median value [the 50th percentile (Q2)], and the
plus sign indicates the mean. The vertical lines emanating from the box
(called "whiskers") extend up to 1.5 times the interquartile range
[that is, from Q1 down to Q1 - 1.5*(Q3 - Q1), and from Q3 up to
Q3 + 1.5*(Q3 - Q1)]. A data value which is more than 1.5 interquartile
ranges but within 3 interquartile ranges is represented by a zero. Data
values which exceed 3 interquartile ranges are represented by
asterisks.

In Version 6 of SAS, if the magnitudes of the variables are comparable,
full-page, side-by-side boxplots will be produced whenever PROC
UNIVARIATE is invoked with the PLOT option and with a BY statement.

PROC UNIVARIATE DATA=data-set-name FREQ
      PLOT;
   VAR variables-list;
   BY class-variable;

With a little practice, the data analyst can use these special plots to
visualize the essential features of a distribution of data values.

Example #1

Consider the following data (Friendly, pp. 4-8):

DATA FRIENDLY;
   INPUT I SET1 SET2 SET3 SET4;
   CARDS;
 1 40.50 41.64 35.00 44.50
 2 41.50 58.36 37.00 45.00
 3 42.50 42.29 42.00 45.50
 4 43.50 57.71 53.90 46.00
 5 44.50 42.93 53.00 46.50
 6 45.50 57.07 50.60 47.00
 7 46.50 43.57 50.50 47.50
 8 47.50 56.43 53.80 48.00
 9 48.50 44.21 52.50 48.50
10 49.50 55.79 53.60 49.00
11 50.50 44.86 50.40 49.50
12 51.50 55.14 52.20 50.00
13 52.50 45.50 52.70 50.50
14 53.50 54.50 52.40 51.00
15 54.50 46.14 52.70 51.50
16 55.50 53.86 51.40 52.00
17 56.50 46.79 53.80 52.50
18 57.50 53.21 52.90 53.00
19 58.50 47.43 56.81 72.71
20 59.50 52.57 42.79 49.79
;

The four variables, SET1-SET4, have the interesting property of sharing
the same mean (μ=50) and standard deviation (σ=5.92), yet the
distributions of their values certainly do not appear to be the same.
What are their differences?

First of all, here is the output from PROC MEANS for this SAS data set:

Variable    N          Mean       Std Dev       Minimum       Maximum
SET1       20    50.0000000     5.9160798    40.5000000    59.5000000
SET2       20    50.0000000     5.9175546    41.6400000    58.3600000
SET3       20    50.0000000     5.9159917    35.0000000    56.8100000
SET4       20    50.0000000     5.9162497    44.5000000    72.7100000
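
The box-and-whisker conventions described earlier can be checked by hand
for SET4, the variable with the largest maximum. The following is a
minimal Python sketch (not SAS). The function name boxplot_symbol is
hypothetical, and the quartiles come from Python's
statistics.quantiles with method="inclusive", which is an assumption:
quartile conventions differ between packages, so PROC UNIVARIATE may
report slightly different values for Q1 and Q3.

```python
import statistics

# SET4 from Example #1: eighteen evenly spaced values plus 72.71 and 49.79.
set4 = [44.5 + 0.5 * i for i in range(18)] + [72.71, 49.79]

# Quartiles under one common convention (an assumption; see the note above).
q1, q2, q3 = statistics.quantiles(set4, n=4, method="inclusive")
iqr = q3 - q1

def boxplot_symbol(x):
    """Classify a value by its distance outside the box, in interquartile ranges."""
    distance = max(q1 - x, x - q3, 0.0)
    if distance <= 1.5 * iqr:
        return "within whiskers"
    if distance <= 3.0 * iqr:
        return "0"   # more than 1.5 but within 3 interquartile ranges
    return "*"       # more than 3 interquartile ranges

print("Q1, median, Q3:", q1, q2, q3)
for x in (49.79, 72.71):
    print(x, "->", boxplot_symbol(x))
```

Under these assumptions only 72.71 falls outside the whiskers; 49.79
lies inside the box and cannot be seen on a boxplot at all.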
We notice that the maxima and minima for the four variables differ from
one another.

Here are the stem-and-leaf displays and box-and-whisker plots for these
data, obtained from PROC UNIVARIATE:

[Figure: PROC UNIVARIATE stem-and-leaf displays and boxplots for SET1,
SET2, SET3, and SET4. Note printed beneath the display: Multiply
Stem.Leaf by 10**+1]

Observe that boxplots provide very little information about the data
values in the vicinity of the distribution's middle values. Some
experienced data analysts have learned to compare the length of the
whiskers to the length of the box; whiskers which are too short may be
a warning of anomalies near the center.

PROC UNIVARIATE also generates statistical measures which pertain to
the central tendency, dispersion, and shape of the data distribution.
These statistics are not displayed here, due to lack of space, but they
are important elements of the data analysis.

Here are side-by-side boxplots for these data:

[Figure: side-by-side boxplots of the four variables; horizontal axis
labeled SET-NUMBER]

This example demonstrates a pitfall of relying too much on the mean and
standard deviation to characterize abnormal data: numerical summaries
of data can be misleading!

The data values for variable SET1 are uniformly distributed on the
interval [40.5, 59.5]. The mean and the median are identical, and the
distribution is symmetric (skewness=0). The negative kurtosis measure
indicates that the tails of this distribution are lighter than for a
normal distribution.

The observations of SET2 are distributed uniformly over two intervals,
with a substantial gap separating the two clusters of data. As with
SET1, the distribution is symmetric, and the kurtosis measure is
negative.

The data values for variable SET3 are distributed less evenly than
those of either SET1 or SET2. The mean and the median are distinct from
one another, and the negative skewness measure indicates that more data
points are clustered at the upper end than at the lower end of the
distribution. The positive kurtosis measure indicates that the tails of
this distribution are heavier than for a normal distribution. Indeed,
we notice that there are some small data values which are fairly
distant from the mean, compared to other data values.

SET4 has data values which are almost uniformly distributed over the
interval [44.5, 53.0]. We notice that there are two irregular values,
49.79 and 72.71. The outliers cause the mean to be greater than the
median. The positive skewness measure indicates that data values
located to the right of the mean are more spread out than the data
values to the left of the mean. The positive kurtosis measure (which is
larger than the kurtosis of SET3) denotes the heavy tail of this
distribution, which is attributable to the larger deviant value.
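
Since the shape statistics are not reproduced in the paper, the signs
quoted above can be checked with a minimal Python sketch (not SAS). It
uses the data of Example #1 and simple moment-based formulas for
skewness and excess kurtosis. These formulas are an assumption: SAS
applies small-sample adjustments, so its printed values would differ in
magnitude, although the signs and the ordering discussed in the text
should agree. The function name shape is hypothetical.

```python
import statistics

sets = {
    "SET1": [40.5 + i for i in range(20)],
    "SET2": [41.64, 58.36, 42.29, 57.71, 42.93, 57.07, 43.57, 56.43,
             44.21, 55.79, 44.86, 55.14, 45.50, 54.50, 46.14, 53.86,
             46.79, 53.21, 47.43, 52.57],
    "SET3": [35.00, 37.00, 42.00, 53.90, 53.00, 50.60, 50.50, 53.80,
             52.50, 53.60, 50.40, 52.20, 52.70, 52.40, 52.70, 51.40,
             53.80, 52.90, 56.81, 42.79],
    "SET4": [44.5 + 0.5 * i for i in range(18)] + [72.71, 49.79],
}

def shape(values):
    """Return (mean, sample std dev, moment skewness, moment excess kurtosis)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0   # zero for a normal distribution
    return mean, statistics.stdev(values), skewness, excess_kurtosis

results = {name: shape(values) for name, values in sets.items()}
for name, (mean, sd, skew, kurt) in results.items():
    print(f"{name}: mean={mean:.2f} sd={sd:.4f} skew={skew:+.3f} kurt={kurt:+.3f}")
```

All four variables share a mean of 50 and a standard deviation near
5.92, so only the last two columns of the printout separate them.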
Examining Relationships Between Variables

With quantitative data, it often is important to determine whether or
not a relation exists between two or more variables. And, if they are
related, it also is desirable to measure the strength of the
relationships among them. This would be useful, for example, if one was
trying to estimate the values of one variable from known or assumed
data values of other variables. A measure of the strength of the
relationship between two variables is the correlation coefficient,
which is a number between -1 and +1. Positive correlation coefficients
indicate a direct relationship, and negative coefficients indicate an
inverse relationship. You are cautioned that just because two variables
may be highly correlated, this does not imply that a cause-and-effect
relation necessarily exists between them.

PROC CORR will compute correlation coefficients between all pairs of
variables specified in the VAR list.

PROC CORR DATA=data-set-name;
   VAR variables-list;

Besides printing correlation coefficients for each pair of variables,
PROC CORR also determines associated significance probabilities for
each coefficient. These p-values are for testing the null hypothesis
that the variables actually have zero correlation.

A scatter plot is a graphic representation of the relationship between
a pair of quantitative variables. To create scatter plots with Base
SAS, PROC PLOT is used, with a PLOT statement for each pair of
variables.

PROC PLOT DATA=data-set-name;
   PLOT variable1*variable2='*';
   PLOT variable3*variable4='*';

Example #2

Consider the following data (from the SAS Sample Library, member
PLOTLAB2):

DATA CRIME;
   TITLE 'Crime Rates Per 100,000 Population by State';
   INPUT STATE $ 1-15 POSTCODE $
         MURDER RAPE ROBBERY
         ASSAULT BURGLARY LARCENY AUTO;
   CARDS;
Alabama        AL 14.2 25.2  96.8 278.3 1135.5 1881.9  280.7
Alaska         AK 10.8 51.6  96.8 284.0 1331.7 3369.8  753.3
Arizona        AZ  9.5 34.2 138.2 312.3 2346.1 4467.4  439.5
Arkansas       AR  8.8 27.6  83.2 203.4  972.6 1862.1  183.4
California     CA 11.5 49.4 287.0 358.0 2139.4 3499.8  663.5
Colorado       CO  6.3 42.0 170.7 292.9 1935.2 3903.2  477.1
Connecticut    CT  4.2 16.8 129.5 131.8 1346.0 2620.7  593.2
Delaware       DE  6.0 24.9 157.0 194.2 1682.6 3678.4  467.0
Florida        FL 10.2 39.6 187.9 449.1 1859.9 3840.5  351.4
Georgia        GA 11.7 31.1 140.5 256.5 1351.1 2170.2  297.9
Hawaii         HI  7.2 25.5 128.0  64.1 1911.5 3920.4  489.4
Idaho          ID  5.5 19.4  39.6 172.5 1050.8 2599.6  237.6
Illinois       IL  9.9 21.8 211.3 209.0 1085.0 2828.5  528.6
Indiana        IN  7.4 26.5 123.2 153.5 1086.2 2498.7  377.4
Iowa           IA  2.3 10.6  41.2  89.8  812.5 2685.1  219.9
Kansas         KS  6.6 22.0 100.7 180.5 1270.4 2739.3  244.3
Kentucky       KY 10.1 19.1  81.1 123.3  872.2 1662.1  245.4
Louisiana      LA 15.5 30.9 142.9 335.5 1165.5 2469.9  337.7
Maine          ME  2.4 13.5  38.7 170.0 1253.1 2350.7  246.9
Maryland       MD  8.0 34.8 292.1 358.9 1400.0 3177.7  428.5
Massachusetts  MA  3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan       MI  9.3 38.9 261.9 274.6 1522.7 3159.0  545.5
Minnesota      MN  2.7 19.5  85.9  85.8 1134.7 2559.3  343.1
Mississippi    MS 14.3 19.6  65.7 189.1  915.6 1239.9  144.4
Missouri       MO  9.6 28.3 189.0 233.5 1318.3 2424.2  378.4
Montana        MT  5.4 16.7  39.2 156.8  804.9 2773.2  309.2
Nebraska       NE  3.9 18.1  64.7 112.7  760.0 2316.1  249.1
Nevada         NV 15.8 49.1 323.1 355.0 2453.1 4212.6  559.2
New Hampshire  NH  3.2 10.7  23.2  76.0 1041.7 2343.9  293.4
New Jersey     NJ  5.6 21.0 180.4 185.1 1435.8 2774.5  511.5
New Mexico     NM  8.8 39.1 109.6 343.4 1418.7 3008.6  259.5
New York       NY 10.7 29.4 472.6 319.1 1728.0 2782.0  745.8
North Carolina NC 10.6 17.0  61.3 318.3 1154.1 2037.8  192.1
North Dakota   ND  0.9  9.0  13.3  43.8  446.1 1843.0  144.7
Ohio           OH  7.8 27.3 190.5 181.1 1216.0 2696.8  400.4
Oklahoma       OK  8.6 29.2  73.8 205.0 1288.2 2228.1  326.8
Oregon         OR  4.9 39.9 124.1 286.9 1636.4 3506.1  388.9
Pennsylvania   PA  5.6 19.0 130.3 128.0  877.5 1624.1  333.2
Rhode Island   RI  3.6 10.5  86.5 201.0 1489.5 2844.1  791.4
South Carolina SC 11.9 33.0 105.9 485.3 1613.6 2342.4  245.1
South Dakota   SD  2.0 13.5  17.9 155.7  570.5 1704.4  147.5
Tennessee      TN 10.1 29.7 145.8 203.9 1259.7 1776.5  314.0
Texas          TX 13.3 33.8 152.4 208.2 1603.1 2988.7  397.6
Utah           UT  3.5 20.3  68.8 147.3 1171.6 3004.6  334.5
Vermont        VT  1.4 15.9  30.8 101.2 1348.2 2201.0  265.2
Virginia       VA  9.0 23.3  92.1 165.7  986.2 2521.2  226.7
Washington     WA  4.3 39.6 106.2 224.8 1605.6 3386.9  360.3
West Virginia  WV  6.0 13.2  42.2  90.9  597.4 1341.7  163.3
Wisconsin      WI  2.8 12.9  52.2  63.7  846.9 2614.2  220.7
Wyoming        WY  5.4 21.9  39.7 173.9  811.6 2772.2  282.0
;

Now, here is the output from running PROC CORR against these data:

Crime Rates Per 100,000 Population by State

Correlation Analysis

7 'VAR' Variables: MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO

Simple Statistics

Variable    N        Mean     Std Dev         Sum     Minimum     Maximum
MURDER     50      7.4440      3.8668       372.2      0.9000     15.8000
RAPE       50     25.7340     10.7596      1286.7      9.0000     51.6000
ROBBERY    50       124.1     88.3486      6204.6     13.3000       472.6
ASSAULT    50       211.3       100.3     10565.0     43.8000       485.3
BURGLARY   50      1291.9       432.5     64595.2       446.1      2453.1
LARCENY    50      2671.3       725.9      133564      1239.9        4467
AUTO       50       377.5       193.4     18876.3       144.4        1140

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 50

            MURDER      RAPE   ROBBERY   ASSAULT  BURGLARY   LARCENY      AUTO
MURDER     1.00000   0.60122   0.48371   0.64855   0.38582   0.10192   0.06881
            0.0       0.0001    0.0004    0.0001    0.0057    0.4813    0.6349
RAPE       0.60122   1.00000   0.59188   0.74026   0.71213   0.61399   0.34890
            0.0001    0.0       0.0001    0.0001    0.0001    0.0001    0.0130
ROBBERY    0.48371   0.59188   1.00000   0.55708   0.63724   0.44674   0.59068
            0.0004    0.0001    0.0       0.0001    0.0001    0.0011    0.0001
ASSAULT    0.64855   0.74026   0.55708   1.00000   0.62291   0.40436   0.27584
            0.0001    0.0001    0.0001    0.0       0.0001    0.0036    0.0525
BURGLARY   0.38582   0.71213   0.63724   0.62291   1.00000   0.79212   0.55795
            0.0057    0.0001    0.0001    0.0001    0.0       0.0001    0.0001
LARCENY    0.10192   0.61399   0.44674   0.40436   0.79212   1.00000   0.44418
            0.4813    0.0001    0.0011    0.0036    0.0001    0.0       0.0012
AUTO       0.06881   0.34890   0.59068   0.27584   0.55795   0.44418   1.00000
            0.6349    0.0130    0.0001    0.0525    0.0001    0.0012    0.0

Notice the relatively large correlations between the pairs of
variables: BURGLARY & LARCENY, and ASSAULT & RAPE; and the relatively
small correlations between AUTO & MURDER, and LARCENY & MURDER.
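
For readers who want to see what a correlation coefficient measures, the
following minimal Python sketch (not SAS) computes the Pearson
coefficient directly from its definition. The function name pearson_r is
hypothetical. To keep the example short it uses only the first ten data
lines of Example #2 (Alabama through Georgia), so its results are not
the 50-state coefficients printed by PROC CORR, and it does not compute
the significance probabilities.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their own variation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# First ten states of Example #2 only (a subset chosen for brevity).
murder   = [14.2, 10.8, 9.5, 8.8, 11.5, 6.3, 4.2, 6.0, 10.2, 11.7]
burglary = [1135.5, 1331.7, 2346.1, 972.6, 2139.4,
            1935.2, 1346.0, 1682.6, 1859.9, 1351.1]
larceny  = [1881.9, 3369.8, 4467.4, 1862.1, 3499.8,
            3903.2, 2620.7, 3678.4, 3840.5, 2170.2]

r_burglary_larceny = pearson_r(burglary, larceny)
r_murder_larceny = pearson_r(murder, larceny)

print(f"BURGLARY & LARCENY (10 states): {r_burglary_larceny:.3f}")
print(f"MURDER & LARCENY   (10 states): {r_murder_larceny:.3f}")
```

Even in this small subset the burglary and larceny rates move together
much more closely than the murder and larceny rates do, which is the
same contrast that the full correlation table shows.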
Here are a couple of scatter plots which reflect the indicated
strength-of-relationship measures:

[Figure: two printer scatter plots against LARCENY (horizontal axis,
1000 to 5000): Plot of MURDER*LARCENY, symbol used is '*', and a plot
with BURGLARY on the vertical axis (250 to 2500). The plots carry the
notes "1 obs hidden" and "2 obs hidden".]

Conclusion

Base SAS software includes several easy-to-use graphical and
statistical procedures which can be used to summarize and analyze data.
The fundamental methods of exploratory data analysis can be used to
uncover the shape of a distribution of data values. In order to
comprehend a set of data values, it is not good enough to rely solely
on numerical summary statistics for central tendency and dispersion.
References
Michael Friendly (1991), SAS System for Statistical Graphics, First
Edition, Cary, NC: SAS Institute Inc.
SAS Institute Inc. (1990), SAS Procedures Guide, Version 6, Third
Edition, Cary, NC: SAS Institute Inc.
Sandra D. Schlotzhauer & Ramon C. Littell (1987), SAS System for
Elementary Statistical Analysis, Cary, NC: SAS Institute Inc.
John W. Tukey (1977), Exploratory Data Analysis, Reading, MA:
Addison-Wesley.

SAS, SAS/INSIGHT, SAS/LAB, and JMP are registered trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA
registration.