Beginning Tutorials
GETTING TO KNOW YOUR DATA USING SAS
Felicia A. Borglum
Arthur D. Little, Inc., Cambridge, MA
Abstract
Before you begin any data analysis, you need to know the
overall structure, content, and context of the data. At a
minimum, this requires an in-depth look at variable
definitions and ranges of values, an understanding of the
underlying probability density functions, and a strategy to
handle missing values. Simple descriptive statistics
can display data in an easy-to-understand manner. This paper
is intended for those with a minimal background in statistics
and a desire to "get their hands dirty" using SAS®
procedures to perform exploratory data analysis.
Introduction
The topic of simple descriptive statistics is an
all-encompassing one. Many statistical textbooks have been
written on this subject. One main goal the authors would
like to get across to their readers, whoever they may be
(engineers, researchers, consultants, academicians, and
students), is that simple descriptive statistics play a crucial
role in data analysis. This paper gives a brief overview
of some basic tools the SAS® System has to offer to
complete this task.
Definition of Variables
Before examining your data, you must define the variables of
interest. There are two types of quantitative variables that a
researcher can count or measure: discrete and continuous
variables. Examples of discrete variables include
the number of tomatoes in a garden plot, the
number of heads achieved in 10 coin tosses, and categorical
variables such as sex, race, or hair color. Discrete variables
have a finite number of values. Examples of continuous
variables are the temperature at a given point in time, the
distance between two cities, the height of tomato plants, and
the weight of a tomato. These variables can take on virtually
an infinite number of values.
The SAS® System has several procedures designed for
conducting simple statistical analysis. For discrete data,
PROC FREQ and PROC CHART are helpful tools in
examining categorical data. For continuous data, procedures
such as PROC UNIVARIATE, PROC MEANS,
PROC SUMMARY, PROC TABULATE, and
PROC CORR are available. Highlighted in this paper are
examples of several of these procedures using real data and a
brief interpretation of the output.
One of the most widely used procedures for examining
discrete data is the frequency procedure (PROC FREQ).
Let's say, for example, you have just received a data tape
from a researcher on transactions of cable customers. You
wish to know which types of transactions the
cable company has recorded for these customers over a
one-year period and to determine their activity status. The
researcher supposedly provided you with a "clean" data tape.
An ideal way to view these customers' activity is with
frequency distributions.
    PROC FREQ data=transact;
    TITLE 'Initial Frequencies';
    TABLES type status;
    RUN;
NESUG '92 Proceedings
FIGURE 1a.
Initial Frequencies

                                 CUMULATIVE   CUMULATIVE
    TYPE   FREQUENCY   PERCENT    FREQUENCY      PERCENT
    ----------------------------------------------------
       .           6         .            .            .
      37        9282      38.3         9282         38.3
      46        9885      40.8        19167         79.0
      87        1599       6.6        20766         85.6
      99        3491      14.4        24257        100.0

FIGURE 1b.
Initial Frequencies

                                 CUMULATIVE   CUMULATIVE
    STATUS FREQUENCY   PERCENT    FREQUENCY      PERCENT
    ----------------------------------------------------
               42160         .            .            .
    ACC           36       0.0           36          0.0
    ACI           21       0.0           57          0.0
    ACR           51       0.0          108          0.0
    ACT       185786      69.9       185894         69.9
    ACU            2       0.0       185896         69.9
    ACV           13       0.0       185909         69.9
    ATC            4       0.0       185913         69.9
    ATT            2       0.0       185915         69.9
    ATV            4       0.0       185919         69.9
    DIA            2       0.0       185921         69.9
    DIC           35       0.0       185956         69.9
    DIG            1       0.0       185957         69.9
    DID            2       0.0       185959         69.9
    DIR            4       0.0       185963         69.9
    DIS        79111      29.8       265074         99.7
    DIX            1       0.0       265075         99.7
    INI            4       0.0       265079         99.7
    INT          794       0.3       265873        100.0
    INV            1       0.0       265874        100.0
From the results in Figures 1a and 1b, we can see that 6 of
our observations contain no transaction 'type' description.
The other 24,257 observations break down into the
following codes: '37', '46', '87', and '99'. The activity status
seems to be more of a problem for the researcher. The
majority of our codes are either blank, 'ACT', 'DIS', or
'INT'. Further discussions with the researcher indicate that
the remaining codes are mistakes in data entry that his
department analyst must resolve.
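One way to hand the suspect records back to the researcher is to subset them first. The sketch below is an illustration rather than the author's method. It assumes STATUS is a character variable and that blank, 'ACT', 'DIS', and 'INT' are the only acceptable values; the data set name SUSPECT and the title text are hypothetical.

```sas
/* Sketch only: the list of valid codes is an assumption taken   */
/* from the discussion of Figure 1b; SUSPECT is a made-up name.  */
data suspect;
    set transact;
    /* Keep only records whose status is not an expected value. */
    if status not in (' ', 'ACT', 'DIS', 'INT');
run;

proc freq data=suspect;
    title 'Unexpected Status Codes';
    tables status;
run;
```

The resulting frequency table lists only the codes that need to be resolved, which is easier to review than the full listing.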
Secondly, we could use the CHART procedure to
further describe graphically the values of transaction type.
An example of the procedure commands and the output
follows.
    PROC CHART data=transact;
    TITLE 'Initial Frequency Bar Charts';
    HBAR TYPE;
    RUN;
FIGURE 2
Initial Frequency Bar Charts
FREQUENCY BAR CHART

                     CUM.               CUM.
    TYPE    FREQ     FREQ   PERCENT   PERCENT
    37      9282     9282     38.27     38.27
    46      9885    19167     40.75     79.02
    87      1599    20766      6.59     85.61
    99      3491    24257     14.39    100.00

[Horizontal bars not recoverable; the FREQUENCY axis is
marked at 0, 4000, and 8000.]
For describing the relationship between pairs of discrete
variables, PROC FREQ can display the data in a two-way
table. The example below shows the breakdown of data
collected on 2 discrete variables, PROFIT and STAFFING,
from a large questionnaire on construction jobs.
    PROC FREQ DATA=profit;
    TITLE 'Relationship between Profit and Staffing';
    TABLES PROFIT*STAFF / CHISQ;
    RUN;
The PROC FREQ statements above produce a two-way table between
the profit of a construction job and its staffing. Statistics
printed when the CHISQ option is requested include tests for
independence (the Pearson Chi-Square, the likelihood ratio
Chi-Square, and the Mantel-Haenszel Chi-Square) and other
measures of association such as the Phi coefficient, Cramer's
V, and the contingency coefficient. Similar to a Pearson
correlation, large significant values of a Chi-Square indicate a
relationship between the two variables tested; that is, levels
of one variable depend on levels of the other.
FIGURE 3
Relationship between PROFIT and STAFFING
TABLE OF PROFIT BY STAFF

    PROFIT(Profit or loss)    STAFF(Quality of staff)

    Frequency|
    Percent  |
    Row Pct  |
    Col Pct  |Poor    |4       |High    |  Total
    ---------+--------+--------+--------+
    NonEvent/|      0 |     11 |     23 |     34
    Profit   |   0.00 |  16.18 |  33.82 |  50.00
             |   0.00 |  32.35 |  67.65 |
             |   0.00 |  31.43 |  92.00 |
    ---------+--------+--------+--------+
    Event/   |      8 |     24 |      2 |     34
    Loss     |  11.76 |  35.29 |   2.94 |  50.00
             |  23.53 |  70.59 |   5.88 |
             | 100.00 |  68.57 |   8.00 |
    ---------+--------+--------+--------+
    Total           8       35       25       68
                11.76    51.47    36.76   100.00

    Frequency Missing = 3

STATISTICS FOR TABLE OF PROFIT BY STAFF

    Statistic                      DF     Value     Prob
    Chi-Square                      2    30.469    0.000
    Likelihood Ratio Chi-Square     2    36.755    0.000
    Mantel-Haenszel Chi-Square      1    29.878    0.000
    Phi Coefficient                       0.669
    Contingency Coefficient               0.556
    Cramer's V                            0.669

    Effective Sample Size = 68
    Frequency Missing = 3
    WARNING: 33% of the cells have expected counts less
             than 5. Chi-Square may not be a valid test.

In the example above, jobs are much more profitable when
quality of staff is higher. Jobs which recognize a loss usually
have moderate to low quality staff. A highly significant
Chi-Square indicates a strong relationship between these 2
variables, PROFIT and STAFFING. [It should also be noted here
that the data in this example were formatted (using PROC FORMAT)
for ease of use and display. The variable staffing was recorded on
a 7 pt. scale; the 3 levels noted above are combinations of the
following values: 'Poor' = 1, 2; '4' = 3, 4, 5; and 'High' = 6, 7.]
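The grouping of the 7-point staffing scale described for the profit-by-staffing table could be set up with PROC FORMAT along the following lines. This is a sketch under stated assumptions, not the author's actual code: the format name STAFFFMT is hypothetical, and the middle label is reproduced as '4' only because that is how it prints in the source. The EXPECTED option is added so that the expected cell counts behind the chi-square warning can be inspected.

```sas
/* Sketch only: STAFFFMT is a hypothetical format name; labels  */
/* follow the paper's description of the three staffing levels. */
proc format;
    value stafffmt 1-2 = 'Poor'
                   3-5 = '4'
                   6-7 = 'High';
run;

proc freq data=profit;
    title 'Relationship between Profit and Staffing';
    /* EXPECTED prints the expected count in each cell. */
    tables profit*staff / chisq expected;
    /* PROC FREQ groups the raw 1-7 values by formatted value. */
    format staff stafffmt.;
run;
```

Because the format is applied only at analysis time, the raw 7-point values stay intact in the data set and other groupings can be tried without recoding.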
Continuous variables, on the other hand, have more than a few
levels of values. Using the frequency procedures would not be an
efficient way to view continuous data. The UNIVARIATE
procedure provides basic descriptive statistics for continuous
variables and is an excellent way to view these types of data.
As an example, a researcher has asked you to look at data from
an experiment on cycles to failure of aircraft test panels. You are
somewhat familiar with the data and know that the raw data are
not normally distributed and that a log transformation will help
the data to achieve normality. [Note: satisfying the assumption
of normality is necessary when using parametric methods for
analysis such as regression (REG, STEPWISE, RSREG),
Analysis of Variance (ANOVA, GLM), and Discriminant
Analysis (DISCRIM).]
An example of PROC UNIVARIATE follows:
    PROC UNIVARIATE data=air normal plot;
    TITLE 'Univariates - Cycles & Log(Cycles)';
    VAR cycles lcycles;
    RUN;
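The log-transformed variable LCYCLES analyzed above has to exist in the data set before PROC UNIVARIATE is run. A minimal sketch of how it might be created in a DATA step follows. It assumes a raw input data set named RAWAIR (a hypothetical name) and an output data set matching the one read by PROC UNIVARIATE, and it uses a base-10 logarithm to match the "log-10 Cycles" label in the output.

```sas
/* Sketch only: RAWAIR is a hypothetical input data set name. */
data air;
    set rawair;
    /* Base-10 log of cycles to failure; left missing when */
    /* cycles is missing or not positive.                   */
    if cycles > 0 then lcycles = log10(cycles);
    label cycles  = 'Cycles to Failure'
          lcycles = 'Log-10 Cycles';
run;
```

Keeping both the raw and transformed variables in one data set lets a single PROC UNIVARIATE call show the two distributions side by side.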
FIGURE 4
UNIVARIATE PROCEDURE

Variable=CYCLES (Cycles to Failure)

    Moments
    N           27          Sum Wgts    27
    Mean        410042.6    Sum         11071149
    Std Dev     353378      Variance    1.249E11
    Skewness    1.38815     Kurtosis    1.248846
    USS         7.786E12    CSS         3.247E12
    CV          86.1809     Std Mean    68007.73
    T:Mean=0    6.029352    Prob>|T|    0.0001
    Sgn Rank    189         Prob>|S|    0.0001
    Num ^= 0    27
    W:Normal    0.833348    Prob<W      0.0004

    Quantiles(Def=5)
    100% Max    1333000     99%    1333000
     75% Q3      526700     95%    1210000
     50% Med     303800     90%    1097000
     25% Q1      134000     10%      80000
      0% Min      63000      5%      71500
                             1%      63000
    Range       1270000
    Q3-Q1        392700
    Mode          63000

    Extremes
    Lowest     Obs      Highest    Obs
      63000(   20)       662000(    7)
      71500(   24)       873600(    1)
      80000(   26)      1097000(    4)
      81500(   14)      1210000(    2)
     117500(   21)      1333000(    9)

    Stem Leaf             #
      12 13               2
      10 0                1
       8 7                1
       6 6                1
       4 71234            5
       2 3590157          7
       0 6788233889      10
    Multiply Stem.Leaf by 10**+5

[Box plot and normal probability plot for CYCLES not
recoverable.]

Variable=LCYCLES (log-10 Cycles)

    Moments
    N           27          Sum Wgts    27
    Mean        5.460311    Sum         147.4284
    Std Dev     0.381947    Variance    0.145884
    Skewness    -0.03099    Kurtosis    -0.83854
    USS         808.7979    CSS         3.792974
    CV          6.994971    Std Mean    0.073506
    T:Mean=0    74.28412    Prob>|T|    0.0001
    Sgn Rank    189         Prob>|S|    0.0001
    Num ^= 0    27
    W:Normal    0.96645     Prob<W      0.5362

    Quantiles(Def=5)
    100% Max    6.12        99%    6.12
     75% Q3     5.72        95%    6.08
     50% Med    5.48        90%    6.04
     25% Q1     5.12        10%    4.90
      0% Min    4.79         5%    4.85
                             1%    4.79
    Range       1.32549
    Q3-Q1       0.594459
    Mode        4.799341

    Extremes
    Lowest     Obs      Highest    Obs
       4.79(   20)         5.82(    7)
       4.85(   24)         5.94(    1)
       4.90(   26)         6.04(    4)
       4.91(   14)         6.08(    2)
       5.07(   21)         6.12(    9)

    Stem Leaf             #
      60 482              3
      58 24               2
      56 71223            5
      54 068947           6
      52 4495             4
      50 723              3
      48 501              3
      46 [illegible]
    Multiply Stem.Leaf by 10**-1

[Box plot and normal probability plot for LCYCLES not
recoverable.]
The UNIVARIATE procedure describes the range of values a
particular variable can take on. Referring to the output in
Figure 4, several descriptive statistics are described below.
1. N - The number of non-missing observations for the
variable CYCLES.
2. The MEAN, MEDIAN & MODE are all measures of
central tendency. The MEAN (or the arithmetic
average) is the most widely used measure of central
tendency.
After ordering the observations, the MEDIAN (50th
percentile) is the midpoint of the distribution.
The MODE is the value which has the maximum
density (or the value which occurs most frequently).
Under a bell-shaped curve, or a normal distribution,
MEAN = MEDIAN = MODE. For certain
distributions, the MEDIAN may be a more appropriate
statistic to describe the data values. It is much less
sensitive to extreme values.
3. Percentiles/Quartiles/Quantiles
When ordered, the data can be described using percentiles.
The p-th percentile is the data value below which p% of the
observations fall. Several noteworthy percentiles are:
0% - minimum
25% - Q1 (first quartile)
50% - Q2 (median, 50th percentile)
75% - Q3 (third quartile)
100% - maximum
4. Extremes
After ordering the observations lowest to highest, these
represent the 5 lowest and 5 highest values. This can be very
useful for identifying outliers.
5. Measures of Variability
These statistics measure the spread of the data.
RANGE - measures the distance between the minimum and
maximum
Q3 - Q1 - interquartile range
VARIANCE - SUM(X - XBAR)**2 / (N - 1), which measures the
squared distance from the sample mean
STD - standard deviation, sqrt(variance)
STD MEAN - standard deviation of the sample mean, also
referred to as the standard error of the mean
Other measures of variability are the corrected and uncorrected
sums of squares (CSS, USS), which are described in the SAS
manuals.
6. Skewness, Kurtosis
SKEWNESS measures the shape of the distribution on one
side of the distribution vs. the other. For asymmetric
distributions where the MEAN > MEDIAN, this indicates
positive skew (skewed to the right); for distributions where the
MEAN < MEDIAN, this indicates negative skew (skewed to
the left).
KURTOSIS measures the heaviness of the tails of the
distribution. Positive values indicate heavy tails; negative values
indicate lighter tails. For normal distributions, the KURTOSIS
= 0.
7, 8. Stem & Leaf Plot/Normality Plot
The PLOT option on PROC UNIVARIATE produces a
stem-and-leaf plot, a box plot, and a normal probability plot.
The stem-and-leaf plot is helpful when viewing the
distribution of the data. In larger data sets, this option produces a
histogram in its place.
The NORMAL option on PROC UNIVARIATE requests a test of
normality (the W:Normal statistic and its probability in Figure 4).
Points which fall off the diagonal line in a normal probability plot
indicate departures from normality. Alternative procedures can be
used to create normal probability plots. These will not be
discussed in this paper.
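As a worked check on the measures of variability in item 5, the formulas can be written out and compared against the CYCLES output in Figure 4. The symbols $x_i$, $\bar{x}$, and $s$ are introduced here for illustration and do not appear in the original output; the numeric values are the rounded ones printed in the figure, so the results agree only approximately.

```latex
\begin{align*}
s^2 &= \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}
     = \frac{\mathrm{CSS}}{N - 1}
     = \frac{3.247 \times 10^{12}}{26}
     \approx 1.249 \times 10^{11} \\
s &= \sqrt{s^2} \approx 353{,}400 \\
\text{Std Mean} &= \frac{s}{\sqrt{N}}
     \approx \frac{353{,}378}{\sqrt{27}}
     \approx 68{,}008 \\
\text{CV} &= 100 \cdot \frac{s}{\bar{x}}
     \approx 100 \cdot \frac{353{,}378}{410{,}042.6}
     \approx 86.18
\end{align*}
```

These match the Variance, Std Dev, Std Mean, and CV entries reported for CYCLES, which is a quick way to confirm how each statistic is defined.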
PROC PLOT/PROC CORR
PROC PLOT can be used to plot pairs of observations for
two variables in a data set. This procedure provides a very
valuable tool in data checking. The correlation procedure
(PROC CORR) produces correlation coefficients between
sets of variables. Combining these two procedures can be
very powerful in exploratory data analysis. Below is a
simple example of the combination of these two procedures.
    PROC CORR data=salary;
    TITLE 'Corr - Salary by Yrs. Exp';
    VAR ASALARY;
    WITH AEXP;
    RUN;

    PROC PLOT data=salary;
    TITLE 'Plotting Salary by Yrs. Exp';
    PLOT ASALARY*AEXP;
    RUN;
FIGURE 5
Corr - Salary by Years Experience
CORRELATION ANALYSIS

    'WITH' Variables:  AEXP
    'VAR' Variables:   ASALARY

    Simple Statistics
    Variable    N       Mean    Std Dev        Sum
    AEXP       83    20.2499    10.0919     1680.7
    ASALARY    83    26029.5    10847.0    2160450

    Simple Statistics
    Variable    Minimum    Maximum    Label
    AEXP         3.4300    63.1800    Academic Experience
    ASALARY      4070.0    74090.0    Academic Salary

    Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0
    / N = 83
                                ASALARY
    AEXP                        0.86095
    Academic Experience          0.0001

Salary by Years Experience
Plot of ASALARY*AEXP.  Legend: A = 1 obs, B = 2 obs, etc.
[Scatter plot not recoverable. Vertical axis: Academic Salary,
marked from 0 to 80000. Horizontal axis: Academic Experience,
marked from 0 to 80.]
Similar to the UNIVARIATE procedure, PROC CORR produces
simple statistics (N, MEAN, STD, SUM, MIN, and
MAX, as shown in Figure 5) and, in addition, calculates several
correlation coefficients. For continuous data, the most widely used is the
Pearson product-moment correlation (R). Other correlation
coefficients calculated are Kendall's tau-b, Spearman's rank
correlation, and the Hoeffding D-statistic. Each of these
correlation statistics measures the relationship between two
variables. The Pearson (R) correlation
coefficient takes values between -1 and 1.
A value of 1 (or
-1) indicates perfect positive (or negative) correlation; when
plotted, the points fall on a straight line. A value of 0
indicates no correlation, that is, no linear relationship between the two
variables. Plotting two variables with no correlation will yield
random scatter, and usually the points will be concentrated in a
circle in the center of the plot.
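To see the other coefficients mentioned above alongside the Pearson correlation, the corresponding options can be listed on the PROC CORR statement. A minimal sketch, reusing the SALARY data set and variables from the earlier example; the title text is an assumption:

```sas
/* Sketch only: requests all four coefficients named in the text. */
proc corr data=salary pearson spearman kendall hoeffding;
    title 'Corr - Salary by Yrs. Exp (all coefficients)';
    var asalary;
    with aexp;
run;
```

Comparing the Pearson and rank-based coefficients is a simple way to judge how much a few extreme points are driving the result.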
In our example, we see a strong correlation (R = .86) between
years of experience and salary in an academic field. Also
noted are 2 points which may be targeted as outliers. Further
investigation will be needed.
As we have seen in the example above, combining these two
procedures can help to identify outliers and trends in the
data and is often useful before running statistical
procedures such as regression, discriminant analysis, and
analysis of variance.
Conclusions
The above provides a brief description of some basic tools
useful in getting to know your data. The examples are by no
means the only way to view your data; however, they
provide straightforward, easy-to-understand methods for
doing preliminary data analysis.
References
Snedecor, George W. and Cochran, William G. Statistical
Methods, 7th Ed. Iowa State University Press, Ames, Iowa,
1980.
Ott, Lyman. An Introduction to Statistical Methods and Data
Analysis. Duxbury Press, N. Scituate, MA. 1977.
SAS® User's Guide: Basics, Version 5 Edition. Cary, NC.
1985.
SAS® is a registered trademark of SAS Institute, Inc., Cary,
NC.
The author would like to thank Mike Stockstill, a
statistician at the SAS Institute, Cary, NC. This talk is a
re-presentation of his talk given at the BASUG Meeting in the
Spring of 1991.