Exploring Data With Base SAS® Software

Thomas J. Winn Jr., Texas State Comptroller's Office, Austin, Texas

Abstract

Statistical methods are tools which are used to summarize and analyze data. Exploratory data analysis is the application of graphical and statistical techniques to discover the structure of data. The goal of exploratory data analysis is to characterize the data and to reveal fundamental relationships among them. It is quick, dynamic, and highly interactive. Furthermore, exploratory data analysis is not just for use by professional statisticians - the methods also are used by scientists, engineers and many other types of researchers. This paper explains how to produce and to interpret scatter plots, histograms, stem-and-leaf plots, box-and-whisker plots, and various descriptive statistics, using Base SAS® software.

Introduction

Originally, the SAS System was a combination of programs for performing statistical analysis on data. Since then, the SAS System has grown in ways which have made it useful to non-statisticians, as well as having increased its value to its earliest audience of users. Most SAS users, including many without substantial expertise in statistics, have an occasional need to utilize some of the statistical and graphical capabilities of the SAS System. They want to examine their data, but without becoming involved in advanced statistical procedures. The present paper is addressed to this group of users.

Exploratory data analysis is the interactive use of statistical and graphical procedures to uncover the composition of data; that is, to identify their general characteristics and relationships. Data exploration is a process in which raw data become comprehensible information through a sequence of activities, each of which must be adapted according to the outcomes of the preceding steps. It is noted that SAS software now includes JMP®, SAS/INSIGHT®, and SAS/LAB®, which were expressly designed to implement exploratory techniques. However, many SAS users do not have access to these components. This paper is intended to present some of the basic tools for data exploration using elements of Base SAS software, whether they are used in a noninteractive mode (such as batch) or in an interactive mode (such as SAS Display Manager).

Types of Data

To begin with, there are two basic levels of data: qualitative, and quantitative. This is not the same thing as the difference between character and numeric variables in SAS. Qualitative types of measurement may be either character or numeric; the essential idea is that they involve some type of mutually exclusive, categorical classification, which may or may not possess an inherent order. Numbers may be used qualitatively as coded values for names, but arithmetic with the values would be meaningless. Examples of qualitative types of measurement include taxonomic names (such as values for sex, race, region, political preference, etc.), or ordinal scales (ordered categories such as: always - sometimes - never, first - second - third - fourth - fifth, etc.). Quantitative types of measurement use numbers as cardinal magnitudes.

With qualitative types of data, the data analysis methods are mostly limited to frequency tables, bar charts, and pie charts. However, quantitative types of measurement permit the use of a greater variety of tools.

Overview of Descriptive Statistics

The goal of data exploration is to comprehend the distribution of the values of the variables which comprise the data, and to identify some of the important ways in which the variables appear to be related. Data summarization leans heavily on statistical measures which pertain to central tendency, dispersion, and shape of the data distribution, as well as on a few graphical techniques for data visualization.

Central tendency refers to a typical value from the distribution. Three commonly-used measures of central tendency are the arithmetic mean (or average value), the median (the middle-most value), and the mode (the value which occurs most frequently). In the case of qualitative data, the mode is commonly used as the central tendency measure.

Dispersion refers to the spread of the data values, usually with respect to a particular measure of central tendency. Some measures of dispersion are the range, the variance, the standard deviation, the coefficient of variation, and the interquartile range. Range is the difference between the smallest and the largest data values. The standard deviation and the variance both indicate the variability (or the amount of concentration) of the data values with respect to the mean. The coefficient of variation is 100·standard deviation/mean. The interquartile range is the distance between the particular data value below which the bottom one-fourth of the data values are found (first quartile, Q1), and the particular data value above which the top one-fourth of the data values are located (third quartile, Q3). There is no measure of dispersion for non-ordinal, qualitative types of data.

In addition to central tendency and dispersion, other properties which are useful in describing the shape of a distribution are skewness, kurtosis, and the presence of outliers, gaps and multiple peaks. Skewness is a measure of the symmetry (or lack thereof) of the distribution. In a perfectly symmetrical distribution, the mean, the median, and the mode coincide. A distribution is said to be skewed whenever the data values are clustered more at one end than at the other, so that its scatter plot seems to lean unevenly towards one side. A skewness measure would be zero when the distribution is symmetric; it would be positive when more data points are clustered at the lower end than at the upper end (the mean and the median are greater than the mode); and it would be negative when more data points are clustered at the upper end than at the lower end (the mean and the median are less than the mode). Kurtosis is a measure of the flatness of the distribution. A very large kurtosis number would mean that some of the data values are much farther away from the mean than most of the other data values; when this happens, the distribution is said to have a "heavy tail". Outliers are data values which are far away from the rest of the data.

Preliminary Examination of Data

Data analysis begins with a cursory review of the raw data. Do the values of the variables correspond to quantitative or qualitative types of measurements? Do the data values conform to reasonable expectations? Do any of the values contain obvious typographical errors, or appear to be out-of-range? Does it seem as though certain observations may be missing? Resolve any apparent data errors before proceeding to the next step.

After reading the raw data into a SAS data file, carefully examine the SAS log. The SAS System will identify many data errors which may have escaped the prior notice of the data analyst. The notes and error messages generated by the SAS System upon the creation of a SAS data set are very instructive.

It is important to document the newly-created data set, and to begin examining the elemental properties of the data. It is a good idea to use a PROC CONTENTS step (or, alternatively, a CONTENTS statement in a PROC DATASETS step) together with a PROC PRINT (or PROC FSVIEW) step, whenever data are introduced to the SAS System. If the data are so numerous as to make it impracticable to review a complete listing of the data, then use the RANUNI function to create a random selection of observations from the data, and use the PRINT procedure on the sample.

PROC CONTENTS DATA=data-set-name;
PROC PRINT DATA=data-set-name;
   WHERE RANUNI(0) <= 0.01;
   TITLE "1% SAMPLE FROM DATA SET";

Review the information generated by the CONTENTS and the PRINT (or FSVIEW) procedures. Do the data attributes (variable name, type, length, informat, format, label) agree with what had been anticipated? Do the numeric variables take on only a limited number of distinct values, or do they have a very large number of values? Are there any aberrant values; that is, do the data deviate unreasonably from the typical pattern? If there are any data errors, substitute corrected values for them.

Now, run PROC MEANS to calculate simple descriptive statistics for numeric variables in a SAS data set. If no particular statistics are specified as options on the PROC MEANS statement then, for each numeric variable, the variable name, number of observations, mean, standard deviation, minimum value, and maximum value will be reported.

PROC MEANS DATA=data-set-name;
   VAR variables-list;

If desired, PROC MEANS also will report the variance, the coefficient of variation, the range, the skewness, and the kurtosis (and more, if desired). If observations can be grouped together using certain variables, then a CLASS statement can be used to obtain summary statistics across each classification grouping (without sorting the data!).

PROC MEANS DATA=data-set-name
      N MIN MAX RANGE MEAN VAR
      STD CV SKEWNESS KURTOSIS;
   VAR variables-list;
   CLASS class-variables-list;

Run PROC FREQ to obtain a one-way frequency table of counts and percentages. This report is particularly helpful for analyzing qualitative types of data.

PROC FREQ DATA=data-set-name;
   TABLES variables-list;

Also, run PROC CHART to produce a visual summary of the data. Printer graphics may not be presentation quality, but they do not require much time or special equipment, and their results can be very powerful. PROC CHART can be used for displaying both qualitative and quantitative types of data.

PROC CHART DATA=data-set-name;
   VBAR variables-list / option;

or

PROC CHART DATA=data-set-name;
   HBAR variables-list / option;

In using PROC CHART, the data analyst may want to take control of the horizontal axis, to ensure that gaps in numeric values are noted, and of the vertical axis, to facilitate comparisons between similar graphs.

PROC CHART DATA=data-set-name;
   VBAR variables-list / MIDPOINTS=xx TO yy BY zz AXIS=uu vv;

A histogram is a particular bar chart in which the range of data values is divided into intervals of equal length, and in which bars are used to represent the frequency of the observations in each interval. The preceding syntax will produce a histogram. In the VBAR statement above, an alternative to the MIDPOINTS= option would be to use the LEVELS= option. In either case, the number of intervals should be chosen so as to display just enough detail as will be meaningful to the data analyst, without being overwhelming.

If observations can be grouped together using certain variables, then it also will be useful to picture the data using a block chart.

PROC CHART DATA=data-set-name;
   BLOCK variable / GROUP=class-variable;

Data Exploration Using PROC UNIVARIATE

The most useful exploratory procedure is PROC UNIVARIATE. This comprehensive procedure can be used to generate descriptive statistics, a frequency table, a list of extreme values, some interesting plots, and a comparison of the cumulative frequency distribution with a normal distribution. To produce box-and-whisker plots and stem-and-leaf displays, invoke PROC UNIVARIATE using the PLOT option.

PROC UNIVARIATE DATA=data-set-name FREQ PLOT;
   VAR variables-list;

A stem-and-leaf display is one way to convey the shape of the distribution, as well as the value of each observation of the variable. A stem-and-leaf display is similar to a horizontal bar chart, except that instead of using bars, the next digit of the number after the "stem" is used. To interpret a stem-and-leaf display, follow the instructions printed beneath the display.

Box-and-whisker plots (also referred to as "boxplots" and "schematic plots") present a visual representation of some of the more important summary statistics. The top and bottom of the box describe the interquartile range [the difference between the 25th (Q1) and the 75th (Q3) percentiles] of the distribution. The horizontal line inside the box represents the median value [the 50th percentile (Q2)], and the plus sign indicates the mean. The vertical lines emanating from the box (called "whiskers") extend up to 1.5 times the interquartile range [that is, from Q1 down to Q1 - 1.5*(Q3 - Q1), and from Q3 up to Q3 + 1.5*(Q3 - Q1)]. A data value which is more than 1.5 interquartile ranges but within 3 interquartile ranges beyond the box is represented by a zero. Data values which exceed 3 interquartile ranges are represented by asterisks. In Version 6 of SAS, if the magnitudes of the variables are comparable, full-page, side-by-side boxplots will be produced whenever PROC UNIVARIATE is invoked with the PLOT option and with a BY statement.

PROC UNIVARIATE DATA=data-set-name FREQ PLOT;
   VAR variables-list;
   BY class-variable;

With a little practice, the data analyst can use these special plots to visualize the essential features of a distribution of data values.

Example #1

Consider the following data (Friendly, pp. 4-8):

DATA FRIENDLY;
   INPUT I SET1 SET2 SET3 SET4;
   CARDS;
 1  40.50  41.64  35.00  44.50
 2  41.50  58.36  37.00  45.00
 3  42.50  42.29  42.00  45.50
 4  43.50  57.71  53.90  46.00
 5  44.50  42.93  53.00  46.50
 6  45.50  57.07  50.60  47.00
 7  46.50  43.57  50.50  47.50
 8  47.50  56.43  53.80  48.00
 9  48.50  44.21  52.50  48.50
10  49.50  55.79  53.60  49.00
11  50.50  44.86  50.40  49.50
12  51.50  55.14  52.20  50.00
13  52.50  45.50  52.70  50.50
14  53.50  54.50  52.40  51.00
15  54.50  46.14  52.70  51.50
16  55.50  53.86  51.40  52.00
17  56.50  46.79  53.80  52.50
18  57.50  53.21  52.90  53.00
19  58.50  47.43  56.81  72.71
20  59.50  52.57  42.79  49.79
;

The four variables, SET1-SET4, have the interesting property of sharing the same mean (μ=50) and standard deviation (σ=5.92), yet the distributions of their values certainly do not appear to be the same. What are their differences?

First of all, here is the output from PROC MEANS for this SAS data set:

Variable    N          Mean       Std Dev       Minimum       Maximum
---------------------------------------------------------------------
SET1       20    50.0000000     5.9160798    40.5000000    59.5000000
SET2       20    50.0000000     5.9175546    41.6400000    58.3600000
SET3       20    50.0000000     5.9159917    35.0000000    56.8100000
SET4       20    50.0000000     5.9162497    44.5000000    72.7100000
---------------------------------------------------------------------

We notice that the maxima and minima for the four variables differ from one another.

Here are the stem-and-leaf displays and box-and-whisker plots for these data, obtained from PROC UNIVARIATE:

[Figure: stem-and-leaf displays and boxplots for the variables SET1, SET2, SET3, and SET4]

Observe that boxplots provide very little information about the data values in the vicinity of the distribution's middle values. Some experienced data analysts have learned to compare the length of the whiskers to the length of the box - whiskers which are too short may be a warning of anomalies near the center.

PROC UNIVARIATE also generates statistical measures which pertain to the central tendency, dispersion, and shape of the data distribution. These statistics are not displayed here, due to lack of space, but they are important elements of the data analysis.

Here are side-by-side boxplots for these data:

[Figure: side-by-side boxplots of the four data sets, by SET-NUMBER]

This example demonstrates a pitfall of relying too much on the mean and standard deviation to characterize abnormal data - numerical summaries of data can be misleading!

The data values for variable SET1 are uniformly distributed on the interval [40.5, 59.5]. The mean and the median are identical, and the distribution is symmetric (skewness=0). The negative kurtosis measure indicates that the tails of this distribution are lighter than for a normal distribution.

The observations of SET2 are distributed uniformly over two intervals, with a substantial gap separating the two clusters of data. As with SET1, the distribution is symmetric, and the kurtosis measure is negative.

The data values for variable SET3 are distributed less evenly than those of either SET1 or SET2. The mean and the median are distinct from one another, and the negative skewness measure indicates that more data points are clustered at the upper end than at the lower end of the distribution. The positive kurtosis measure indicates that the tails of this distribution are heavier than for a normal distribution. Indeed, we notice that there are some small data values which are fairly distant from the mean, compared to other data values.

SET4 has data values which are almost uniformly distributed over the interval [44.5, 53.0]. We notice that there are two irregular values, 49.79 and 72.71. The outliers cause the mean to be greater than the median. The positive skewness measure indicates that data values located to the right of the mean are more spread out than the data values to the left of the mean. The positive kurtosis measure (which is larger than the kurtosis of SET3) denotes the heavy tail of this distribution, which is attributable to the larger deviant value.

Examining Relationships Between Variables

With quantitative data, it often is important to determine whether or not a relation exists between two or more variables. And, if they are related, it also is desirable to measure the strength of the relationships among them. This would be useful, for example, if one was trying to estimate the values of one variable from known or assumed data values of other variables. A measure of the strength of the relationship between two variables is the correlation coefficient, which is a number between -1 and +1. Positive correlation coefficients indicate a direct relationship, and negative coefficients indicate an inverse relationship. You are cautioned that just because two variables may be highly correlated, this does not imply that a cause-and-effect relation necessarily exists between them.

PROC CORR will compute correlation coefficients between all pairs of variables specified in the VAR list.

PROC CORR DATA=data-set-name;
   VAR variables-list;

Besides printing correlation coefficients for each pair of variables, PROC CORR also determines associated significance probabilities for each coefficient. These p-values are for testing the null hypothesis that the variables actually have zero correlation.

A scatter plot is a graphic representation of the relationship between a pair of quantitative variables. To create scatter plots with Base SAS, PROC PLOT is used, with a PLOT statement for each pair of variables.

PROC PLOT DATA=data-set-name;
   PLOT variable1*variable2='*';
   PLOT variable3*variable4='*';

Example #2

Consider the following data (from the SAS Sample Library, member PLOTLAB2):

DATA CRIME;
   TITLE 'Crime Rates Per 100,000 Population by State';
   INPUT STATE $ 1-15 POSTCODE $ MURDER RAPE ROBBERY
         ASSAULT BURGLARY LARCENY AUTO;
   CARDS;
Alabama         AL 14.2 25.2  96.8 278.3 1135.5 1881.9  280.7
Alaska          AK 10.8 51.6  96.8 284.0 1331.7 3369.8  753.3
Arizona         AZ  9.5 34.2 138.2 312.3 2346.1 4467.4  439.5
Arkansas        AR  8.8 27.6  83.2 203.4  972.6 1862.1  183.4
California      CA 11.5 49.4 287.0 358.0 2139.4 3499.8  663.5
Colorado        CO  6.3 42.0 170.7 292.9 1935.2 3903.2  477.1
Connecticut     CT  4.2 16.8 129.5 131.8 1346.0 2620.7  593.2
Delaware        DE  6.0 24.9 157.0 194.2 1682.6 3678.4  467.0
Florida         FL 10.2 39.6 187.9 449.1 1859.9 3840.5  351.4
Georgia         GA 11.7 31.1 140.5 256.5 1351.1 2170.2  297.9
Hawaii          HI  7.2 25.5 128.0  64.1 1911.5 3920.4  489.4
Idaho           ID  5.5 19.4  39.6 172.5 1050.8 2599.6  237.6
Illinois        IL  9.9 21.8 211.3 209.0 1085.0 2828.5  528.6
Indiana         IN  7.4 26.5 123.2 153.5 1086.2 2498.7  377.4
Iowa            IA  2.3 10.6  41.2  89.8  812.5 2685.1  219.9
Kansas          KS  6.6 22.0 100.7 180.5 1270.4 2739.3  244.3
Kentucky        KY 10.1 19.1  81.1 123.3  872.2 1662.1  245.4
Louisiana       LA 15.5 30.9 142.9 335.5 1165.5 2469.9  337.7
Maine           ME  2.4 13.5  38.7 170.0 1253.1 2350.7  246.9
Maryland        MD  8.0 34.8 292.1 358.9 1400.0 3177.7  428.5
Massachusetts   MA  3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan        MI  9.3 38.9 261.9 274.6 1522.7 3159.0  545.5
Minnesota       MN  2.7 19.5  85.9  85.8 1134.7 2559.3  343.1
Mississippi     MS 14.3 19.6  65.7 189.1  915.6 1239.9  144.4
Missouri        MO  9.6 28.3 189.0 233.5 1318.3 2424.2  378.4
Montana         MT  5.4 16.7  39.2 156.8  804.9 2773.2  309.2
Nebraska        NE  3.9 18.1  64.7 112.7  760.0 2316.1  249.1
Nevada          NV 15.8 49.1 323.1 355.0 2453.1 4212.6  559.2
New Hampshire   NH  3.2 10.7  23.2  76.0 1041.7 2343.9  293.4
New Jersey      NJ  5.6 21.0 180.4 185.1 1435.8 2774.5  511.5
New Mexico      NM  8.8 39.1 109.6 343.4 1418.7 3008.6  259.5
New York        NY 10.7 29.4 472.6 319.1 1728.0 2782.0  745.8
North Carolina  NC 10.6 17.0  61.3 318.3 1154.1 2037.8  192.1
North Dakota    ND  0.9  9.0  13.3  43.8  446.1 1843.0  144.7
Ohio            OH  7.8 27.3 190.5 181.1 1216.0 2696.8  400.4
Oklahoma        OK  8.6 29.2  73.8 205.0 1288.2 2228.1  326.8
Oregon          OR  4.9 39.9 124.1 286.9 1636.4 3506.1  388.9
Pennsylvania    PA  5.6 19.0 130.3 128.0  877.5 1624.1  333.2
Rhode Island    RI  3.6 10.5  86.5 201.0 1489.5 2844.1  791.4
South Carolina  SC 11.9 33.0 105.9 485.3 1613.6 2342.4  245.1
South Dakota    SD  2.0 13.5  17.9 155.7  570.5 1704.4  147.5
Tennessee       TN 10.1 29.7 145.8 203.9 1259.7 1776.5  314.0
Texas           TX 13.3 33.8 152.4 208.2 1603.1 2988.7  397.6
Utah            UT  3.5 20.3  68.8 147.3 1171.6 3004.6  334.5
Vermont         VT  1.4 15.9  30.8 101.2 1348.2 2201.0  265.2
Virginia        VA  9.0 23.3  92.1 165.7  986.2 2521.2  226.7
Washington      WA  4.3 39.6 106.2 224.8 1605.6 3386.9  360.3
West Virginia   WV  6.0 13.2  42.2  90.9  597.4 1341.7  163.3
Wisconsin       WI  2.8 12.9  52.2  63.7  846.9 2614.2  220.7
Wyoming         WY  5.4 21.9  39.7 173.9  811.6 2772.2  282.0
;

Now, here is the output from running PROC CORR against these data:

Crime Rates Per 100,000 Population by State

Correlation Analysis

7 'VAR' Variables:  MURDER  RAPE  ROBBERY  ASSAULT  BURGLARY  LARCENY  AUTO

Simple Statistics

Variable     N        Mean     Std Dev         Sum     Minimum     Maximum
MURDER      50      7.4440      3.8668       372.2      0.9000     15.8000
RAPE        50     25.7340     10.7596      1286.7      9.0000     51.6000
ROBBERY     50       124.1     88.3486      6204.6     13.3000       472.6
ASSAULT     50       211.3       100.3     10565.0     43.8000       485.3
BURGLARY    50      1291.9       432.5     64595.2       446.1      2453.1
LARCENY     50      2671.3       725.9      133564      1239.9        4467
AUTO        50       377.5       193.4     18876.3       144.4        1140

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 50

             MURDER     RAPE  ROBBERY  ASSAULT BURGLARY  LARCENY     AUTO
MURDER      1.00000  0.60122  0.48371  0.64855  0.38582  0.10192  0.06881
                0.0   0.0001   0.0004   0.0001   0.0057   0.4813   0.6349
RAPE        0.60122  1.00000  0.59188  0.74026  0.71213  0.61399  0.34890
             0.0001      0.0   0.0001   0.0001   0.0001   0.0001   0.0130
ROBBERY     0.48371  0.59188  1.00000  0.55708  0.63724  0.44674  0.59068
             0.0004   0.0001      0.0   0.0001   0.0001   0.0011   0.0001
ASSAULT     0.64855  0.74026  0.55708  1.00000  0.62291  0.40436  0.27584
             0.0001   0.0001   0.0001      0.0   0.0001   0.0036   0.0525
BURGLARY    0.38582  0.71213  0.63724  0.62291  1.00000  0.79212  0.55795
             0.0057   0.0001   0.0001   0.0001      0.0   0.0001   0.0001
LARCENY     0.10192  0.61399  0.44674  0.40436  0.79212  1.00000  0.44418
             0.4813   0.0001   0.0011   0.0036   0.0001      0.0   0.0012
AUTO        0.06881  0.34890  0.59068  0.27584  0.55795  0.44418  1.00000
             0.6349   0.0130   0.0001   0.0525   0.0001   0.0012      0.0

Notice the relatively large correlations between the pairs of variables: BURGLARY & LARCENY, and ASSAULT & RAPE; and the relatively small correlations between AUTO & MURDER, and LARCENY & MURDER.

Here are a couple of scatter plots which reflect the indicated strength-of-relationship measures:

[Figure: scatter plot of MURDER against LARCENY; the symbol used is '*']

[Figure: scatter plot of BURGLARY against LARCENY]

Conclusion

Base SAS software includes several easy-to-use graphical and statistical procedures which can be used to summarize and analyze data. The fundamental methods of exploratory data analysis can be used to uncover the shape of a distribution of data values.
In order to comprehend a set of data values, it is not good enough to rely solely on numerical summary statistics for central tendency and dispersion.

References

Michael Friendly (1991), SAS System for Statistical Graphics, First Edition, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1990), SAS Procedures Guide, Version 6, Third Edition, Cary, NC: SAS Institute Inc.

Sandra D. Schlotzhauer & Ramon C. Littell (1987), SAS System for Elementary Statistical Analysis, Cary, NC: SAS Institute Inc.

John W. Tukey (1977), Exploratory Data Analysis, Reading, MA: Addison-Wesley.

SAS, SAS/INSIGHT, SAS/LAB, and JMP are registered trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
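As a supplementary sketch for Example #1, the following shows one way the side-by-side boxplots of SET1-SET4 might be produced, using the BY-statement form of PROC UNIVARIATE that the paper describes. It is not taken from the original program. It assumes that the FRIENDLY data set from Example #1 already exists, and the data set name LONG and the variable names SETNUM and VALUE are hypothetical names introduced here for illustration only.

```sas
/* Sketch only: assumes WORK.FRIENDLY (I, SET1-SET4) from Example #1.   */
/* LONG, SETNUM and VALUE are hypothetical names, not from the paper.   */

/* Stack the four variables into one column, tagged by set number.      */
DATA LONG;
   SET FRIENDLY;
   ARRAY S{4} SET1-SET4;
   DO SETNUM = 1 TO 4;
      VALUE = S{SETNUM};
      OUTPUT;
   END;
   KEEP SETNUM VALUE;
RUN;

/* BY-group processing expects the data sorted by the BY variable.      */
PROC SORT DATA=LONG;
   BY SETNUM;
RUN;

/* PLOT requests the stem-and-leaf display and boxplot for each group;  */
/* with a BY statement the paper reports that side-by-side boxplots     */
/* are printed as well (Version 6 line-printer output).                 */
PROC UNIVARIATE DATA=LONG PLOT;
   VAR VALUE;
   BY SETNUM;
RUN;
```

Because the four sets share the same mean and standard deviation, any differences between the groups in this output should appear in the plots and in the shape statistics (skewness and kurtosis) rather than in the first two moments.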