SAS Essentials III: Statistics for Hamsters

AnnMaria De Mars, The Julia Group, Santa Monica, CA

ABSTRACT

"For next year, I would like to see a workshop offered on statistics so easy a hamster can understand it. Bring your own hamster." - Workshop Attendee

Here it is! No actual hamsters were involved, but the statistics in this session were all previously presented to many classes of middle school students (who sometimes have the attention span of hamsters). The examples use the national dataset and a 1% sample of California residents, data from the American Community Survey, downloaded from the U.S. Census website. Teachers suggested questions of interest to students, including employment, income and education by race and ethnicity. Statistics were produced using SAS. Graphics were created using JMP 8 and SAS 9.2. These results were incorporated into a lesson that used SAS output and graphics to illustrate the concepts of frequency distribution, histogram, mean, median, mode, pie charts, correlation, sample selection and group differences. This session is recommended not only for those interested in a refresher in basic statistics but also for anyone who would like to apply their SAS skills to supporting their local schools by volunteering as a guest speaker. Integrating SAS with the curriculum can show students applications of programming and statistics to the social studies, science and mathematics they are learning in school and to issues important in their own lives. Try it! You may be overwhelmingly surprised by the welcome you get from teachers and students alike.

INTRODUCTION

My favorite comment on the workshop evaluations from last year's Western Users of SAS Software conference was in answer to the question about what attendees would like to see next year. One person wrote, "For next year, I would like to see a workshop offered on statistics so easy a hamster can understand it. Bring your own hamster." There are several reasons you may wish to have hamster-level statistics. First, you never had a statistics course, or it was so long ago that you were sitting between Fred Flintstone and Barney Rubble. You know that the mean is a bad measure for average income, but you aren't certain why. Second, perhaps you have young relatives and have been drafted to explain sampling error to a sixth-grader who has never heard the term in his or her life. If you believe this is easy, your own sample of sixth-graders studied must be very small. Or maybe you need to explain statistics to co-workers who, while significantly smarter than hamsters, are no more interested in statistics. Third, you may be interested in volunteering to help your local schools, but you'd like some information, activities and examples that might help you, rather than just showing up and saying, "I'm a statistician and I'm here to help." Whether you're learning statistics for the first time or trying to explain them to someone else, having to learn the SAS code that creates the statistics and then interpret the output adds a whole additional layer of time. No actual hamsters were involved, but the statistics in this session were all previously presented to 18 classes of middle school students (who sometimes have the attention span of hamsters). SAS code is provided.
The project discussed in this paper came about through the juxtaposition of three random facts: I read the Los Angeles Times, I am fully convinced of the possibilities of open data, and at one point I worked on the third floor of a building where I could look down directly into a low-performing urban middle school. You'd have to be on a desert island not to be aware of the crisis in American education. As we have already established, I was not on a desert island but rather in an office building in downtown Los Angeles, reading the Times on my iPad when I probably should have been writing a SAS program. There is also some evidence that we have forced teachers to focus so much on students' ability to answer certain types of test questions that we haven't allowed them time to teach some really important concepts, like how to formulate their own questions and how to apply the information they are learning. Both of these problems - children lacking support for their education, and teachers without adequate time to spend on anything other than rote learning - are compounded and almost insurmountable at certain schools. Added to all of this is substantial research showing that the idea that some people "are just not good at math" is a myth (Hersh & John-Steiner, 2011). On the contrary, people who are very good at math just spend a lot of time doing it (Dehaene, 2011). This all bothered me, so I called a teacher at one of the middle schools featured in the newspaper and offered my assistance. Since then, I have given some version of this presentation 18 times, at three different urban middle schools in two states.

FREQUENCY DISTRIBUTIONS (BEFORE THE SEMI-COLON PART)

Before starting any presentation with statistics, it is crucially important that you explain each term and, as much as possible, get the students involved. For the initial activity illustrating what a frequency distribution is, I tried to come up with a question everyone could answer - and one the kids would be interested in the answer to, both because it isn't necessarily something you know about everyone in your class, and also because in middle school, students are very interested in whether or not they are "normal", in how they fall relative to others. I began by asking each student how many people lived in his or her home. I drew a graph on the board and as each person told me the answer, I put an X on the graph to indicate their family. The very first point I want the students to understand is that each of those points represents something about one person. The second point is that what it represents is the answer to a question. There is no one at zero or one because, at the very least, you must live in your home, and in the seventh grade, no one lives alone. So, one of the first uses of statistics can be seen right here: you can tell right away if people are lying if you see out-of-range values. So, now we have our histogram, which is the chart of the answers people gave by the frequency of those answers. Some people also call this a bar chart. (This is the part where I draw bars around the X's). The mode is the most common score. We can see on our chart here that the mode is --- (ask for a student volunteer) --- the mode is 5. Any time you're looking at a chart of the data, it's easy to see which is the most common score. It is the highest bar. If there are two that are equally high, your distribution is called bi-modal, which means it has two modes. The median is the score that half of the people score below, and half score above. There are 17 people in this class. If we line everyone in a distribution up from lowest to highest, the ninth person will be at the median: 8 people will be higher and 8 people will be lower. The median in this distribution is also 5. The median and mode are two measures of what it means to be average, what statisticians refer to as "central tendency", that is, what does the center, or average, tend to be like?
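If you want to reproduce this classroom exercise in SAS, the minimal sketch below uses made-up answers to the "how many people live in your home?" question (the data set name class_sizes and all of the values are hypothetical). PROC FREQ gives the frequency distribution, and PROC UNIVARIATE reports the mean, median and mode and can draw the histogram.

* Hypothetical answers for a class of 17 students ;
DATA class_sizes ;
   INPUT household_size @@ ;
   DATALINES ;
2 3 3 4 4 4 5 5 5 5 5 5 6 6 7 8 9
;
RUN ;

* Frequency distribution - one row per answer, like the X's on the board ;
PROC FREQ DATA = class_sizes ;
   TABLES household_size ;
RUN ;

* Mean, median and mode, plus a histogram of the distribution ;
PROC UNIVARIATE DATA = class_sizes ;
   VAR household_size ;
   HISTOGRAM household_size ;
RUN ;

With these made-up values, the mode and the median both come out to 5, matching the chalkboard example.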
OPEN DATA TO INTRODUCE STATISTICS

THE CENSUS BUREAU OFFERS AN ANSWER KEY FOR REAL-LIFE, OR DO TRY THIS AT HOME

Now that we have an understanding of distributions and measures of central tendency, let's use open data to pursue this a little more in depth. "Open data" is data freely available to anyone to use or publish. I am a big proponent of the use of open data, and all of the examples presented here use these data. There are many advantages to using open data, principal among them that it is free, as in free beer, and exists in a dazzling diversity of sizes, topics and formats. No matter what SAS statistic or technique you want to learn to use, there is an open data set on the Internet you can download and use for it.

It can be disconcerting for new graduates performing statistical analyses to realize there is no one to tell them whether their results are correct. I still remember the meeting with my graduate advisor shortly before I sent off my first article for publication in a scientific journal. I asked him if the results section was correct. He looked at me over the top of his glasses and said, "Well, I certainly hope so." Then he added, "Young lady, there's no answer key for life." That was before open data. An often overlooked advantage for anyone just beginning to use SAS for large data sets, or statistics, is that there may be published statistics for at least some of your analyses to check your results against to see if you are on the right track. For example, the U.S. Census Bureau publishes results for some selected variables on-line for you to check your results. Go to the PUMS documentation website http://www.census.gov/acs/www/data_documentation/pums_documentation/ Click the + next to "Help with Using PUMS" to expand this category. Click on the link in the sentence "Data users who have doubts about the way they are computing estimates should attempt to reproduce the estimates that are provided in the Verification Files available in PUMS documentation" and you'll see the options for user verification. Click the LST option (the second one) for the year that you are using and a page will pop up that tells you the correct estimates for the U.S. and each state. Voila! Answer key for real life. Your results should match exactly with what is in that first column. For example, I selected the 2009 Public Use Microdata Set (PUMS). In the LST file under estimates for the United States, I see this:

State of current residence = 00 (United States)

Characteristic            2009 PUMS Estimate    2009 PUMS SE    2009 PUMS MOE
Total population                 307,006,556             289              476
Total males (SEX=1)              151,373,350          19,915           32,760
Total females (SEX=2)            155,633,206          19,963           32,839

When I ran my SAS program, I compared the results I obtained to these estimates for total population, males and females. As you can see in the example below, I did match the answer key.

EXAMPLE 1: USING SAS AND THE AMERICAN COMMUNITY SURVEY TO TEACH ABOUT RACE

This example uses the 2009 American Community Survey Public Use Microdata Sample (U.S. Census, 2009).
The Public Use Microdata Sample (PUMS) contains a sample of actual responses to the American Community Survey (ACS). There are 3,030,728 records in the data set. For those of you who are extremely selfish and hate children, you are now beginning to see that, even for you, there are advantages to analyzing open data, in that it provides opportunities to show your skills using large amounts of data.

Step 0: Verify your data and use the correct weight

Extensive detail on how to read data into SAS, verify your data quality and prepare your data for analysis is given in two other papers (De Mars, 2011a; De Mars, 2011b), so we'll assume here that you are beginning with a nice SAS data set with no data problems. This step will produce the exact same estimates as the Census.

LIBNAME lib "C:\Users\AnnMaria\Documents\2009Pums\sasdata" ;

PROC FREQ DATA = lib.pums9 ;
   TABLES sex / OUT = testfreq ;
   WEIGHT pwgtp ;

The PROC FREQ statement will invoke the procedure to create a frequency distribution. The TABLES statement specifies the variables for which you want frequencies. The OUT = option will output the count and frequency for each level to a data set. The WEIGHT statement specifies the weight given to each observation, and it is extremely important. If your counts come out wildly incorrect, it is almost certain that you left out this statement. The code above will give you the extremely ugly output below:

The FREQ Procedure

                                      Sex

                                      Cumulative    Cumulative
SEX     Frequency     Percent          Frequency       Percent
---------------------------------------------------------------
  1      1.5137E8       49.31           1.5137E8         49.31
  2      1.5563E8       50.69           3.0701E8        100.00

Scientific notation, as we all learned in some class that is now a distant memory, is of the form a * 10^b and is used to represent either very large or very small numbers. Because computers and calculators had difficulty with superscripts (remember, PROC FREQ dates back to the days of line printers), the letter E has been used to stand in for "10 to the power of" the number that follows it. So, 1.5137E8 is equal to 1.5137 * 10^8, or 151,370,000. Now, this is very close to 151,373,350, but I want to be precise. This is where I am glad I saved the output to a data set. I can go to the explorer window, open my SAS data set and see the exact counts: my estimates of the population distribution for gender match exactly, right down to the person.
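If you prefer not to open the data set in the viewer, a short PROC PRINT with a COMMA format will show the exact counts as well. This is just a sketch; it assumes the output data set testfreq created by the OUT = option above (PROC FREQ writes the counts and percentages to variables named COUNT and PERCENT).

* Print the exact weighted counts without scientific notation ;
PROC PRINT DATA = testfreq NOOBS ;
   VAR sex count percent ;
   FORMAT count COMMA15. percent 6.2 ;
RUN ;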
STEP 1: Frequency Distribution by Race

In 2009, people were allowed to check more than one race on their census forms. Now, here is where I very strongly believe we teach statistics wrong: as just a set of facts and figures. Statistical analysis, as most adults do it in their work, is done to answer questions. We very, very seldom in classrooms begin discussions of statistics with, "What do you think?" Here are our four questions:

1. What percentage of the population considers their race to be "Black"?
2. What percentage of the population considers their race to be "White"?
3. What percentage of the population considers their race to be neither black nor white?
4. What percentage of the population considers their race to be both?

Don't answer these questions. Ask the students their opinions. Allow (civil) arguments to break out. What you want to do is build drama. I learned this from a book by actor Alan Alda (2007), where he gave the example of building drama by having a student walk across the stage holding a glass of water. On the second pass across the stage, he filled the glass to the top and told the student, "If you spill even one drop of this water, every person in your village will be killed." His point was that not knowing what will happen adds drama to a situation. We have asked the students about an emotionally charged issue, race, had them go out on a limb a little bit to make guesses about it, even argue with their peers about whose guess is right. After letting them debate for a while (but before name-calling starts, hopefully), the students are asked, "Well, do you want to know what I found when I analyzed the census data?"

WE PAUSE BRIEFLY TO TALK ABOUT WEIGHTS, BECAUSE THEY ARE VERY IMPORTANT

The first table is shown below. It can be seen that those who are neither black nor white constitute 10.4% of the total population. People who consider themselves white, and no other race, are 76.3% of the population, 12.6% are black and less than 1% consider themselves to be both races.

Table 1 Frequency Distribution by Race, Weighted

Race includes Black    Race includes White    2009 Population    Percent of Population
No                     Yes                        234,175,873                     76.3
Yes                    No                          38,805,561                     12.6
No                     No                          31,876,214                     10.4
Yes                    Yes                          2,148,908                      0.7

The SAS code to produce this table was discussed in a previous paper on making better-looking results (De Mars, 2011b). Here, I only want to mention the PROC FREQ and, specifically, the WEIGHT statement.

PROC FREQ DATA = lib.pums9 ;
   TABLES racblk * racwht / OUT = lib.blkwhitmix ;
   WEIGHT pwgtp ;

The percentages above are correct. They are correct because I used the correct weights. What does it mean to "weight a sample"? What's a sample, anyway? A population is everyone you are interested in, in this case, everyone in the United States. A sample is a part of the population; in this case, about 1% of the population was sampled for this survey, that is, 3,030,728 out of 307,006,556, or 1 out of 101, to be precise. Look at the results when we reproduce the table without the WEIGHT statement.

Table 2 Frequency Distribution by Race, Unweighted

Race includes Black    Race includes White    2009 Population    Percent of Population
No                     Yes                          2,415,930                     79.7
Yes                    No                             313,411                     10.3
No                     No                             283,380                      9.4
Yes                    Yes                             18,007                      0.6

Obviously the population numbers are wildly off. You might think it would just be a simple matter of multiplying every number by 101. That would work to give us the correct total. However, compare the percentages in the first two tables. The percentage of the population that checks White for race, and not black, is higher than in the previous table. Every other group is lower. Let's take a simple example to show the importance of not just weighting, but correct weights. Let's just pretend for the moment that America is 80% white, 10% black and 10% other and that we have 300,000,000 people in America. As you can see from the tables above, those figures are not too far off. We collect a sample of 1,000,000 people. If we have a representative sample it will be 800,000 white + 100,000 black + 100,000 other. What if that doesn't happen? It usually doesn't. Usually, we get instead something like this: 800,000 white + 50,000 black + 150,000 other. In that case, every white person would have a weight of 300, every black person would have a weight of 600 and every "other" person would have a weight of 200.

Table 3 Example of Weights for Hypothetical Sample

Race       Sample      Weight      Population      Percent
White      800,000        300     240,000,000          80%
Black       50,000        600      30,000,000          10%
Other      150,000        200      30,000,000          10%
TOTAL    1,000,000       ----     300,000,000         100%
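As a minimal sketch, using only the hypothetical numbers from Table 3 (not the real PUMS weights), each group's weight is simply the group's population count divided by the number of people from that group who ended up in the sample:

* Hypothetical numbers from Table 3: weight = population count / sample count ;
DATA weights ;
   INPUT race $ sample_n population ;
   weight = population / sample_n ;
   DATALINES ;
White 800000 240000000
Black 50000 30000000
Other 150000 30000000
;
RUN ;

PROC PRINT DATA = weights NOOBS ;
RUN ;

Printing the data set shows weights of 300, 600 and 200, the same values worked through below.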
Why 600? The weight for the whole population was 300. Because black people were only half as likely to answer, we need to multiply their weight by two. Why 200? Because the "other" group was 1.5 times as likely to answer, we need to multiply the weight by 2/3. In any decently designed and disseminated survey you won't need to calculate the weights; that will already have been done, and you just need to know which variable is the weight. As long as you use the correct weight, you will get the right answer. On the other, more depressing, hand, if you don't use the weight variable, you will get the wrong answer. Does this mean that you personally need to figure out the weights? No, thank God. Any survey where this is important should have the weight variable already included. All you need to know is which variable in your data set is the weight variable, which you can almost always find out by doing a PROC CONTENTS. If all else fails, you can read the codebook or other documentation for the survey. Then just include that variable on your WEIGHT statement every time you do any kind of statistical analysis with SAS.

THE MODE AND CATEGORICAL VARIABLES

The mode is one measure of central tendency, that is, the "center" or average of a distribution. It is easy to see that the mode is "white". The mode, the most common score, is the only measure of "average" that makes sense when you are using categorical variables. The distinction between categorical and numeric variables is an important one in statistics. A categorical variable is one that differs only in quality, not quantity. A person can't have "more" or "less" race. You can't make ratio comparisons and say, "Juan is twice as much 'other' as Tanisha." (For the SAS code to produce this table, see the previous paper (De Mars, 2011b).) If we were going to use this graph and talk about the "average American", we would say the average American is white.

EXAMPLE 2: USING SAS AND THE AMERICAN COMMUNITY SURVEY TO TEACH ABOUT DISTRIBUTIONS

Answer this question and see if you do as well as the average urban middle school student. Out of 192 countries rated by the CIA Fact Book on income equality, where 1 = the most equal country in the world and 192 = the least equal country, where does the United States rank? Are you ready with your number? Below is our distribution of household income in the United States. This is a classic skewed distribution. When you have a skewed distribution, whether of income or anything else, you have most of the population at one end and then a very long "tail" going off in the other direction. This describes income distribution in the U.S. perfectly. Below is a graph of income distribution with all of the population making over $500,000 a year lumped together. It has the advantage that you can read the numbers, but the disadvantage of masking how skewed household income really is. The answer to the question is 92. The United States is smack in the middle in terms of income equality, or inequality, depending on how you want to look at it. (This is the point where, with students, we could try different ways of sorting the data, rounding to the nearest $50,000, lumping all of the people making over $250,000 together, and looking at how each affects the picture of our distribution.) Our SAS code for the charts above is shown below.
SAS CODE FOR FREQUENCY DISTRIBUTIONS FROM THE AMERICAN COMMUNITY SURVEY

PROC FREQ DATA = hous.hus9 ;
   TABLES hincp / OUT = meddist ;
   WHERE ten > '0' ;
   WEIGHT wgtp ;

ODS GRAPHICS ON ;

DATA graphinc ;
   SET meddist ;
   income = ROUND(hincp, 10000) ;
   IF hincp < 500000 THEN Household_Income = income ;
   ELSE Household_Income = 500000 ;

PROC FREQ DATA = graphinc ;
   TABLES Income Household_Income ;
   WEIGHT COUNT ;

The benefits of creating an output data set and using it for analysis were discussed in an earlier paper in this series (De Mars, 2011a). The first PROC FREQ step outputs the distribution of household income to a data set named meddist. The WHERE statement selects only records where the variable ten (tenure) is greater than 0, that is, occupied housing units. The WEIGHT statement, which we'll discuss more below, applies the appropriate weight. The ODS GRAPHICS ON statement will produce graphics for statistical procedures. In the DATA step, the variable income is created simply by using the ROUND function to round household income to the nearest $10,000. Household_Income is the same as the income variable, except that all households making $500,000 a year or more are lumped into a single category. That's it. There are no additional steps required to produce the graphs. Once you have ODS GRAPHICS ON, the graphics are produced automatically for the procedures that follow, until you turn graphics off.

MEASURES OF CENTRAL TENDENCY, OR "WHAT IS AVERAGE?"

There are three measures of central tendency, with the median and mean being the two most commonly used. The mean is generally preferred because it takes into account every score in the distribution. As you probably remember from some math class or other, to get the mean you add up all of the numbers in a sample and divide by the number of people in the sample. In mathematical terms this is:

mean = ( ∑ Xi ) / N

The median, on the other hand, is the midpoint of a distribution, the one that half of the people fall above and half fall below. When you have a skewed distribution, the median is preferred as a measure of central tendency. To understand why, think of this example. You have 21 people in a room; 20 of them are unemployed and have $0 in income. The twenty-first just had an IPO for his technology company and earned $21,000,000 this year. The mean income of those 21 people is $1,000,000. Hurray! Unemployment problem solved! Of course, in this case, the mean is thrown off by one person who is very extreme. A more accurate representation of the whole group of twenty-one people would be the median, which in this case is $0. Whenever you have a distribution with some very extreme scores (referred to as outliers), it is a better choice to use the median than the mean. How do we get the mean and median with SAS? There are several procedures you can use: PROC UNIVARIATE, PROC TABULATE, PROC MEANS or PROC SURVEYMEANS, to name just a few. Let's get some actual data and try these procedures. The results below were produced with PROC UNIVARIATE. It actually produced three tables, but since we are discussing basic statistics we're going to look at only two. Table 4, of weighted basic statistical measures, shown below, gives the mean, median and mode. The mean is about $69,000 and the median is $50,000. What does that tell you? Think back to our example with the 21 people and the one person with $21,000,000. That one person, referred to as an outlier, really pulled up the mean of our distribution. This is exactly what we have happening with income in the United States.
Table 4 Weighted Basic Statistical Measures

Location                          Variability
Mean        69025.63              Std Deviation              72846
Median      50000.00              Variance              5306585337
Mode            0.00              Range                    1777400
                                  Interquartile Range        63200

The median household income (household income is how much money everyone in the home gets, put together) is $50,000. That is, half the households have more income than that each year and half have less. The mean income, which we would get if we added up everyone's income and divided by the number of households, is $69,025. The mode, that is, the most common income, is $0.

MEASURES OF VARIABILITY

The next table shows the maximum, minimum and selected percentiles of the distribution.

Table 5 Weighted Quantiles

Quantile       Estimate
100% Max        1749000
99%              385000
95%              186500
90%              139400
75% Q3            88300
50% Median        50000
25% Q1            25100
10%               12000
5%                 7600
1%                    0
0% Min           -28400

The maximum household income for our sample was $1,749,000. That's the most any household reported. The 99th percentile does not mean you got 99 percent correct on a test. It means that you are higher than 99% of the population, or, another way to put it, that you are in the top 1%. So, the top 1% of households in America receive $385,000 a year or more. The top 5% have $186,500 a year, or more. If your income is higher than that of 50% of households in America, then you are making at least $50,000 a year. The fact that both of those numbers are 50 is just a coincidence. The fact that it is the same as the median is no coincidence. The median and the 50th percentile are the exact same thing. The bottom 1% of households have an income of $0 and the very lowest income in the sample is -$28,400. That is actually accurate. You can have a negative household income, for example, if you own a business and your business loses money that year. The range is the difference between the minimum and maximum, and if you subtract -$28,400 from $1,749,000 you get exactly our range of $1,777,400. All of this combines with the graphs we saw above to support our conclusion that income in America is quite skewed, if you can take a sample of 1% of the population and get a range of $1.8 million from the highest to the lowest. Since we're back looking at Table 4, let's discuss the standard deviation, which is the average amount by which people differ from the mean. The standard deviation is $72,846. That's a pretty large number. How can that be correct? The variance is $5,306,585,337, which is a national-debt-size number. How can that be? Well, because the formula for the variance is:

variance = ∑ (Xi - X̄)² / (n - 1)

In plain English: that sideways-W-looking thing, ∑, is the Greek letter sigma, meaning "sum of" to all mathematician types everywhere. The Xi denotes each individual's score. So, Xi is the score for the ith individual: X1 is for the first person, X2 for the second person and so on. The X̄, the X with a bar over it, is the symbol for the mean, which in this case is $69,025.63. The n stands for the number of people in our sample. Because we have a sample, and not the actual population, we need to divide by (n - 1). Truly, whether we divide by 3,030,727 or 3,030,728 is not going to make the slightest bit of difference, but we statisticians like to be precise about things. So, in English, the formula for the variance is this: "Take the sum of the squared differences from the mean and divide by the sample size minus one." When you have very large differences from the mean, say $1,749,000 - $69,026, and you square them, you get very large squared differences. One million squared is one trillion.
The problem everyone has with the variance is exactly that: it is squared, so it's not on the same scale as the mean income. After all, we're interested in how much the average person's income differs from the mean, not in the squared difference from the mean. So, we take the square root of the variance, and that gives us the standard deviation, the formula for which is shown below.

standard deviation = √( ∑ (Xi - X̄)² / (n - 1) )

The square root of $5,306,585,337, by the way, is $72,846, which, not coincidentally, is the exact value shown for the standard deviation in Table 4. All of this - the standard deviation, which says the "average" difference from the mean is large, the skewed picture we saw in the histogram, the difference between the mean and the median - comes together to point to a clear picture. We have a very unequal distribution of income in America.

SAS CODE FOR PROC UNIVARIATE FROM THE AMERICAN COMMUNITY SURVEY

PROC UNIVARIATE DATA = hous.psam09 VARDEF = WGT ;
   VAR hincp ;
   WEIGHT wgtp ;

The PROC UNIVARIATE statement calls the univariate procedure, which produces, you guessed it, univariate statistics. The DATA = option specifies the data set. The VARDEF = WGT option is very important in this case. It specifies that SAS will use the sum of weights for the denominator of the variance. If you leave it off, you will get the correct mean, but your variance, and the standard deviation, which is the square root of the variance, will be wrong. The VAR statement specifies the variables for which we want univariate statistics. In this case, there is just one, household income. The WEIGHT statement gives the weight variable. If you leave it off, your mean, median and percentiles will almost certainly be wrong.

EXAMPLE 3: USING SAS AND THE AMERICAN COMMUNITY SURVEY TO TEACH BI-VARIATE ANALYSES

MEASURES OF CENTRAL TENDENCY OF INCOME BY RACE, USING GRAPH-N-GO

Up to this point we have been discussing univariate analyses, that is, analysis of one variable at a time, looking at household income or race. Let's move on now to bi-variate analyses, which simply look at two variables at a time. We saw, in our first example, that there are differences by race in the likelihood of being in the sample, which is why we needed the weights. We saw that there is an unequal distribution of income, which is very skewed. Do you suppose there is a difference by race? To answer the question of whether race matters in income, I used another open data set, the American Community Survey data for the state of California. This graph shows the mean income by race. Although the census data say that Hispanics can be of any race, I noticed when I analyzed the data for California that many more people than in the nation as a whole put "Other" for their race, and almost all of those people were Hispanic. Although the Census Bureau does not consider Hispanic to be a race, many Hispanics clearly did. So, I broke the data down into the largest groups in California: Asian, Black, Hispanic, Other and White. Then, I looked at the mean income for each group. This is not household income, but rather personal income. As you can see, the answer to the question, "Does race matter?" is clearly, "Yes." We just discussed the fact that the median is a better measure than the mean for skewed distributions. Maybe there are some really, really rich white people in California who are pulling up the mean. To test whether this was the case, I re-ran the analysis, but this time I selected the median personal income by race. This gave me the following chart.
As you can clearly see, there are still very large differences by race, although everyone's income is lower.

PRODUCING THE CHARTS OF INCOME BY RACE

First of all, you should note that Graph-N-Go hates large data sets. I tried opening a data set with several hundred thousand records and it ended up crashing. To prevent this from happening, first create a summary data set using PROC SUMMARY for the means and medians. Because Graph-N-Go uses whatever formats and labels are stored with the data set, use a DATA step to define these.

PROC SORT DATA = lib.california ;
   BY race ;

PROC SUMMARY DATA = lib.california MEAN MEDIAN ;
   VAR income ;
   BY race ;
   OUTPUT OUT = examp MEDIAN = median_income MEAN = average_income ;
   WEIGHT pwgtp ;

DATA examp ;
   SET examp ;
   LABEL average_income = "Income"
         median_income = "Median Income" ;
   FORMAT average_income median_income DOLLAR8.0 ;

The statements above will create a data set with the variables median_income and average_income, with one record for each race. The PROC SORT step sorts the data set by race; it must be sorted or the next step will give you an error. The PROC SUMMARY step is identical to PROC MEANS except that the default is not to produce printed output. The MEAN and MEDIAN options in the PROC SUMMARY statement request that these two statistics be calculated. The VAR statement specifies the variables for which you want these statistics; in this case it's only income. The BY statement requests the statistics by the variable(s) specified; in this case, it's only race. The OUTPUT statement names the output data set, the statistics to be written out to the data set and the names for those statistics. Pay attention here. You'd think that just because you specified MEDIAN, for example, in the PROC SUMMARY statement, the median would obviously be written out to the data set. You would be wrong. The DATA step applies the formats and labels I want used for the chart. Once the data set is created, go to Graph-N-Go, in the SOLUTIONS menu under REPORTING. To create your chart, drag the BAR GRAPH icon over to the graphing window, right-click on the empty box and select PROPERTIES. Select the DATA MODEL to use (this is the output data set you created above, in this case work.examp), then select the CATEGORY; in our case we want race. Select the RESPONSE variable, which is income. Select the statistic. For click-by-click directions on how to use Graph-N-Go, see De Mars (2010). The selection shown in the figure above produced our first graph, of mean income. There is one little catch here. Graph-N-Go doesn't have a choice for a "Median" statistic. How did we get the second graph, of median income? Well, remember that the median statistic was actually created in our PROC SUMMARY step above. In fact, there is only one record for each race. So, when I select "AVERAGE", it is really just going to show the value of that one number. I could have selected the SUM statistic and it still would have given me the same number.

EXAMPLE 4: USING SAS, JMP AND THE AMERICAN COMMUNITY SURVEY TO STUDY INCOME & RACE

Everyone's income is much lower when I use the median. This brings up a really, really important point in statistics: you should always know who your population is. Why is the median income so low? In this graph, I have included everyone in the state and compared the incomes by race. Should I have included everyone? What about people under age 16 or over age 65? They won't be working, will they?
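One quick way to explore that question is to restrict the summary to working-age respondents before comparing groups. The sketch below reuses the variable names from the PROC SUMMARY step above; the 16-to-65 age range and the output data set name working_age are assumptions for illustration only.

* A sketch only: limit the comparison to people of working age ;
PROC SUMMARY DATA = lib.california MEAN MEDIAN ;
   VAR income ;
   BY race ;
   OUTPUT OUT = working_age MEDIAN = median_income MEAN = average_income ;
   WEIGHT pwgtp ;
   WHERE 16 <= age <= 65 ;
RUN ;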
It is a fact that Hispanics are significantly younger than the non-Hispanic population and that African-Americans are significantly older than the white population. Could the differences in income be due to differences in age? To answer this question, I used JMP to create a chart of income by age for each race. So, the answer to the question, "Does race matter?" is "Yes." The answer to the question, "Can this be due to age?" is "Not entirely, for sure," because when you control for age, the differences still persist. You can see that the curves for each race are somewhat similar. Before 16, no one is making any money. From age 16 to somewhere between 30 and 50 (depending on the race), income goes up. Then, around age 60, income starts to drop as people retire. You can see that for whites and Asians the curve goes up more steeply than for the other three groups. Also, you can see that even when you control for age, the incomes for whites and Asians are higher. What else could explain the difference in income? Yes, racism is one answer, but are there others? To answer this question, we take a look at one more graph, also created with JMP. This is the same as our previous chart, but this time we're going to look at mean income by education. What we can clearly see is that much, although not all, of the difference in income disappears when you control for education. Once they have an MD, law degree or Ph.D., African-Americans and Asian-Americans make about the same amount. Non-Hispanic whites still make more than other groups, at every level of education, but the differences are greatly reduced. What about that little drop at the end, which shows, for all racial/ethnic groups except Hispanics, that Ph.D.s make less than those with MDs and law degrees? My suspicion is that there are a few people making millions who are pulling up the mean. (Remember our discussion about means and medians?) The chart below shows median income by education, and there I can see a straight-line relationship: the more education you get, the more money you make.

BI-VARIATE GRAPHS BY RACE USING JMP

Graphics with Graph-N-Go are easy but distinctly limited. SAS offers several other options for graphics. One that combines ease and flexibility is JMP. That's the good news. The bad news is that, if you want to do much programming, you have to learn a whole new language called JSL. OR ... you could use SAS to create the data set to analyze, as in all of the other examples above.

PROC SORT DATA = lib.california ;
   BY race age ;

PROC SUMMARY DATA = lib.california MEAN MEDIAN ;
   VAR income ;
   BY race age ;
   OUTPUT OUT = incomerace MEDIAN = median_income MEAN = average_income ;
   WHERE age > 15 ;

All that we have added to the previous example is the age variable on the BY statement (with the matching PROC SORT) and the WHERE statement. To export a file as a JMP data set, simply select EXPORT DATA from the FILE menu, point and click through the menus to select the data set to export (in this case, incomerace), and then select JMP file from the drop-down menu as the type of data set to export. If your organization does not have JMP licensed, two other options for examining statistics graphically are SAS Enterprise Guide and SAS/GRAPH. Of the two, SAS/GRAPH is far more flexible, but SAS Enterprise Guide has a much gentler learning curve.

EXAMPLE 5: TESTS OF SIGNIFICANT DIFFERENCE USING OPEN DATA ON OLYMPIC SPORTS PROGRAM

So far, we have been using "descriptive statistics". As the name implies, descriptive statistics simply describe what we observed. When we start to make inferences about the population as a whole, we are moving into the realm of "inferential statistics".
A main focus of inferential statistics is determining whether or not a result is significant. What significance means to a statistician is not "very important". In any two groups, it is not at all unexpected to have some difference occur completely at random. Say that in the population the means of two groups are exactly equal. Still, on any given day, some of the females would have been sick, would not have turned in their paperwork in time to be allowed to compete, would not have had money to travel to the event or, for hundreds of other possible reasons, would have missed the competition. Males, too, would get sick, lose paperwork, not have money and so on. The result is that samples from two groups are rarely exactly equal, even when, if we had everyone in the population, we'd find that the group means are equal. What's a poor statistician to do? The answer is that we calculate a test statistic and then find the probability of getting a statistic that large if the true difference in the population is zero. Many of the most common statistics, also called parametric statistics, make an assumption of a normal distribution, that is, that the distribution is not terribly skewed. As we have already seen, at length, income doesn't fit the assumption of normality at all. There are ways to get around this assumption, but the easiest, since we are just learning here, is to select a data set that does fit the assumption. The Census Bureau is not the only source for open data, nor are all open data sets extremely large. Many smaller non-profits are eager to have data analyzed to answer questions of interest to them. This next data set is of athletes competing in judo at the U.S. national championships from 1990 to 2011. Below we take a look at the frequency distribution of the number of male athletes competing during this period, and what we see is a very nice, normal distribution. One characteristic of a normal distribution is that the mean = the median = the mode. As you can see in the graph below, the mode occurs right around 220, the mean is 222, and the median falls in that same interval around 220 competitors. In a normal distribution, 95% of the population will fall within two standard deviations of the mean. Also, normal distributions are symmetrical, with observations occurring above the mean and below the mean with equal frequency.

SAS CODE FOR THE HISTOGRAM

Producing the histogram above was a piece of cake. I simply used these statements:

ODS GRAPHICS ON ;

PROC FREQ DATA = athletes ;
   TABLES competitors ;
   WHERE sex = "Males" ;

If you have yet to try ODS statistical graphics, I highly recommend you give them a look. All you need to do is include the statement ODS GRAPHICS ON before your statistical procedures, and SAS will automatically produce the most commonly requested graphics, with no additional programming required.

T-TEST PROCEDURE FOR TESTING FOR DIFFERENCE BETWEEN MEANS

The first question of interest to our non-profit organization is whether there is a significant difference between the numbers of competitors in the male and female divisions each year. To test for a difference in means between two groups, we compute a t-test. A t-test will give four tables of results. The first one is shown below.

Table 6 First PROC TTEST Table

sex            N        Mean     Std Dev     Std Err     Minimum     Maximum
Female        22     97.0909     15.8201      3.3729     66.0000       119.0
Male          22       222.8     45.2253      9.6421       146.0       324.0
Diff (1-2)          -125.7       33.8792     10.2150

There were 22 records for males and 22 for females.
The mean number of competitors each year was 97 for females, with a standard deviation of 15.8 and a range from 66 to 119. For males, the mean number of competitors was almost 223 per year, with a standard deviation of 45 and a range from 146 to 324. What exactly does a standard deviation mean? A standard deviation is the average amount by which observations differ from the mean. So, if you pulled out a year at random, you wouldn't expect it to necessarily have exactly 222.8 male competitors. In fact, it would be pretty tough on that last guy who was the .8! On the other hand, you would be surprised if that year there were only 150 competitors, or if there were 320. On the average, a year will be within 45 competitors of the 223 male athletes, and 95% of the years will be within two standard deviations, or roughly from 133 to 313. That is, 223 - (2 x 45) to 223 + (2 x 45).

WHAT IS THE STANDARD ERROR AND WHAT DETERMINES IT?

But what is the standard error? The standard error is the average amount by which we can expect our sample mean to differ from the population mean. If we take a different sample of years, say 1988-2009, 1991-2012, all odd-numbered years for the last 30 years and so on, each time we'll get a different mean. It won't be exactly 97.09 for women and 222.8 for men. Each time, there will be some error in estimating the true population value. Sometimes we'll have a higher value than the real mean. Sometimes we'll underestimate it. On the average, our error in estimation will be 9.64 for men and 3.37 for women. Why is the standard error for women lower? Because the standard deviation is lower. The standard error of the mean is the standard deviation divided by the square root of N, where N is your sample size. The square root of 22 is 4.69. If you divide 15.82 by 4.69, you get 3.37. Why the N matters seems somewhat obvious. If you had sampled several hundred thousand tournaments, assuming you did an unbiased sample, you would expect to get a mean pretty close to the true population mean. If you sampled two tournaments, you wouldn't be surprised if your mean was pretty far off. We all know this. We walk around with a fairly intuitive understanding of error. If a teacher gives a final exam with only one or two questions, students complain, and rightly so. With such a small sample of items, it's likely that there is a large amount of error in the teacher's estimate of the true mean number of items the student could answer correctly. If we hear that a survey found that children of mothers who ate tofu during pregnancy scored .5 points higher on a standard mathematics achievement test, and then find out that this was based on a sample of only ten people, we are skeptical about the results. What about the standard deviation? Why does that matter? The smaller the variation in the population, the smaller the error there is going to be in our estimate of the means. Let's go back to our sample of mothers eating tofu during pregnancy. Let's say that we found that children of those mothers had .5 more heads. So, the average child is born with one head, but half of these ten mothers had babies with two heads, bringing their mean number of heads to 1.5. I'll bet if that were a true study, it would be enough for you never to eat tofu again. There is very, very little variation in the number of heads per baby, so even with a very small N, you'd expect a small standard error in estimating the mean. The second table produced by the TTEST procedure is shown in Table 7 below.
Here we have an application of our standard error. We see that the mean for females is 97, with 95% confidence limits (CL) from 90.07 to 104.1. That 95% confidence interval runs from the mean minus two times the standard error to the mean plus two times the standard error. That is, 97.09 - (2 x 3.37) to 97.09 + (2 x 3.37). Why does that sound familiar? Perhaps because it is exactly what we discussed on the previous page about a normal distribution? Yes, errors follow a normal distribution. Errors in estimation should be equally likely to occur above the mean or below the mean. We would not expect very large errors to occur very often. In fact, 95% of the time, our sample mean should be within two standard errors of the population mean.

Table 7 Second PROC TTEST Table

sex          Method              Mean      95% CL Mean         Std Dev    95% CL Std Dev
Female                        97.0909    90.0766   104.1       15.8201   12.1713  22.6080
Male                            222.8      202.7   242.8       45.2253   34.7941  64.6299
Diff (1-2)   Pooled            -125.7     -146.3  -105.1       33.8792   27.9348  43.0609
Diff (1-2)   Satterthwaite     -125.7     -146.7  -104.7

The next two lines both say Diff (1-2), and both show that the difference between the two means is -125.7. That is, if you subtract the mean number of male competitors from the mean number of female competitors, you get negative 125.7. So, there is a difference of 125.7 between the two means. Is that statistically significant? How often would a difference this large occur by chance? To answer this question we look at the next table. It gives us two answers. The first method is used when the variances are equal. If the variances are unequal, we would use the statistics shown on the second line. In this instance, both give us the same conclusion, that is, the probability of finding a difference between means this large, if the population values were equal, is less than 1 in 10,000. That is the value you see under Pr > |t|, the probability of a greater absolute value of t. If you were writing this up in a report, you would say, "There were, on the average, 126 fewer female competitors each year than males. This difference was statistically significant (t = -12.30, p < .0001)."

Table 8 Third PROC TTEST Table

Method           Variances          DF     t Value     Pr > |t|
Pooled           Equal              42      -12.30       <.0001
Satterthwaite    Unequal        26.064      -12.30       <.0001

In this case the t-values and probabilities are the same, but what if they are not? How do we know which of those two methods to use? This is where our fourth, and final, table from the TTEST procedure comes into use. This is the test for equality of variances. The test statistic in this case is the F value. We see the probability of a greater F is < .0001. This means that we would only get an F-value larger than this 1 in 10,000 times if the variances were really equal in the population. Since the normal cut-off for statistical significance is p < .05, and .0001 is a LOT less than .05, we would say that there is a statistically significant difference between the variances. That is, they are unequal. We would therefore use the second line in Table 8 above to make our decision about whether or not the difference in means is statistically significant.

Table 9 Fourth PROC TTEST Table

Equality of Variances
Method       Num DF     Den DF     F Value     Pr > F
Folded F         21         21        8.17     <.0001

SAS CODE FOR THE TTEST PROCEDURE

PROC TTEST DATA = athletes ;
   CLASS sex ;
   VAR competitors ;

PROC TTEST requests that the t-test be performed using data from the specified data set. The CLASS statement gives the variable that identifies the two groups being compared. There must be two groups. No more, no less.
The VAR statement gives the variable that is being compared. There can be as many variables as you like, but they must be numeric, since you are comparing means.

EXAMPLE 6: TESTS OF LINEAR TREND USING OPEN DATA ON OLYMPIC SPORTS PROGRAM

There is one last question our non-profit organization would like answered. They have data showing that the difference between male and female competition has been declining over time. However, they suspect that this is not due to an influx of female competitors as much as to a decline in male competitors. SAS offers several procedures to test for a linear trend, including PROC CORR, PROC GLM and PROC REG. PROC GLM and PROC REG will always give identical results in terms of statistical significance and size of relationship. PROC CORR is primarily used only to compare two variables, and in a case such as this one it will also give results identical to GLM and REG. In this case, I used PROC REG. It provides several tables and charts, but there are only a few that are really needed to answer the question. First, is there a linear trend? In a word, yes. In Table 10 we see the parameter estimates produced by the REG procedure.

Table 10 Parameter Estimates from PROC REG

                       Parameter       Standard                             Standardized
Variable     DF         Estimate          Error     t Value     Pr > |t|        Estimate
Intercept     1            10627     2072.01102        5.13       <.0001               0
year          1         -5.20102        1.03574       -5.02       <.0001        -0.74678

We are really interested in the variable year, and we can see that it has a significant relationship with the dependent variable, which is the number of male competitors, with a probability of getting a greater t-value of less than .0001. The parameter estimate of -5.201 tells us that every year, we are predicted to have 5.2 fewer male competitors. Because the parameter estimate is negative, as the year value goes up, the number of male competitors goes down. The standardized estimate, shown in the last column, tells us that for every standard deviation increase in year, there is a .75 standard deviation drop in the number of competitors. The standard deviation for year is 6.5, so every 6.5 years, there will be about 34 fewer male competitors (.75 x the standard deviation of 45). In this particular case, the standardized estimate isn't really any more useful than our original parameter estimate, but that will not always be so. Assume we had two predictors: one is year and the other is the dollars in prize money available to players winning international medals. That could run from $0 to $40,000. If the parameter estimate for dollars is .052 and the estimate for year is -5.20, does that mean that the trend by year is 100 times more important than the amount of prize money available? After all, the number of competitors is going down 5.2 for every year and only going up .052 for every dollar in additional prize money. A reasonable person might argue that this doesn't make sense: there is a lot more variation in the number of dollars than in the number of years, so you can't compare the two directly. You can't. This is where the standardized estimate would be used.
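As a sketch only: prize_money is a hypothetical variable that does not exist in the judo data set, but if it did, a model with both predictors and the STB option would produce standardized estimates that can be compared directly.

* Hypothetical second predictor (prize_money is not in the real data set) ;
PROC REG DATA = athletes ;
   WHERE sex = "Male" ;
   MODEL competitors = year prize_money / STB ;
RUN ;

The STB option adds the Standardized Estimate column, which puts year and prize_money on the same standard-deviation scale.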
PROC REG also produces several graphs. The fit plot, shown below, plots the linear trend. The blue bands are the 95% confidence interval. We can see that several points fall outside of that interval, and the points above and below the blue bands are spaced at roughly equal distances. Generally, you would not see this many points falling outside the confidence bands. Even though the relationship is statistically significant and the standardized estimate is quite high, there seems to be more of a cyclical trend. In discussing this with the historian from the organization, he identified the high points as coming prior to Olympic years, when athletes are attempting to qualify for the Olympic games, and the low points as coming just after the Olympics, when many athletes decide to retire from competition. As for the high point in 1995, this was not only a pre-Olympic year, but the national championships were held in Hawaii, and "Everyone likes to go to Hawaii." This ability to spot cyclical trends and outliers is one benefit of ODS statistical graphics. Combined with the statistical tables, graphics can give a fuller picture.

SAS CODE FOR THE REG PROCEDURE

ODS GRAPHICS ON ;

PROC REG DATA = athletes ;
   WHERE sex = "Male" ;
   MODEL competitors = year / STB ;

CONCLUSION

Whether you are learning statistics for the first time or presenting statistical data to a group who, at least initially, have no more interest in the subject than the average hamster, the first exposure to statistics can be a challenge. This challenge is increased by the desire of most learners to analyze "real data", to see real-world applications of statistics. The good news is that some of this real-world data actually comes with "answer keys" that allow novices to check their results against published statistics. The better news is that, while learning statistics, these data can be used to help and inform the community. Several examples have been given in this paper. With the wealth of open data sources available, the only limit is the programmer's time and creativity. The even better news is that coding the procedures using SAS is usually the easiest part of the process by far. Use of graphics options such as Graph-N-Go, ODS statistical graphics and JMP can show that, with the latest software, statistics are more than just numbers, and can give a bigger, fuller picture - literally.

REFERENCES

Alda, A. (2007). Things I Overheard While Talking to Myself. Random House.

Dehaene, S. (2011). The Number Sense: How the Mind Creates Mathematics. Oxford University Press.

De Mars, A. (2010). From Novice to Intermediate in (Approximately) Sixty Minutes: III. Presentation. Paper presented at the annual meeting of the Western Users of SAS Software, San Diego, CA.

De Mars, A. (2011a). SAS Functions for a Better Functioning Community. Paper presented at the annual meeting of the Western Users of SAS Software, San Francisco, CA.

De Mars, A. (2011b). SAS Essentials II: Better-Looking SAS for a Better Community. Paper presented at the annual meeting of the Western Users of SAS Software.

Hersh, R. & John-Steiner, V. (2011). Loving and Hating Mathematics: Challenging the Myths of Mathematical Life. Princeton, NJ: Princeton University Press.

U.S. Census Bureau (2009). ACS 2009 1-Year PUMS File ReadMe: Overview of the Public Use Microdata Sample Files (PUMS). http://www.census.gov/acs/www/data_documentation/pums_documentation/

ACKNOWLEDGMENTS

Thank you to Kirby Posey of the U.S. Census Bureau for invaluable assistance in verifying the variable coding and estimates. Thanks also to Jerry Hays, United States Judo Federation historian, for providing the historical data on competitors in Olympic weight divisions.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

AnnMaria De Mars
The Julia Group
2111 7th St. #8
Santa Monica, CA 90405
(310) 717-9089
[email protected]
http://www.thejuliagroup.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.