SAS Essentials III: Statistics for Hamsters
AnnMaria De Mars, The Julia Group, Santa Monica, CA
ABSTRACT
"For next year, I would like to see a workshop offered on statistics so easy a hamster can understand it. Bring your own
hamster." - Workshop Attendee. Here it is! No actual hamsters were involved, but the statistics in this session were all previously
presented to many classes of middle school students (who sometimes have the attention span of hamsters). The examples use the
national dataset and a 1% sample of California residents, data from the American Community Survey, downloaded from the U.S.
Census website. Teachers suggested questions of interest to students, including employment, income and education by race and
ethnicity. Statistics were produced using SAS. Graphics were created using JMP 8 and SAS 9.2. These results were incorporated as
part of a lesson that used SAS output and graphics to illustrate the concepts of frequency distribution, histogram, mean, median,
mode, pie charts, correlation, sample selection and group differences. This session is recommended not only for those interested in a
refresher in basic statistics but also for anyone who would like to apply their SAS skills to supporting their local schools through
volunteering as a guest speaker. Integrating SAS with the curriculum can show students applications of programming and statistics to
the social studies, science and mathematics they are learning in schools and to issues important in their own lives. Try it! You may be
overwhelmingly surprised by the welcome you get from teachers and students alike.
INTRODUCTION
My favorite comment on workshop evaluations from last year’s Western Users of SAS Software conference was in answer to the
question about what the attendees would like to see next year. One person wrote,
“For next year, I would like to see a workshop offered on statistics so easy a hamster can understand it. Bring your own
hamster."
There are several reasons you may wish to have hamster-level statistics. First, you never had a statistics course, or it was so
long ago that you were sitting between Fred Flintstone and Barney Rubble. You know that the mean is a bad measure for average
income, but aren't certain why. Second, perhaps you have young relatives and have been drafted to explain sampling error to a sixth-grader
who has never heard the term in his or her life. If you believe this is easy, your own sample of sixth-graders studied must be very
small. Or maybe you need to explain statistics to co-workers who, while significantly smarter than hamsters, are no more interested in
statistics. Third, you may be interested in volunteering to help your local schools, but you'd like some information, activities and
examples that might help you, rather than just showing up and saying, "I'm a statistician and I'm here to help."
Whether you're learning to use statistics for the first time or trying to explain them to someone else, the need to learn the
SAS code to create the statistics and then interpret the output adds a whole additional layer of work. No actual hamsters were
involved, but the statistics in this session were all previously presented to 18 classes of middle school students (who sometimes have
the attention span of hamsters). SAS code is provided.
The project discussed in this paper came through the juxtaposition of three random facts: I read the Los Angeles Times, I am fully
convinced of the possibilities of open data and, at one point, I worked on the third floor of a building where I could look down directly
into a low-performing urban middle school. You’d have to be on a desert island not to be aware of the crisis in American education.
As we have already established, I was not on a desert island but rather in an office building in downtown Los Angeles reading the
Times on my iPad when I probably should have been writing a SAS program. There is also some evidence that we have forced
teachers to focus so much on students’ ability to answer certain types of test questions that we haven’t allowed them time to teach
some really important concepts, like how to formulate their own questions and how to apply the information they are learning.
Both of these problems, children lacking support for education and teachers without adequate time to spend on anything other than rote
learning, are compounded and almost insurmountable at certain schools. Added to all of this is substantial research to show that the
idea that some people “are just not good at math” is a myth (Hersh & John-Steiner, 2011). On the contrary, people who are very good
at math just spend a lot of time doing it (Dehaene, 2004). This all bothered me, so I called a teacher at one of the middle schools
featured in the newspaper and offered my assistance. Since then, I have given some version of this presentation 18 times, at three
different urban middle schools in two states.
FREQUENCY DISTRIBUTIONS (BEFORE THE SEMI-COLON PART)
Before starting any presentation with statistics it is crucially important that you explain each term and, as much as possible, get
the students involved. For the initial activity to illustrate what a frequency distribution is, I tried to come up with a question everyone
could answer and whose answer the kids would care about, both because it isn't necessarily something you know about everyone in
your class, and also because in middle school, students are very interested in whether or not they are "normal", in how they fall relative
to others. I began by asking each student how many people lived in his or her home. I drew a graph on the board and as each person told
me the answer, I put an X on the graph to indicate their family.
The very first point I want the students to understand is that
each of those points represents something about one person. The second point is that what it represents is the answer to a question.
There is no one at zero or one because, at the very least, you must live in your home, and in the seventh grade, no one lives alone.
So, one of the first uses of statistics can be seen right here. You can tell if people are lying right away if you see out-of-range values.
So, now we have our histogram which is the chart of the answers people gave by the frequency of those answers. Some people
also call this a bar chart. (This is the part where I draw bars around the X’s). The mode is the most common score. We can see on our
chart here that the mode is --- (ask for a student volunteer) --- the mode is 5. Any time you’re looking at a chart of the data, it’s easy to
see which is the most common score. It is the highest bar. If there are two that are equally high, your distribution is called bi-modal,
which means it has two modes. The median is the score that half of the people score below, and half score above. There are 17
people in this class. If we look at everyone in a distribution, from lowest to highest, the ninth person will be at the median, 8 people
will be higher and 8 people will be lower. The median in this distribution is also 5. The median and mode are two measures of what it
means to be average, what statisticians refer to as “central tendency”, that is what does the center, or average, tend to be like?
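If you would like to mirror this chalkboard exercise in SAS, a minimal sketch is shown below. The data set name, the variable name (household_size) and the values themselves are hypothetical stand-ins for whatever your own class reports; PROC FREQ produces the frequency distribution and PROC UNIVARIATE reports the mean, median and mode in its Basic Statistical Measures table.
* Hypothetical class data: one value per student, the number of people in that home ;
DATA class ;
INPUT household_size @@ ;
DATALINES ;
3 4 5 5 5 6 4 5 7 5 2 6 5 4 3 5 8
;
RUN ;
* The frequency distribution we drew on the board ;
PROC FREQ DATA = class ;
TABLES household_size ;
RUN ;
* Mean, median and mode for the same variable ;
PROC UNIVARIATE DATA = class ;
VAR household_size ;
RUN ;
With these 17 made-up values, the mode and the median both come out to 5, just as in the classroom example.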
OPEN DATA TO INTRODUCE STATISTICS
THE CENSUS BUREAU OFFERS AN ANSWER KEY FOR REAL-LIFE, OR DO TRY THIS AT HOME
Now that we have an understanding of distributions and measures of central tendency, let's use open data to pursue this a little
more in-depth. "Open data" is data freely available to anyone to use or publish. I am a big proponent of the use of open data, and all
of the examples presented here use these data. There are many advantages to using open data, principal among them that it is
free, as in free beer, and exists in a dazzling diversity of sizes, topics and formats. No matter what SAS statistic or technique you want
to learn to use, there is an open data set on the Internet you can download and use for it.
It can be disconcerting for new graduates performing statistical analyses to realize there is no one to tell them if their results are
correct. I still remember the meeting with my graduate advisor shortly before I sent off my first article for publication in a scientific
journal. I asked him if the results section was correct. He looked at me over the top of his glasses and said, “Well, I certainly hope so.”
Then he added, “Young lady, there’s no answer key for life.”
That was before open data. An often overlooked advantage for anyone just beginning to use SAS for large data sets, or
statistics, is that there may be published statistics for at least some of your analyses to check your results against to see if you are on
the right track. For example, the U.S. Census Bureau publishes results for some selected variables on-line for you to check your
results. Go to the PUMS documentation website http://www.census.gov/acs/www/data_documentation/pums_documentation/
Click the + next to “Help with Using PUMS” to expand this category.
Click on the link in the sentence "Data users who have doubts about the way they are computing estimates should attempt to
reproduce the estimates that are provided in the Verification Files available in PUMS documentation" and you'll see the options for
user verification.
Click the LST option (the second one) for the year that you are using and a page will pop up that tells you the correct estimates
for the U.S. and each state. Voila! Answer key for real life. Your results should match exactly with what is in that first column.
For example, I selected the 2009 Public Use Microdata Set (PUMS). In the LST file under estimates for the United States, I see this:
State of current residence=00
State=United States

Characteristic            2009 PUMS Estimate    2009 PUMS SE    2009 PUMS MOE
Total population                 307,006,556             289              476
Total males (SEX=1)              151,373,350           19915            32760
Total females (SEX=2)            155,633,206           19963            32839
When I ran my SAS program, I compared the results I obtained with the estimates for total population, males and females. As you
can see in the example below, I did match the answer key.
EXAMPLE 1: USING SAS AND THE AMERICAN COMMUNITY SURVEY TO TEACH ABOUT RACE
This example uses the 2009 American Community Survey Public Use Microdata Sample (U.S. Census, 2009). The Public
Use Microdata Sample (PUMS) contains a sample of actual responses to the American Community Survey (ACS). There are
3,030,728 records in the data set. For those of you who are extremely selfish and hate children, you are now beginning to see that,
even for you, there are advantages to analyzing open data in that it provides opportunities to show your skills using large amounts of
data.
Step 0: Verify your data and use the correct weight
Extensive detail on how to read data into SAS, verify your data quality and prepare your data for analysis is given in two other
papers (De Mars, 2011a; De Mars, 2011b), so we'll assume here that you are beginning with a nice SAS data set with no data
problems. This step will produce the exact same estimates as the Census.
LIBNAME lib “C:\Users\AnnMaria\Documents\2009Pums\sasdata” ;
PROC FREQ DATA = lib.pums9 ;
TABLES sex / OUT = testfreq ;
WEIGHT pwgtp ;
The PROC FREQ statement will invoke the procedure to create a frequency distribution.
The TABLES statement specifies the variables for which you want frequencies. The OUT = option will output the count and percent
for each level to a data set.
The WEIGHT statement specifies the weight given to each observation and it is extremely important. If your counts come out to
be wildly incorrect it is almost certain that you left out this statement.
The code above will give you the extremely ugly output below:
The FREQ Procedure

                                          Sex
                                        Cumulative    Cumulative
     SEX    Frequency     Percent        Frequency       Percent
     -------------------------------------------------------------
       1    1.5137E8        49.31         1.5137E8         49.31
       2    1.5563E8        50.69         3.0701E8        100.00
Scientific notation, as we all learned in some class that is now a distant memory, is of the form a * 10^b and is used to represent
either very large or very small numbers. Because computers and calculators had difficulty with superscripts (remember, PROC FREQ
dates back to the days of line printers), the notation aEb has been used to stand in for "a times 10 to the power of b".
So, 1.5137E8 is equal to 1.5137 * 10^8, or 151,370,000.
Now, this is very close to 151,373,350 but I want to be precise. This is where I am glad I saved the output to a data set. I can go
to the explorer window, open my SAS data set and see this:
My estimates of the population distribution for gender match exactly, right down to the person.
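If you would rather not open the data set in the explorer window, a short PROC PRINT with a comma format on the count is one way to see the exact figures instead of scientific notation. This sketch assumes the testfreq data set created by the OUT = option above; the choice of formats is only for illustration.
PROC PRINT DATA = testfreq NOOBS ;
FORMAT count COMMA15. percent 8.2 ;
RUN ;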
STEP 1: Frequency Distribution by Race
In 2009, people were allowed to check more than one race on their census forms. Now here is where I very strongly believe we
teach statistics wrong, as just a set of facts and figures. Statistical analysis as most adults do it in their work is done to answer
questions. We very, very seldom in classrooms begin discussions of statistics with, “What do you think?”
Here are our four questions:
What percentage of the population considers their race to be “Black” ?
What percentage of the population considers their race to be “White”?
What percentage of the population considers their race to be neither black nor white?
What percentage of the population considers their race to be both?
Don’t answer these questions. Ask the students their opinions. Allow (civil) arguments to break out. What you want to do is build
drama. I learned this from a book by actor Alan Alda (2008) where he gave the example of building drama when walking across the
stage holding a glass of water. On the second pass across the stage, he filled the glass to the top and told the student, “If you spill
even one drop of this water, every person in your village will be killed.” His point was that not knowing what will happen adds drama to
a situation.
We have asked the students about an emotionally charged issue, race, had them go out on a limb a little bit to make guesses
about it, even argue with their peers whose guess is right. After letting them debate for a while (but before name-calling starts,
hopefully), the students are asked, “Well, do you want to know what I found when I analyzed the census data?”
WE PAUSE BRIEFLY TO TALK ABOUT WEIGHTS, BECAUSE THEY ARE VERY IMPORTANT
The first table is shown below. It can be seen that those who are neither black nor white constitute 10.4% of the total population.
People who consider themselves white, and no other race, are 76.3% of the population, 12.6% are black and less than 1% consider
themselves to be both races.
Table 1
Frequency Distribution by Race, Weighted

Race includes Black    Race includes White    2009 Population    Percent of Population
No                     Yes                        234,175,873                     76.3
Yes                    No                          38,805,561                     12.6
No                     No                          31,876,214                     10.4
Yes                    Yes                          2,148,908                      0.7
The SAS code to produce this table was discussed in a previous paper (De Mars, 2011b) on making better-looking results, so here I only
want to mention the PROC FREQ and, specifically, the WEIGHT statement.
PROC FREQ DATA = lib.pums9 ;
TABLES racblk* racwht / OUT = lib.blkwhitmix ;
WEIGHT pwgtp ;
The percentages above are correct because I used the correct weights. What does it mean to "weight a
sample"? What's a sample, anyway? A population is everyone you are interested in, in this case, everyone in the United States. A
sample is a part of the population; in this case, about 1% of the population was sampled for this survey, that is, 3,030,728 out of
307,006,556, or 1 out of 101, to be precise. Look at the results when we reproduce the table without the WEIGHT statement.
Table 2
Frequency Distribution by Race, Unweighted

Race includes Black    Race includes White    2009 Population    Percent of Population
No                     Yes                          2,415,930                     79.7
Yes                    No                             313,411                     10.3
No                     No                             283,380                      9.4
Yes                    Yes                             18,007                      0.6
Obviously the population numbers are wildly off. You might think it would just be a simple matter of multiplying every number by
101. That would work to give us the correct total. However, compare the percentages in the first two tables. The percentage of the
population that checks White for race, and not Black, is higher than in the previous table. Every other group is lower.
Let’s take a simple example to show the importance of not just weighting but correct weights. Let’s just pretend for the moment that
America is 80% white, 10% black and 10% other and that we have 300,000,000 people in America. As you can see from the tables
above those figures are not too far off. We collect a sample of 1,000,000 people. If we have a representative sample it will be
800,000 white + 100,000 black + 100,000 other
What if that doesn't happen? It usually doesn't. Usually, we get instead something like this:
800,000 white + 50,000 black + 150,000 other
In that case, every white person would have a weight of 300, every black person would have a weight of 600 and every “other” person
would have a weight of 200.
Table 3
Example of Weights for Hypothetical Sample

Race       Sample        Weight    Population      Percent
White        800,000        300    240,000,000         80%
Black         50,000        600     30,000,000         10%
Other        150,000        200     30,000,000         10%
TOTAL      1,000,000       ----    300,000,000        100%
Why 600? The weight for the whole population was 300. Because black people were only half as likely to answer, we need to multiply
their weight by two. Why 200? Because the "other" group was 1.5 times as likely to answer, you need to multiply the weight by
2/3. In any decently designed and disseminated survey you won't need to calculate the weights; that will already have been done and
you just need to know which is the weight variable.
As long as you use the correct weight, you will get the right answer. On the other, more depressing hand, if you don't use the weight
variable, you will get the wrong answer. Does this mean that you personally need to figure out the weights? No, thank God. Any survey
where this is important should have the weight variable already included. All you need to know is which variable in your data set is the
weight variable, which you can almost always find out by doing a PROC CONTENTS. If all else fails, you can read the codebook or
other documentation for the survey. Then just include that variable on your WEIGHT statement every time you do any kind of
statistical analysis with SAS.
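For the hypothetical sample in Table 3, the weight is simply the population count for a group divided by the sample count for that group. The short DATA step below is only a sketch of that arithmetic, with made-up variable names; as noted above, in any real survey the weights will already be on the file.
* Hypothetical counts from Table 3 ;
DATA weights ;
INPUT race $ sample_n population_n ;
weight = population_n / sample_n ; * 300 for White, 600 for Black, 200 for Other ;
DATALINES ;
White 800000 240000000
Black 50000 30000000
Other 150000 30000000
;
RUN ;
PROC PRINT DATA = weights NOOBS ;
RUN ;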
THE MODE AND CATEGORICAL VARIABLES
The mode is one measure of central tendency, that is the “center” or average of a distribution. It is easy to see that the mode is
“white”. The mode, the most common score, is the only measure of “average” that makes sense when you are using categorical
variables. The distinction between categorical and numeric variables is an important one in statistics. A categorical variable is one that
differs only in quality, not quantity. A person can’t have “more” or “less” race. You can’t make ratio comparisons and say, “Juan is
twice as much ‘other’ as Tanisha.”
(For the SAS code to produce this table see the previous paper (De Mars, 2011b) .)
If we were going to use this graph and talk about the “average American” , we would say the average American is white.
EXAMPLE 2: USING SAS AND THE AMERICAN COMMUNITY SURVEY TO TEACH ABOUT DISTRIBUTIONS
Answer this question and see if you did as well as the average urban middle school student. Out of 192 countries rated by the
CIA Fact Book in income equality, where 1 = the most equal country in the world and 192 = the least equal country, where does the
United States rank? Are you ready with your number? Below is our distribution of household income in the United States. This is a
classic skewed distribution. When you have a skewed distribution, whether of income or anything else, you have most of the
population at one end and then a very long “tail” going off in the other direction. This describes income distribution in the U.S.
perfectly.
Below is a graph of income distribution with all of the population making over $500,000 a year lumped together. It has the
advantage that you can read the numbers, but the disadvantage of masking how skewed household income really is.
The answer to the question is 92. The United States is smack in the middle in terms of income equality, or inequality, depending
on how you want to look at it. (This is the point where, with students, we could try different ways of grouping the data, such as rounding
to the nearest $50,000 or lumping all of the people making over $250,000 together, and look at how it affected the picture of our
distribution.) Our SAS code for the charts above is shown below.
SAS CODE FOR FREQUENCY DISTRIBUTIONS FROM THE AMERICAN COMMUNITY SURVEY
PROC FREQ DATA = hous.hus9 ;
TABLES hincp / OUT = meddist;
WHERE ten > '0';
WEIGHT wgtp ;
ODS GRAPHICS ON ;
DATA graphinc ;
SET meddist ;
INCOME = ROUND(hincp,10000) ;
IF hincp < 500000 THEN
Household_Income = income ;
ELSE Household_Income = 500000 ;
PROC FREQ DATA = graphinc ;
TABLES Income Household_Income ;
WEIGHT COUNT ;
The benefits of creating an output dataset and using it for analysis were discussed in an earlier paper in this series (De Mars,
2011a). The first PROC FREQ step outputs the distribution of household income to a dataset named meddist. The WHERE statement
selects only households where the variable ten (tenure, that is, whether the unit is owner- or renter-occupied) is greater than 0, which
limits the analysis to occupied housing units. The WEIGHT statement, which we'll discuss more below, applies the appropriate weight.
The ODS GRAPHICS ON statement will produce graphics for statistical procedures.
In the DATA step, the variable income is created simply by using the ROUND function to round household income to the nearest
$10,000. Household_Income is the same as the income variable, except that all households making $500,000 a year or more are
lumped into a single category.
That's it. There are no additional steps required to produce the graphs. Once you have ODS GRAPHICS ON, the graphics are
produced automatically for the procedures that follow until you turn graphics off.
MEASURES OF CENTRAL TENDENCY, OR “WHAT IS AVERAGE?”
There are three measures of central tendency, with the median and mean being the two most commonly used. The mean is
generally preferred because it takes into account every score in the distribution. As you probably remember from some math class or
other, to get the mean, you add up all of the numbers in a sample and divide by the number of people in the sample. In mathematical
terms this is:
∑ Xi / N
The median, on the other hand, is the midpoint of a distribution, the one that half of the people fall above and half fall below.
When you have a skewed distribution, the median is preferred as a measure of central tendency. To understand why, think of
this example. You have 21 people in a room; 20 of them are unemployed and have $0 in income. The twenty-first just had an IPO for his
technology company and earned $21,000,000 this year. The mean income of those 21 people is $1,000,000. Hurray! Unemployment
problem solved! Of course, in this case, the mean is thrown off by one person who is very extreme. A more accurate representation of
the whole group of twenty-one people would be the median, which in this case is $0. Whenever you have a distribution with some very
extreme scores (referred to as outliers) it is a better choice to use the median than the mean.
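Before we move to real data, here is a tiny sketch of that hypothetical room in SAS: twenty incomes of $0 and one of $21,000,000. PROC MEANS reports both statistics, and you should see a mean of $1,000,000 and a median of $0.
* Twenty people with zero income and one person with a 21 million dollar year ;
DATA room ;
income = 0 ;
DO person = 1 TO 20 ;
OUTPUT ;
END ;
income = 21000000 ;
person = 21 ;
OUTPUT ;
RUN ;
PROC MEANS DATA = room MEAN MEDIAN ;
VAR income ;
RUN ;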
How do we get the mean and median with SAS? There are several procedures you can use: PROC UNIVARIATE, PROC
TABULATE, PROC MEANS or PROC SURVEYMEANS, to name just a few. Let's get some actual data and try these procedures.
The results below were produced with PROC UNIVARIATE. It actually produced three tables but since we are discussing basic
statistics we’re going to look at only two. Table 4, of Weighted Basic Statistical measures, shown below, gives the mean, median, and
mode. The mean is $69,000 and the median is $50,000. What does that tell you? Think back to our example with the 21 people and
one person with $21,000,000. That one person, referred to as an outlier, really pulled up the mean of our distribution. This is exactly
what we have happening with income in the United States.
Table 4
Weighted Basic Statistical Measures

Location                       Variability
Mean        69025.63           Std Deviation              72846
Median      50000.00           Variance              5306585337
Mode            0.00           Range                    1777400
                               Interquartile Range        63200
The median household income, which is the combined income of everyone in the home, is $50,000. That is, half the
households have more income than that each year and half have less. The mean income, which we would get if we added up every
household's income and divided by the number of households, is $69,025. The mode, that is, the most common income, is $0.
MEASURES OF VARIABILITY
The next table shows the maximum, minimum and selected percentiles of the distribution. The maximum household income for
our sample was $1,749,000. That's the most any household reported.
Table 5
Weighted Quantiles

Quantile        Estimate
100% Max         1749000
99%               385000
95%               186500
90%               139400
75% Q3             88300
50% Median         50000
25% Q1             25100
10%                12000
5%                  7600
1%                     0
0% Min            -28400

The 99th percentile does not mean you got 99 percent correct on a test. It means that you are higher than 99% of the population, or,
another way to put it, that you are in the top 1%.
1% of households in America receive $385,000 a year or more. The top 5% have $186,500 a year, or more. If your income is more
than 50% of households in America, then you are making $50,000 a year. The fact that both the numbers are 50 is just a coincidence.
The fact that it is the same as the median is no coincidence. The median and the 50th percentile are the exact same thing. The
bottom 1% of households have an income of $0 and the very lowest income in the sample is -$28,400. That is actually accurate. You
can have a negative household income, for example, if you own a business and your business loses money that year. The range is
the difference between the minimum and maximum and if you subtract -$28,400 from $1,749,000 you get exactly our range of
$1,777,400. All of this combines with the graphs we saw above to support our conclusion that income in America is quite skewed if
you can take a sample of 1% of the population and get a range of $1.8 million from the highest to the lowest.
Going back to Table 4, let's discuss the standard deviation, which is the average amount by which people differ
from the mean. The standard deviation is $72,846. That's a pretty large number. How can that be correct? The variance is
$5,306,585,337, which is a national-debt-size number. How can that be? Well, because the formula for the variance is:
∑ (Xi - X̄)² / (n - 1)
In plain English: ∑, that sideways W thing, is the Greek letter sigma, the Greek S, and means "sum of" to mathematician types
everywhere. Xi denotes each individual's score, so Xi is the score for the ith individual: X1 is for the first person, X2 for the
second person and so on. The X with the bar over it (X̄) is the symbol for the mean, which in this case is $69,025.63. N stands for the
number of people in our sample. Because we have the sample, and not the actual population, we need to divide by (n - 1). Truly,
whether we divide by 3,030,727 or 3,030,728 is not going to make the slightest bit of difference, but we statisticians like to be precise
about things. So, in English, the formula for the variance is this:
“Take the sum of the squared differences from the mean and divide by the sample size minus one.”
When you have very large differences from the mean, say $1,749,000 - $69,026, and you square these, you get very large
squared differences. One million squared is one trillion.
The problem everyone has with the variance is exactly that: it is squared, so it's not on the same scale as the mean income.
After all, we're interested in how much the average person's income differs from the mean, not how much the squared difference
from the mean is. So, we take the square root of the variance, and that gives us the standard deviation; its formula is simply the square
root of the variance formula above. The square root of $5,306,585,337, by the way, is $72,846 which, not coincidentally, is the exact
value shown for the standard deviation in Table 4.
All of this, the standard deviation, which says the “average” difference from the mean is large, the skewed picture we saw in the
histogram, the difference between the mean and the median, all of this comes together to point to a clear picture. We have a very
unequal distribution of income in America.
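If you would like to see the formula at work rather than take PROC UNIVARIATE's word for it, the sketch below computes the sum of squared differences from the mean and divides by (n - 1) for a small, made-up set of five incomes, then asks PROC MEANS for the same statistics. The data set and variable names here are hypothetical, not the ACS data.
* A small, made-up sample of five incomes ;
DATA tiny ;
INPUT income @@ ;
DATALINES ;
10000 20000 30000 40000 150000
;
RUN ;
* Get the mean and n into a one-record data set ;
PROC MEANS DATA = tiny NOPRINT ;
VAR income ;
OUTPUT OUT = stats MEAN = xbar N = n ;
RUN ;
* Sum the squared differences from the mean, divide by (n - 1), take the square root ;
DATA _NULL_ ;
IF _N_ = 1 THEN SET stats ;
SET tiny END = last ;
ss + (income - xbar)**2 ;
IF last THEN DO ;
variance = ss / (n - 1) ;
std_dev = SQRT(variance) ;
PUT variance= std_dev= ;
END ;
RUN ;
* The same statistics straight from PROC MEANS, for comparison ;
PROC MEANS DATA = tiny N MEAN VAR STD ;
VAR income ;
RUN ;
Both approaches should agree; for these five values, the variance is 3,250,000,000 and the standard deviation is about $57,009.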
SAS CODE FOR PROC UNIVARIATE FROM THE AMERICAN COMMUNITY SURVEY
PROC UNIVARIATE DATA = hous.psam09 VARDEF = WGT ;
VAR hincp ;
WEIGHT wgtp ;
The PROC UNIVARIATE statement calls the univariate procedure, which produces, you guessed it, univariate statistics. The DATA =
option specifies the data set. The VARDEF = WGT option is very important in this case. This specifies that SAS will use the sum of
weights for the denominator of the variance. If you leave it off, you will get the correct mean but your variance, and the standard
deviation, which is the square root of the variance, will be wrong.
The VAR statement specifies the variables for which we want univariate statistics. In this case, there is just one, household income.
The WEIGHT statement gives the weight variable. If you leave it off, your mean, median and percentiles will almost certainly be
wrong.
EXAMPLE 3: USING SAS AND THE AMERICAN COMMUNITY SURVEY TO TEACH BI-VARIATE ANALYSES
MEASURES OF CENTRAL TENDENCY OF INCOME BY RACE, USING GRAPH-N-GO
Up to this point we have been discussing univariate analyses, that is, analysis of one variable at a time, looking at household
income or race. Let's move on now to bi-variate analyses, which simply look at two variables at a time. We saw, in our first
example, that there are differences by race in the likelihood of being in the sample, which is why we needed the weights. We saw that
there is an unequal distribution of income, which is very skewed. Do you suppose there is a difference by race?
To answer the question of whether race matters in income, I used another open data set, the American Community Survey
data for the state of California. This graph shows the mean income by race. Although the census data says that Hispanics can be of
any race, I noticed when I analyzed the data for California that many more people than in the nation as a whole put “Other” for their
race and almost all of those people were Hispanic. Although the Census Bureau does not consider Hispanic to be a race, many
Hispanics clearly did. So, I broke the data down into the five largest groups in California: Asian, Black, Hispanic, Other and White.
Then, I looked at the mean income for each group. This is not household income, but rather personal income. As you can see, the
answer to the question, “Does race matter?” is clearly, “Yes.”
We just discussed the fact that the median is a better measure than the mean for skewed distributions. Maybe there are some really, really
rich white people in California who are pulling up the mean. To test whether this was the case, I re-ran the analysis, but this time I selected
the median personal income by race. This gave me the following chart. As you can clearly see, there are still very large differences by
race, although everyone's income is lower.
PRODUCING THE CHARTS OF INCOME BY RACE
First of all, you should note that Graph-N-Go hates large data sets. I tried opening a data set with several hundred thousand records
and it ended up crashing. To prevent this from happening, first create a dataset using PROC SUMMARY for means and median.
Because Graph-N-Go uses whatever formats and labels are stored with the data set, use a DATA step to define these.
PROC SORT DATA = lib.california ;
BY RACE ;
PROC SUMMARY DATA = lib.california MEAN MEDIAN ;
VAR income ;
BY race ;
OUTPUT OUT = examp MEDIAN = median_income MEAN = average_income ;
WEIGHT pwgtp ;
DATA examp ;
SET examp ;
LABEL Average_Income = "Income"
Median_income = "Median Income" ;
FORMAT average_income median_income DOLLAR8.0 ;
The statements above will create a data set with the variables median_income and average_income, with one record for each race.
The SORT step sorts the data set by race; it must be sorted or the next step will give you an error.
The PROC SUMMARY step is identical to PROC MEANS except that the default is not to produce printed output. The MEAN and
MEDIAN options in the PROC SUMMARY statement request that these two statistics be calculated.
The VAR statement specifies the variables for which you want these statistics, in this case it’s only income.
The BY statement requests the statistics by the variable(s) specified; in this case, that is race.
The OUTPUT statement names the output data set , the statistics to be written out to the data set and names for those statistics. Pay
attention here. You’d think just because you specified MEDIAN, for example, in the PROC SUMMARY statement that, obviously,
median should be written out to the data set. You would be wrong.
The DATA step applies the formats and labels I want used for the chart.
Once the data set is created, go to Graph-N-GO, in the SOLUTIONS menu under REPORTING.
To create your chart, drag the BAR GRAPH icon over to the graphing window, right-click on the empty box and select PROPERTIES.
Select the DATA MODEL to use (this is the output data set you created above, in this case work.examp), select the CATEGORY, in
our case we want race. Select the RESPONSE variable, which is income. Select the statistic.
For click-by-click directions on how to use Graph-N-Go see De Mars (2010).
The selection shown in the figure above produced our first graph, of mean income.
There is one little catch here. Graph-N-Go doesn’t have the choice for a “Median” statistic. How did we get the second graph, of
median income? Well, remember that the median statistic was actually created in our PROC SUMMARY step above. In fact, there is
only one record for each race. So, when I select “AVERAGE”, it is really just going to show the value of that one number. I could have
selected the SUM statistic and it still would have given me the same number.
EXAMPLE 4: USING SAS, JMP AND THE AMERICAN COMMUNITY SURVEY TO STUDY INCOME & RACE
Everyone’s income is much lower when I used the median. This brings up a really, really important point in statistics. You should
always know who your population is. Why is the median income so low? In this graph, I have included everyone in the state and
compared the incomes by race. Should I have included everyone? What about people under age 16 or over age 65? They won’t be
working, will they? It is a fact that Hispanics are significantly younger than the non-Hispanic population, African-Americans are
significantly older than the white population. Could the differences in income be due to differences in age?
To answer this question, I used JMP to create a chart of income by age for race.
So, the answer to the question, "Does race matter?" is "Yes." The answer to the question, "Can this be due to age?" is "Not entirely,"
because when you control for age, the differences still persist.
You can see that the curves for each race are somewhat similar. Before 16, no one is making any money. From age 16 to 30 -50
(depending on the race), income goes up. Then, around age 60, income starts to drop as people retire. You can see that for whites
and Asians the curve goes up more steeply than for the other three groups. Also, you can see that even if you control for age, the
incomes for whites and Asians are higher.
What else could explain the difference in income? Yes, racism is one answer, but are there others? To answer this question, we take
a look at one more graph, also created with JMP. This is the same as our previous chart, but this time we're going to look at mean income
by education. What we can clearly see is that much, although not all, of the difference in income disappears when you control for
education. Once they have an MD, law degree or Ph.D., African-Americans and Asian-Americans make about the same amount.
Non-Hispanic whites still make more than other groups, at every level of education, but the differences are greatly reduced.
What about that little drop at the end, which shows that for all racial/ethnic groups except Hispanics, Ph.D.s make less than
those with MDs and law degrees? My suspicion is that there are a few people making millions who are impacting the mean.
(Remember our discussion about means and medians?)
The chart below shows median income by education and I can see that there is a straight-line relationship. The more education you
get, the more money you make.
BI-VARIATE GRAPHS BY RACE USING JMP
Graphics with Graph-N-Go are easy but distinctly limited. SAS offers several other options for graphics. One that combines ease and
flexibility is JMP. That’s the good news. The bad news is, if you want to do much programming, you have to learn a whole new
language called JSL. OR ... you could use SAS to create your data set to analyze, as in all of the other examples above.
PROC SUMMARY DATA = lib.california MEAN MEDIAN ;
VAR income ;
BY race age ;
OUTPUT OUT = incomerace MEDIAN = median_income MEAN = average_income ;
WHERE age > 15 ;
All that we have added to the previous example is the age variable on the BY statement (which means the data set must be sorted by
race and age) and the WHERE statement. To export a file as a JMP data set, simply select EXPORT DATA from the FILE menu, point
and click through the menus to select the data set to export (in this case, incomerace) and then select JMP file from the drop-down
menu as the type of data set to export.
If your organization does not have JMP licensed, two other options for examining statistics graphically are SAS Enterprise Guide and
SAS/GRAPH. Of the two, SAS/GRAPH is far more flexible but SAS Enterprise Guide has a much gentler learning curve.
EXAMPLE 5: TESTS OF SIGNIFICANT DIFFERENCE USING OPEN DATA ON OLYMPIC SPORTS PROGRAM
So far, we have been using “descriptive statistics”. As the name implies, descriptive statistics simply describe what we
observed. When we start to make inferences about the population as a whole, we are moving into the realm of “inferential statistics”.
A main focus of inferential statistics is the determination of whether or not a result is significant.
What significance means to a statistician is not "very important". In any two groups, it is not at all unexpected to have some
difference occur completely randomly. Say that in the population, the means are exactly equal in two groups. Still, on any given day,
some of the females would have been sick, would not have turned in their paperwork in time to be allowed to compete, would not have had
money to travel to the event or, for hundreds of other possible reasons, missed the competition. Males, too, would get sick, lose
paperwork, not have money and so on. The result is that samples from two groups are rarely exactly equal, even when, if we had
everyone in the population, we'd find that the group means are equal. What's a poor statistician to do? The answer is that we
calculate a test statistic and then find the probability of getting a statistic that large if the true difference in the population is zero.
Many of the most common statistics, also called parametric statistics, make an assumption of a normal distribution. That is, that the
distribution is not terribly skewed. As we have already seen, at length, income doesn’t fit the assumption of normality at all. There are
ways to get around this assumption, but the easiest, since we are just learning here, is to select a data set that does fit the
assumption.
The Census Bureau is not the only source for open data, nor are all open data sets extremely large. Many smaller non-profits are
eager to have data analyzed to answer questions of interest to them. This next data set is of athletes competing in judo in the U.S.
national championships from 1990 - 2011.
Below we take a look at the frequency distribution of the number of male athletes competing during this period and what we see is a
very nice, normal distribution. One characteristic of a normal distribution is that the mean = the median = the mode. As you can see in
the graph below, the mode occurs right around 220, the mean is 222, and the median falls in that same interval around 220
competitors.
In a normal distribution, 95% of the population will fall within two standard deviations of the mean. Also, normal distributions
are symmetrical, with observations occurring above the mean and below the mean with equal frequency.
SAS CODE FOR THE HISTOGRAM
Producing the histogram above was a piece of cake. I simply used these statements:
ODS GRAPHICS ON ;
PROC FREQ data = athletes ;
TABLES competitors ;
WHERE sex = "Male" ;
If you have yet to try ODS statistical graphics, I highly recommend you give it a look. All you need to do is include the statement ODS
GRAPHICS ON , before your statistical procedures and SAS will automatically produce the most commonly requested graphics, with
no additional programming required.
T-TEST PROCEDURE FOR TESTING FOR DIFFERENCE BETWEEN MEANS
The first question of interest to our non-profit organization is whether there is a significant difference between the numbers of
competitors in the male and female divisions each year. To test for the difference in means between two groups, we compute a t-test.
A t-test will give four tables of results. The first one is shown below.
Table 6
First PROC TTEST Table

sex           N     Mean       Std Dev    Std Err    Minimum    Maximum
Female        22    97.0909    15.8201     3.3729    66.0000      119.0
Male          22    222.8      45.2253     9.6421      146.0      324.0
Diff (1-2)          -125.7     33.8792    10.2150
There were 22 records for males and 22 for females. The mean number of competitors each year was 97 for females, with a standard
deviation of 15.8 and a range from 66 to 119. For males, the mean number of competitors was almost 223 per year, with a standard
deviation of 45 and a range from 146 to 324. What exactly does a standard deviation mean? A standard deviation is the average amount
by which observations differ from the mean. So, if you pulled out a year at random, you wouldn't expect it to necessarily have exactly
222.8 male competitors. In fact, it would be pretty tough on that last guy who was the .8! On the other hand, you would be surprised if
that year there were only 150 competitors, or if there were 320. On the average, a year will be within 45 competitors of the 223 male
athletes and 95% of the years will be within two standard deviations, or from 133 to 313. That is, 223 - (2 x 45) to 223 + (2 x 45).
WHAT IS THE STANDARD ERROR AND WHAT DETERMINES IT?
But what is the standard error? The standard error is the average amount by which we can expect our sample mean to differ from the
population mean. If we take a different sample of years, say 1988-2009, 1991-2012, all odd-numbered years for the last 30 years
and so on, each time we'll get a different mean. It won't be exactly 97.09 for women and 222.8 for men. Each time, there will be
some error in estimating the true population value. Sometimes we'll have a higher value than the real mean. Sometimes we'll
underestimate it. On the average, our error in estimate will be 9.64 for men, 3.37 for women. Why is the standard error for women
lower? Because the standard deviation is lower.
The standard error of the mean is the standard deviation divided by the square root of N, where N is your sample size. The square
root of 22 is 4.69. If you divide 15.82 by 4.69, you get 3.37. Why the N matters seems somewhat obvious. If you had sampled several
hundred thousand tournaments, assuming you did an unbiased sample, you would expect to get a mean pretty close to the true
population. If you sampled two tournaments, you wouldn’t be surprised if your mean was pretty far off. We all know this. We walk
around with a fairly intuitive understanding of error. If a teacher gives a final exam with only one or two questions, students complain,
and rightly so. With such a small sample of items, it’s likely that there is a large amount of error in the teacher’s estimate of the true
mean number of items the student could answer correctly. If we hear a survey found that children of mothers who ate tofu during
pregnancy scored .5 points higher on a standard mathematics achievement test, and then find out that this was based on a sample of
only ten people, we are skeptical about the results.
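To make that arithmetic concrete, here is a quick sketch that recomputes the standard errors in Table 6 from the standard deviations and the sample size reported above; the numbers are simply the ones from the table.
* Standard error = standard deviation divided by the square root of N ;
DATA _NULL_ ;
se_female = 15.8201 / SQRT(22) ; * about 3.37, as in Table 6 ;
se_male = 45.2253 / SQRT(22) ; * about 9.64, as in Table 6 ;
PUT se_female= 8.4 se_male= 8.4 ;
RUN ;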
What about the standard deviation? Why does that matter? The smaller the variation in the population, the smaller error there is going
to be in our estimate of the means. Let’s go back to our sample of mothers eating tofu during pregnancy. Let’s say that we found that
children of those mothers had .5 more heads. So, the average child is born with one head, but half of these ten mothers had babies
with two heads, bringing their mean number of heads to 1.5. I’ll bet if that was a true study, it would be enough for you never to eat
tofu again. There is very, very little variation in the number of heads per baby, so even with a very small N, you’d expect a small
standard error in estimating the mean.
The second table produced by the TTEST procedure is shown in Table 7 below. Here we have an application of our standard error.
We see that the mean for females is 97, with 95% Confidence Limits (CL) from 90.07 to 104.1. That 95% confidence interval runs from
roughly the mean minus two times the standard error to the mean plus two times the standard error. That is, 97.09 - (2 x 3.37) to 97.09 + (2 x 3.37).
Why does that sound familiar? Perhaps because it is exactly what we discussed on the previous page about a normal distribution?
Yes, errors follow a normal distribution. Errors in estimation should be equally likely to occur above the mean or below the mean. We
would not expect very large errors to occur very often. In fact, 95% of the time, our sample mean should be within two standard errors
of the mean.
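As a quick check on that arithmetic, the sketch below recomputes the confidence limits for the female mean using the values reported above. With 22 observations there are 21 degrees of freedom, and the exact multiplier from the TINV function is about 2.08 rather than exactly 2, which is why the limits in Table 7 below are 90.08 and 104.1 rather than the slightly narrower interval you get from plus or minus two standard errors.
DATA _NULL_ ;
mean = 97.0909 ;
se = 3.3729 ;
t = TINV(0.975, 21) ; * about 2.08 for 21 degrees of freedom ;
lower = mean - t * se ; * about 90.08 ;
upper = mean + t * se ; * about 104.1 ;
PUT t= 8.4 lower= 8.2 upper= 8.2 ;
RUN ;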
Table 7
Second PROC TTEST Table

sex           Method           Mean       95% CL Mean          Std Dev    95% CL Std Dev
Female                         97.0909    90.0766     104.1    15.8201    12.1713    22.6080
Male                           222.8      202.7       242.8    45.2253    34.7941    64.6299
Diff (1-2)    Pooled           -125.7     -146.3     -105.1    33.8792    27.9348    43.0609
Diff (1-2)    Satterthwaite    -125.7     -146.7     -104.7
The next two lines both say Diff (1-2) and both show the difference between the two means is -125.7. That is, if you subtract the
mean for the number of male competitors from the mean number of female competitors, you get negative 125.7. So, there is a
difference of 125.7 between the two means. Is that statistically significant? How often would a difference this large occur by chance?
To answer this question we look at the next table. It gives us two answers. The first method is used when the variances are equal. If
the variances are unequal, we would use the statistics shown on the second line. In this instance, both give us the same conclusion,
that is, the probability of finding a difference between means this large if the population values were equal is less than 1 in 10,000.
That is the value you see under Pr > |t|, the probability of a greater absolute value of t.
If you were writing this up in a report, you would say,
“There were, on the average 126 fewer female competitors each year than males.
This difference was statistically significant (t = -12.30, p <.0001).”
Table 8
Third PROC TTEST Table

Method           Variances    DF        t Value    Pr > |t|
Pooled           Equal        42         -12.30      <.0001
Satterthwaite    Unequal      26.064     -12.30      <.0001
In this case the t-values and probability are the same, but what if they are not? How do we know which of those two methods to use?
This is where our fourth, and final, table from the TTEST procedure comes into use. This is the test for equality of variances. The test
statistic in this case is the F value. We see the probability of a greater F is < .0001. This means that we would only get an F-value
larger than this 1 in 10,000 times if the variances were really equal in the population. Since the normal cut-off for statistical significance
is p < .05, and .0001 is a LOT less than .05, we would say that there is a statistically significant difference between the variances. That
is, they are unequal. We would use the second line in Table 8 above to make our decision about whether or not the differences in
means are statistically significant.
Table 9
Fourth PROC TTEST Table

Equality of Variances
Method      Num DF    Den DF    F Value    Pr > F
Folded F        21        21       8.17    <.0001
SAS CODE FOR THE TTEST PROCEDURE
PROC TTEST DATA = athletes ;
CLASS sex ;
VAR competitors ;
PROC TTEST requests the t-test be performed using data from the specified data set.
The CLASS statement gives the variable that identifies the two groups being compared. There must be two groups. No more, no less.
The VAR statement gives the variable that is being compared. There can be as many variables as you like, but they must be numeric,
since you are comparing the means.
EXAMPLE 6: TESTS OF LINEAR TREND USING OPEN DATA ON OLYMPIC SPORTS PROGRAM
There is one last question our non-profit organization would like answered. They have data showing that the difference between male and
female competition has been declining over time. However, they suspect that this is not due to an influx of female competitors as
much as a decline in male competitors. SAS offers several procedures to test for a linear trend, including PROC CORR, PROC GLM
and PROC REG. PROC GLM and PROC REG will always give identical results in terms of statistical significance and size of
relationship. PROC CORR is primarily used only to compare two variables and, in a case such as this one, will also give the identical
result as GLM and REG.
In this case, I used PROC REG. It provides several tables and charts, but there are only a few that are really needed to answer the
question. First, is there a linear trend? In a word, yes. In Table 10 we see the parameter estimates produced by the REG procedure.
Table 10
Parameter Estimates from PROC REG
Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    Standardized Estimate
Intercept     1                 10627        2072.01102       5.13      <.0001                        0
year          1              -5.20102           1.03574      -5.02      <.0001                 -0.74678
We are really interested in the variable year, and we can see that it has a significant relationship with the dependent variable, which is
the number of male competitors, with a probability of getting a greater t-value of less than .0001. The parameter estimate of -5.201
tells us that every year, we are predicted to have 5.2 fewer male competitors. Because the parameter estimate is negative, as the
year value goes up, the number of male competitors goes down. The standardized estimate, shown in the last column, tells us that for
every standard deviation increase in year, there is a .75 standard deviation drop in the number of competitors. The standard deviation
for year is 6.5, so every 6.5 years, there will be 33 fewer male competitors (.75 * the standard deviation of 45).
In this particular case, the standardized estimate isn’t really any more useful than our original parameter estimate, but that will not
always be so. Assume we had two variables, one is year and the other is dollars in prize money available to players winning
international medals. That could be from $0 to $40,000. If the parameter estimate for dollars is .052 and the estimate for year is -5.20,
does that mean that the trend by year is 100 times more important than the amount of prize money available? After all, the number of
competitors is going down 5.2 for every year and only going up .052 for every dollar in additional prize money. A reasonable person
might argue that this doesn't make sense: there is a lot more variation in the number of dollars than in the number of years, so you can't
compare the two directly. This is where the standardized estimate would be used.
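As a rough check, the standardized estimate can also be recovered from the unstandardized one by multiplying by the ratio of the standard deviation of the predictor to the standard deviation of the outcome. The sketch below uses the rounded values reported in this paper (about 6.5 for year and 45.2 for the number of male competitors); the small difference from the -0.74678 in Table 10 is just rounding.
DATA _NULL_ ;
b = -5.20102 ; * parameter estimate for year, from Table 10 ;
sd_year = 6.5 ; * approximate standard deviation of year ;
sd_competitors = 45.2253 ; * standard deviation of male competitors, from Table 6 ;
std_est = b * sd_year / sd_competitors ;
PUT std_est= 8.4 ; * about -0.75 ;
RUN ;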
PROC REG also produces several graphs. The fit plot, shown below, plots a linear trend. The blue bands are the 95% confidence
interval. We can see that several points fall outside of that interval, and the points above and below the blue bands are spaced at
roughly equal intervals. Generally, you would not see this many points falling outside the confidence bands. Even though the relationship is
statistically significant and the standardized estimate is quite high, there seems to be more of a cyclical trend. In discussing this with
the historian from the organization, he identified the high points as coming prior to Olympic years, when athletes are attempting to
qualify for the Olympic games, and the low points as coming just after the Olympics, when many athletes decide to retire from
competition. As for the high point in 1995, this was not only a pre-Olympic year, but the national championships were held in Hawaii
and "Everyone likes to go to Hawaii." This ability to spot cyclical trends and outliers is one benefit of ODS statistical graphics.
Combined with the statistical tables, graphics can give a fuller picture.
SAS CODE FOR THE REG PROCEDURE
ODS GRAPHICS ON ;
PROC REG DATA = athletes ;
WHERE sex = “Male”;
MODEL competitors = year / stb ;
CONCLUSION
Whether learning statistics for the first time, or presenting statistical data to a group who, at least initially, have no more interest in the
subject than the average hamster, the first exposure to statistics can be a challenge. This challenge is increased by the desire of most
learners to analyze "real data", to see real world applications of statistics. The good news is that some of this real world data actually
comes with "answer keys" that allow novices to check their results against published statistics. The better news is that, while learning
statistics, these data can be used to help and inform the community. Several examples have been given in this paper. With the
wealth of open data sources available, the only limit is the programmer’s time and creativity.
The even better news is that coding the procedures using SAS is usually the easiest part of the process by far. Use of graphics
options such as Graph-N-Go, ODS statistical graphics and JMP can provide a broader view to show that, with the latest software,
statistics are more than just numbers and to give a bigger, fuller picture - literally.
REFERENCES
Alda, A. (2007). Things I overheard while talking to myself. Random House.
Dehaene, S. (2011). The Number Sense: How the Mind Creates Mathematics. Oxford University Press.
De Mars, A. (2010). From Novice to Intermediate in (Approximately) Sixty Minutes: III. Presentation. Paper presented at the
annual meeting of Western Users of SAS Software. San Diego, CA.
De Mars, A. (2011a). SAS® Functions for a Better Functioning Community. Paper presented at the annual meeting of Western
Users of SAS Software. San Francisco, CA.
De Mars, A. (2011b). SAS Essentials II: Better-looking SAS for a better community. Paper presented at the annual meeting of the
Western Users of SAS Software.
Hersh, R. & John-Steiner, V. (2011). Loving and hating mathematics: Challenging the myths of mathematical life. Princeton, NJ :
Princeton University Press.
U.S. Census Bureau (2009). ACS 2009 1-Year PUMS File Readme I.) Overview of the Public Use Microdata Sample files (PUMS).
http://www.census.gov/acs/www/data_documentation/pums_documentation/
ACKNOWLEDGMENTS
Thank you to Kirby Posey of the U.S. Census Bureau for invaluable assistance in verifying the variable coding and estimates.
Thanks also to Jerry Hays, United States Judo Federation historian for provision of historical data on competitors in Olympic weight
divisions.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
AnnMaria De Mars
The Julia Group
2111 7th St. #8
Santa Monica, CA 90405
(310) 717-9089
[email protected]
http://www.thejuliagroup.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the
USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.