Download Demonstrations II: Measures of Central Tendency and Dispersion

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Time series wikipedia , lookup

Misuse of statistics wikipedia , lookup

Transcript
Demonstrations II
Dr. Scott Stevens
N.
O.
P.
Q.
R.
S.
Arithmetic mean (Average)
Median (and comparison with the mean)
Box plots (modified and unmodified), quartiles, and IQR
The mean, from grouped data
Comparison of measure of central tendency: mean, median, mode, midrange, midhinge
Standard deviation and variance
1
N. Arithmetic Mean (Average)
[For the mean from grouped data, see Q.]
Problem: During the 1980 presidential campaign, Ronald Reagan repeatedly asked voters if they
were better off in 1980 than they were 4 years before. Here are some data (in yellow) on the
unemployment and inflation rates during the Carter Administration (’77 to ’80) and the Reagan
Administration (’81 to ’88). Is there any difference between the average unemployment rate in the
two administrations? Average inflation rate?
This is quite easy to compute by hand. Assuming that by "average" the problem means "mean", we just
take the numbers to be averaged, add them, then divide by the number of numbers added. We can also do
this with the Excel =AVERAGE(range) function.
A
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Year
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
B
Unemployment
7.1
6.1
5.8
7.1
7.6
9.7
9.6
7.5
7.2
7
6.2
5.5
Carter Average
Reagan Average
6.525
7.5375
C
Inflation
6.7
9
13.3
12.5
8.9
3.8
9.6
7.5
7.2
7
6.2
5.5
10.375
6.9625
So the Carter unemployment average, for example, was computed as = AVERAGE(B2:B5), since the
unemployment figures for the Carter years were in spreadsheet cells B2 through B5.
The result show that unemployment was lower during the Carter years, but that inflation was considerably
lower during the Reagan years. The question that remains is whether the differences in these figures are
too large to be credited to random fluctuation alone. That question is one we take up later in the semester.
It’s important to note that the average, a single number, cannot tell the whole story of a data set. Be careful
what conclusions you draw from a “one number summary”!
2
O. Median (and comparison with the mean)
Problem: Below is a table of the rainfall recorded in the Los Angeles area in the last 10 years.
(Unlike most of these demonstrations, I’ve made up this data for this one.) Compute the median
rainfall based on these data. Without computing the mean, state whether the mean rainfall for this
ten year period would be above, below, or equal to this median value.
rainfall in a year
3
median rainfall
4
5
5
5
6
6
14
17
5.5
The first step is to sort the data, smallest to largest –I’ve already done this with the data. Then, in a set
of 10 observations, the median should be observation number (10+1)/2 = 5.5…that is, the number
halfway between observation #5 and observation #6. Since #5 is 5” and #6 is 6” in this case, the
median is 5.5”. You can also get this from Excel, by using the =MEDIAN(range) command.
Now, how does this compare to the mean?
If you imagine these bars sitting on a see-saw along the number line of the x-axis, the mean would be
the balance point of the see-saw. (Think of where you’d have to put the fulcrum (or balance point) of
the see-saw to make the “kids” balance. You should be able to see that it’s somewhere near where I
put it below, at 8.5. The three little kids on the right side at positions 14, 17, and 20 will just balance
the two small and two big kids on the left side (at positions 3, 4, 5, and 6).)
4
Big kid at the "5" mark
weighs 3 units
3
2
Little kid way out
at the "20" mark
weighs 1 unit
1
0
3
4
5
6
7
8
9 10 11 12 13 14 15 16 17 18 19 20
Balance point
3
20
On the other hand, the median point for a distribution is the point at which half of the "stuff" is to the
left and half of the "stuff" is to the right. If we stick with our see-saw imagery, "stuff" means "weight",
and I hope that you can see that 5 units of weight lie to the left of 5.5 (the median value), and 5 units of
weight lie to the right of 5.5
This way of thinking about mean and median is quite useful, so take some time to lock it down. What
you can learn from this example holds in general. Note, for example:

If the last kid slides from position 20 to position 25 on the see-saw, the balance point would
shift to the right (a higher mean), but the median ("half-weight") point would remain
unchanged. In general, the mean is more affected by extreme values than then median.

If the distribution has a long "tail" in one direction (like this one does to the right—it is
"skewed right"), then the mean tends to be more than the median. In general, if a distribution
is unimodal (one highest point) and skewed, then as you run up the long slope of the "hill",
you'll encounter the mean, median, and mode, in that order. The mode, of course, will be at
the top of the hill.

If the distribution is symmetric ("mirror imaged, left to right"), then the mean and median are
equal. If the symmetric distribution is also unimodal (one highest point), then the mode is
also equal to the mean and the median.
Don't memorize these results—understand them. And think about what the measures given in a given
problem really tell you. For example, suppose you’re told that the average number of people in a US
household is 2.4 persons. This is the mean (since it clearly isn't the median or mode!). It is the "balance
point" of the population distribution. Saying it another way—if all of the people in all of the households
were gathered together, and then distributed evenly among all of those households…well, we'd have a
bloody mess, since each household would get 2.4 people.
But that doesn't mean that 2.4 is "typical", or even close to "typical". For example, if 80% of all
households have 1 person, and the remaining 20% all had 8 persons, the average would be 2.4.
The mean is a measure of "middle", but we almost always need a measure of "spread", as well. The most
common is standard deviation.
P. Box Plots (Modified and Unmodified), Quartiles, and IQR
Problem: In the yellow box on the next page, find the annual inches of precipitation at the Los
Angeles Civic Center for the years 1961 to 1990. Summarize this data with a boxplot and modified
box plot.
It's not hard to do this work by hand, but the goal of this course is to give you tools that you can use
effectively and responsibly. To help you with this, I've written a number of spreadsheet templates that will
perform common statistical tasks, to supplement those functions already a part of Excel. Sometimes, my
templates duplicate functions already available in Excel. When I do this, it is for one of two reasons.
Either 1) Excel's built-in function is restrictive about how the input data must be supplied, or 2) Excel's
built-in function is not helpful in understanding how the answer is obtained. Since you must understand a
tool clearly in order to use it effectively, my template often provide a "step-by-step" approach.
I'll also be providing templates to perform some of the tasks that Excel can't do at all, or can only do
incompletely. That's what I've done for this example. The template (available at my website) is
Frequency Distribution, Histogram, and Box and Whisker Plot. Before you start using my templates,
though, there are some general things you should know about them. Please check the website post entitled
Using Stevens’ Statistical Templates: Useful Information. It uses this problem as an example.
4
Data
4.56
5.83
6.49
6.54
7.58
7.98
8.9
8.92
9.11
9.26
9.98
10.7
10.92
11.01
12.31
12.91
14.41
14.97
15.37
16.54
16.69
17
17.45
18
23.66
26.32
26.33
26.81
30.57
34.04
Outlier?
Statistics
mean, x-bar
smallest entry
largest entry
number of
observations
median
1st quartile
3rd quartile
IQR
lower whisker
upper whisker
suggested # of cats
category size
At least…
4
9
14
19
24
29
34
Values
14.70533333
4.56
34.04
Formulas
=AVERAGE(range)
=MIN(range)
=MAX(range)
30
=COUNT(range)
12.61
8.9675
17.3375
8.37
4.56
29.8925
=MEDIAN(range)
=QUARTILE(range, 1)
=QUARTILE(range, 3)
= 3rd quartile - 1st quartile
= 1st quartile - 1.5 * IQR or smallest val
= 3rd quartile + 1.5 * IQR or largest val
Frequency Table for Histogram
5
5
…but less than…
frequency
9
8
14
8
19
8
24
1
29
3
34
1
39
1
outlier
outlier
We compute the numbers needed for a modified box plot and unmodified box plot. Let's start with the
modified box plot. Note the commands that Excel uses to find median, 1st quartile, and 3rd quartile. The
interquartile range (IQR) is just the difference between the first and third quartile.
Formulas for first and third quartile. Different sources compute the quartiles slightly differently. Excel
computes the first quartile as observations number (n +3)/4 in the sorted list of n observations, while your
text uses observation number (n+1)/4. The median is observation (n+1)/2, as in your book. The third
quartile is computed in Excel as observation number (3n+1)/4, while your book uses observation number
(3n+3)/4. We’ll be happy with either of these calculation rules. (Here, for example, the first quartile turns
out to be observation (30 + 3)/4 = 33/4 = 8.25. What is observation number 8.25? It's ¼ of the way from
observation #8 to observation #9 on the sorted list. #8 is 8.92 and #9 is 9.11. You can find the number that
is ¼ of the way from A to B by computing (0.75  A) + (0.25  B). So, for our data, this is 0.75(8.92) +
0.25(9.11) = 8.9675, as reported.) The Excel command for the first quartile is, as you can see, =
QUARTILE(range, 1). The third quartile replaces the "1" with a "3": =QUARTILE(range, 3).
In the modified box plot, the lower whisker extends down to 1.5 * IQR below the 1 st quartile, and the upper
whisker extends up to 1.5 * IQR above the 3rd quartile. Any data points beyond the whisker's ends are
marked with dots, and identified as outliers. With the unmodified box plot, the whiskers extend all the
5
way to the most extreme data values—the maximum and minimum observations. My spreadsheet here
computes the "whisker's end" values for both cases, and I’ve provided two different spreadsheet templates
on the website to create the two kinds of box plots. For this course, you’ll be responsible only for creating
the unmodified box plots.
Modified Boxplot
0
10
20
30
Unmodified Boxplot (same data)
40
0
Annual Rainfall (inches)
10
20
30
40
Annual Rainfall (inches)
Let's take a look at the modified box plot, since we can use it to talk about both types of box plots.
We can see that there are no outliers on the lower end; no observed rainfall is more than 1.5 IQRs from the
1st quartile, so the lower tail stops at the lowest observation. The central box shows the 1 st quartile,
median, and 3rd quartile. Remember what this means. 25% of all observations represent rainfall below the
"left wall" of this box (about 8.97"). 25% of all observations lie between the left wall and the median line.
25% more lie between this median line and the "right wall" of the box (about 17.34"). Finally, the highest
25% of the rainfalls fall to the right of the "right wall" of the box. To get a rough idea of what the
histogram for this data would look like, you can imagine dumping the same amount of water in each of
these four “compartments”. The “water” between the 1 st quartile and the median would be higher than any
other “compartment”, indicating that the numbers are crunched together there more than anywhere else.
Conversely, the numbers from the third quartile to the maximum value are “spread out”—they’re not
packed in to their interval as densely.
What else? We see that there are two observations that fall above the end of the upper whisker. We
identify these as outliers—both correspond to more than 30 inches of rain.
And the unmodified box plot? How does it differ? Only in that the upper whisker extends to include both
of the pink outliers. This is the only kind of boxplot that your book uses.
While I expect you to be able to create an unmodified box plot without needing my spreadsheet, it's
unlikely you'll be able to do it (without my help) in Excel. Excel doesn't support box plots, so I did a fair
amount of work to make it draw them, anyway. When you use my spreadsheet on other data, be sure to
change the axis name so that it fits your problem.
6
Q. The mean, from grouped data
[For the mean of ungrouped data, see N.]
Problem: In a random sample of 50 college students, 5 said that they sit “in the very front” of the
class and 21 said that they sit “toward the front”. The GPA of the students in the very front was
10.94 (on a 12 point scale) while for the students who sat “toward the front”, the average GPA was
9.38. What is the average GPA of all 26 of these students?
The answer has to come out exactly as if 5 students had GPAs of 10.94 (the average for the “front” group)
and 21 students have GPAs of 9.38 (the “near front” group). Computing this is easy: Average 26 numbers,
5 of which are 10.94 and 21 of which are 9.38. If you think for a moment, you’ll realize that the math is
going to look like [(5  10.94) + (21  9.38)]/(5 + 21). We can generalize this work to find the mean from
any set of grouped data.
Finding the Mean from Grouped Data in Excel
Excel (as well as a number of software packages) expects that, when you want to do statistics, you'll type in
every single data point. Sometimes, though, like in this problem, you don't really want to do that. You
want to enter the different values observed, and how many times each value was observed. Happily, Excel
can still easily compute the average of data presented in this way. Here's how you do it.
1. Enter your data in two columns, "value" and "frequency".
2. Compute the average of the data with the command
=SUMPRODUCT(valuerange, frequencyrange)/SUM(frequencyrange)
Here, valuerange refers to the cells containing the observed values (the numbers in the "value" column).
frequencyrange refers to the cells containing the number of times each value is observed (the numbers in
the "frequency" column).
You can also find the mean of grouped data from a relative frequency distribution. The formula is even
simpler:
=SUMPRODUCT(valuerange, relfreqrange)
where relfreqrange is the range of cells containing the relative frequencies of the observed values.
We'll use this here.
Location
front
toward front
mean
# of students Average GPA
5
10.94
21
9.38
9.68 =SUMPRODUCT(C2:C3,B2:B3)/SUM(B2:B3)
You could, of course, have typed the 26 numbers in separately, then used the =AVERAGE command.
The standard deviation and variance is computed from grouped data using the same idea used here: treat
every observation in a class as though it fell at the midpoint of that class.
7
R. Comparison of measure of central tendency: mean, median, mode, midrange, midhinge
See N for mean. See O for comparison of median and mean.
Mode The mode is simply the most frequently occurring value in a data set—what single observed value
occurs most often? Some data sets have no mode, since each observation occurs only once. Other data sets
are “bimodal”—they have two different modes. The term “bimodal” is often used to describe frequency
distributions of interval or ratio data that have two prominent, nonadjacent “peaks” of comparable size in
their histograms. The mode is of limited usefulness with ungrouped data. If more people got 67 points on
the exam than any other single number, how much does that really tell you? On the other hand, suppose we
view soda consumption, and consider classes of 0 to 5 ounces, 5 to 10 ounces, and so on. It might be quite
useful for the soda company to know that when test subjects were given soda to drink in one sitting, their
modal consumption was between 10 and 15 ounces. They may decide to market 12 ounce cans rather than
8 ounce ones.
In Excel, you can find the mode of a data set with the command = MODE(range), where range is the range
of cells containing the data.
Midrange (Not used in your book.) The midrange is the value halfway between the smallest and largest
observation in a data set. It can be computed in Excel as =(MAX(range) + MIN(range))/2. It’s of limited
usefulness, especially for skewed data. How could it be useful?
Suppose a policewoman is responsible for answering calls for help along a section of route 81 that includes
exits at mile markers 30.4, 31.4, 39.4, 39.8, 40.2, and 50.4. She wants to minimize the impact of a “worst
case scenario”—that she receives a call at one exit when she is far away from it. If this is her concern,
where should she station herself along the highway? Well, since her segment of road runs from marker
30.4 to marker 50.4, she should station herself at the midrange, 40.4. In this way, she is at most 10.0 miles
from the call, regardless of where it originates.
Midhinge (Not used in your book.) The distinction between the midrange and midhinge is the same as the
distinction between the range and IQR. (See L for information on IQR.) That is, the midhinge is the point
halfway between the first and third quartile values. In Excel, it can be computed as
= (QUARTILE(range,1) + QUARTILE(range,3))/2. It, too, is of limited usefulness. The feel of it is that it
gives you the geometric middle of the “middle half” of the data.
To concoct an example, suppose that our policewoman of the midrange example, above, decides that it is
foolish to treat the exit at 50.4 miles on an equal footing with the others. Her midrange position of 40.1 is
far from the “center” of things, due to the “outlier” at 50.4. She may instead decide to worry about the
“middle 50% of the exits”. With 6 exits, Excel’s formula for 1 st quartile gives observation (6+3)/4 =
“observation 2.25”, which is 33.4. The Excel formula for the 3 rd quartile gives observation (18+1)/4 =
“observation 4.75”, or 40.1. The midhinge, the point halfway between the values, is (33.4+40.1)/2 = 36.75.
By stationing herself at mile marker 36.75, she’ll be as close as she can be to the “middle 50%” of the exits.
(Note: your book uses slightly different formulae.)
You can relate median, midhinge,
and midrange rather nicely with the
unmodified box plot. Take a look.
The point marked M is, of course,
the median. The midrange is at the
location marked “R”, in the middle
of the range. The midhinge is at the
location marked “H”, in the middle
of the “box” made by the first and
third quartile values. (The arrows
emanating from R show the range.
8
The arrows emanating from H show the IQR.)
The location of the mean and the mode (if any) aren’t readily apparent from a boxplot. If the data is
unimodal and skewed, though, then as you come in from the long tail, you’ll always encounter first the
mean, then the median, and then the mode. (In this example, our data set is to small to sensible talk about
these things.)
S. Standard deviation and variance
We’ll have much to say later about the meaning and usefulness of standard deviation and variance. For
now, though, you should know that both are measures of dispersion, or spread---just like range and IQR.
In many problems involving samples, we’ll learn that standard deviation is THE most useful measure of
dispersion. Why? For us, one of the big reasons is an additive property of variance. I’ll return to this idea
in a minute, after we do a bit of variance work.
For now, though, we’re going to have to be happy with the definitions of these rather elusive quantities, and
our ability to compute them in Excel. I’ll refer you to your textbook for the formulas themselves. Here,
we’ll focus on meaning and Excel computation. But you have to be careful with these guys. Let me
explain.
If you wanted to figure out the average number of credits taken per semester by a JMU student, you could
take a random sample of, say, 100 students. You’d then add their credit loads, divide by the number of
students, and get a number—say 16.8 credits. Now it’s unlikely that if you had done the same calculation
with all JMU students to get the exact answer, the result would have been 16.8—it probably would have
been a little more or a little less. On the other hand, it should be close, and you’d figure your answer is as
likely to be a little too small as a little too large. All of this means that you can estimate the mean of the
population by taking the mean of a sample, and that (if the sample is large enough), the mean of the sample
should be a good estimate of the thing that you really care about, which is the population mean. The
calculations that you do to find the average of a sample and the average of the population feel like the same
calculation (add everything together and divide by how many you have). In a sense, though, they’re
different, because we don’t really care about the 100 randomly selected students. We only want to use the
average for that group to say something about the average for everybody. This is the only thing a sample is
good for.
But suppose you wanted to get an idea of the “spread” of the number of credits that JMU students take, as
measured by variance and standard deviation. The population of all JMU students would have a specific
number for its variance and for its standard deviation, and we could compute it using Excel (with the
VARP function), or the formula for 2 from your book. But if we only had a sample, and we tried to use
the same formula for its variance, we’d find that, more often than not, the value we’d get for the sample
variance would be too small to be a good estimate of 2. Just why this is so is something we’ll talk about
later, but the amazing fact is that the estimate you get from the sample is a good estimate if you change the
formula for sample variance (which we write s2) by just a little bit—instead of dividing by the number of
observations in the sample (which would seem to make sense), you divide by one less—n-1. So Excel has
two different functions that can be applied to a range of numbers to give a variance.
=VARP(range)
=VAR(range)
gives 2, the population variance
gives s2, the sample variance
In real life, the data that you get is almost always a sample from a larger whole, so the VAR formula is
generally more useful than the VARP formula.
9
The standard deviation, of course, is just the square root of the variance. We can compute these in Excel,
too.
=STDEVP(range) gives , the population standard deviation
=STDEV(range)
gives s, the sample standard deviation
We use standard deviation more often than variance because the standard deviation of a set of data is
measured in the same units as the data itself. So if you have data in gallons of water, the standard deviation
is measured in gallons, while the variance is measured in “gallons2”, whatever that means.
So why use variance at all? And why use such a freaky idea as variance OR standard deviation, when
something like interquartile range is a lot easier to understand and probably easier to compute?
The answer is that, in a lot of stat, we’re taking a sample and then finding the mean of that sample. The
first thing that you do when you find a mean is to add together the observations. And it turns out that
variance acts very nicely when you add observations together. Let me show you.
Suppose that you have five wooden blocks of various heights, and I have five wooden blocks of various
heights. What we’re going to do is this: You pick one of your blocks at random, and I’ll pick one of my
blocks at random, and we’ll make a tower by stacking these blocks up. The question is—what can you say
about the height of the resulting tower? To lock this down, let’s be specific. Let’s say that your blocks
have heights of 1”, 2”, 4”, 4” and 4”. My blocks have heights of 1”, 1”, 2”, 3” and 6”. Let’s have Excel
compute some values from this: First, we’ll look at the 25 different towers we could build—5 choices for
you, 5 choices for me.
My choice
1
1
2
3
6
1
2
2
3
4
7
Your choice
2
3
3
4
5
8
4
5
5
6
7
10
4
5
5
6
7
10
4
5
5
6
7
10
Now I’ll ask Excel to find a bunch of stats—for your blocks, for my blocks, and for our tower. Here they
are.
Your block
Our
Tower
My Block
Mean
3
2.6
5.6
Standard Error
0.632455532 0.92736185 0.458258
Median
4
2
5
Mode
4
1
5
Standard Deviation 1.414213562 2.073644135 2.291288
Sample Variance
2
4.3
5.25
Population Variance
1.6
3.44
5.04
Population Std Dev 1.264911064 1.854723699 2.244994
Range
3
5
8
Minimum
1
1
2
Maximum
4
6
10
Sum
15
13
140
Count
5
5
25
10
Note that your mean plus my mean = the tower’s mean. That makes sense. If your blocks are on average
3” high, and my blocks are on average 2.6 inches high, then our tower is on average 5.6” high. Note that
the median does not have this additive property—4 + 2 is not 5. The mode worked in this example, but
won’t usually.
How about the measures of dispersion? Nothing going on with (sample) standard deviation and sample
variance…but look at population variance! 1.6 + 3.44 = 5.04. Our variances add! This property also
holds for range (3 + 5 = 8), but range isn’t really a very useful measure of spread, since it only depends on
the most extreme data points—here, our shortest and tallest blocks. But variance takes all of the blocks
into account, and it still adds. It turns out that this property is incredibly useful, and we’ll need it very
soon, when we begin to draw conclusions from samples.
Oh, by the way, this variance addition ONLY works because MY choice of block was independent of
YOUR choice of block. If you picked a tall one whenever I picked a tall one, for example, the relation
would have failed.
11