Download Document

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

Regression toward the mean wikipedia , lookup

Time series wikipedia , lookup

Transcript
Demonstrations II
Scott Stevens
J. Arithmetic Mean (Average)
[For the mean from grouped data, see M.]
Problem: During the 1980 presidential campaign, Ronald Reagan repeatedly asked voters if they
were better off in 1980 than they were 4 years before. Here are some data (in yellow) on the
unemployment and inflation rates during the Carter Administration (’77 to ’80) and the Reagan
Administration (’81 to ’88). Is there any difference between the average unemployment rate in the
two administrations? Average inflation rate?
This is quite easy to compute by hand. Assuming that by "average" the problem means "mean", we just
take the numbers to be averaged, add them, then divide by the number of numbers added. We can also do
this with the Excel =AVERAGE(range) function.
A
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Year
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
B
Unemployment
7.1
6.1
5.8
7.1
7.6
9.7
9.6
7.5
7.2
7
6.2
5.5
Carter Average
Reagan Average
6.525
7.5375
C
Inflation
6.7
9
13.3
12.5
8.9
3.8
9.6
7.5
7.2
7
6.2
5.5
10.375
6.9625
So the Carter unemployment average, for example, was computed as = AVERAGE(B2:B5), since the
unemployment figures for the Carter years were in spreadsheet cells B2 through B5.
The result show that unemployment was lower during the Carter years, but that inflation was considerably
lower during the Reagan years. The question that remains is whether the differences in these figures are
too large to be credited to random fluctuation alone. That question is one we take up later in the semester.
It’s important to note that the average, a single number, cannot tell the whole story of a data set. Be careful
what conclusions you draw from a “one number summary”!
K. Median (and comparison with the mean)
Problem: Below is a table of the rainfall recorded in the Los Angeles area in the last 10 years.
(Unlike most of these demonstrations, I’ve made up this data for this one.) Compute the median
rainfall based on these data. Without computing the mean, state whether the mean rainfall for this
ten year period would be above, below, or equal to this median value.
1
rainfall in a year
3
median rainfall
4
5
5
5
6
6
14
17
5.5
The first step is to sort the data, smallest to largest –I’ve already done this with the data. Then, in a set
of 10 observations, the median should be observation number (10+1)/2 = 5.5…that is, the number
halfway between observation #5 and observation #6. Since #5 is 5” and #6 is 6” in this case, the
median is 5.5”. You can also get this from Excel, by using the =MEDIAN(range) command.
Now, how does this compare to the mean?
If you imagine these bars sitting on a see-saw along the number line of the x-axis, the mean would be
the balance point of the see-saw. (Think of where you’d have to put the fulcrum (or balance point) of
the see-saw to make the “kids” balance. You should be able to see that it’s somewhere near where I
put it below, at 8.5. The three little kids on the right side at positions 14, 17, and 20 will just balance
the two small and two big kids on the left side (at positions 3, 4, 5, and 6).)
4
Big kid at the "5" mark
weighs 3 units
3
2
Little kid way out
at the "20" mark
weighs 1 unit
1
0
3
4
5
6
7
8
9 10 11 12 13 14 15 16 17 18 19 20
Balance point
On the other hand, the median point for a distribution is the point at which half of the "stuff" is to the
left and half of the "stuff" is to the right. If we stick with our see-saw imagery, "stuff" means "weight",
2
20
and I hope that you can see that 5 units of weight lie to the left of 5.5 (the median value), and 5 units of
weight lie to the right of 5.5
This way of thinking about mean and median is quite useful, so take some time to lock it down. What
you can learn from this example holds in general. Note, for example:

If the last kid slides from position 20 to position 25 on the see-saw, the balance point would
shift to the right (a higher mean), but the median ("half-weight") point would remain
unchanged. In general, the mean is more affected by extreme values than then median.

If the distribution has a long "tail" in one direction (like this one does to the right—it is
"skewed right"), then the mean tends to be more than the median. In general, if a distribution
is unimodal (one highest point) and skewed, then as you run up the long slope of the "hill",
you'll encounter the mean, median, and mode, in that order. The mode, of course, will be at
the top of the hill.

If the distribution is symmetric ("mirror imaged, left to right"), then the mean and median are
equal. If the symmetric distribution is also unimodal (one highest point), then the mode is
also equal to the mean and the median.
Don't memorize these results—understand them. And think about what the measures given in a given
problem really tell you. For example, suppose you’re told that the average number of people in a US
household is 2.4 persons. This is the mean (since it clearly isn't the median or mode!). It is the "balance
point" of the population distribution. Saying it another way—if all of the people in all of the households
were gathered together, and then distributed evenly among all of those households…well, we'd have a
bloody mess, since each household would get 2.4 people.
But that doesn't mean that 2.4 is "typical", or even close to "typical". For example, if 80% of all
households have 1 person, and the remaining 20% all had 8 persons, the average would be 2.4.
The mean is a measure of "middle", but we almost always need a measure of "spread", as well. The most
common is standard deviation.
L. Box Plots (Modified and Unmodified), Quartiles, and IQR
Problem: In the yellow box below, find the annual inches of precipitation at the Los Angeles Civic
Center for the years 1961 to 1990. Summarize this data with a boxplot and modified box plot.
It's not hard to do this work by hand, but the goal of this course is to give you tools that you can use
effectively and responsibly. To help you with this, I've written a number of spreadsheet templates that will
perform common statistical tasks, to supplement those functions already a part of Excel. Sometimes, my
templates duplicate functions already available in Excel. When I do this, it is for one of two reasons.
Either 1) Excel's built-in function is restrictive about how the input data must be supplied, or 2) Excel's
built-in function is not helpful in understanding how the answer is obtained. Since you must understand a
tool clearly in order to use it effectively, my template often provide a "step-by-step" approach.
I'll also be providing templates to perform some of the tasks that Excel can't do at all, or can only do
incompletely. That's what I've done for this example. I've created a couple of templates (available on the
website) to do box plots and modified box plots. .
Before you start using my templates, though, there are some general things you should know about them.
Please check the website post entitled Using Stevens’ Statistical Templates: Useful Information. It uses
this problem as an example.
3
Data
Outlier
?
4.56
5.83
6.49
6.54
7.58
7.98
8.9
8.92
9.11
9.26
9.98
10.7
10.92
11.01
12.31
12.91
14.41
14.97
15.37
16.54
16.69
17
17.45
18
23.66
26.32
26.33
26.81
30.57 outlier
34.04 outlier
Statistics
Values
mean, x-bar
standard deviation, s
2 deviations below mean
1 deviation below mean
1 deviation above mean
2 deviations above mean
Formulas
14.70533
7.810534
-0.91574
6.894799
22.51587
30.3264
=AVERAGE(range)
= STDEV(range)
= x-bar - 2 * s
= x-bar - 1 * s
= x-bar + 1 * s
= x-bar + 2 * s
12.61
8.9675
17.3375
8.37
4.56
29.8925
=MEDIAN(range)
=QUARTILE(range, 1)
=QUARTILE(range, 3)
= 3rd quartile - 1st quartile
= 1st quartile - 1.5 * IQR or smallest val
= 3rd quartile + 1.5 * IQR or largest val
median
1st quartile
3rd quartile
IQR
lower whisker
upper whisker
Frequency Table for Histogram
Modified Boxplot
category size
5
Values from
to
4
9
9
14
14
19
19
24
24
29
29
34
34
39
0
10
20
30
frequency
8
8
8
1
3
1
1
40
Annual Rainfall (inches)
We compute the numbers needed for a modified box plot and unmodified box plot. Let's start with the
modified box plot. Note the commands that Excel uses to find median, 1 st quartile, and 3rd quartile. The
interquartile range (IQR) is just the difference between the first and third quartile.
Formulas for first and third quartile. Different sources compute the quartiles slightly differently. Excel
computes the first quartile as observations number (n +3)/4 in the sorted list of n observations, while your
text uses observation number (n+1)/4. The median is observation (n+1)/2, as in your book. The third
quartile is computed in Excel as observation number (3n+1)/4, while your book uses observation number
(3n+3)/4. We’ll be happy with either of these calculation rules. (Here, for example, the first quartile turns
out to be observation (30 + 3)/4 = 33/4 = 8.25. What is observation number 8.25? It's ¼ of the way from
observation #8 to observation #9 on the sorted list. #8 is 8.92 and #9 is 9.11. You can find the number that
is ¼ of the way from A to B by computing (0.75  A) + (0.25  B). So, for our data, this is 0.75(8.92) +
0.25(9.11) = 8.9675, as reported.) The Excel command for the first quartile is, as you can see, =
QUARTILE(range, 1). The third quartile replaces the "1" with a "3": =QUARTILE(range, 3).
4
In the modified box plot, the lower whisker extends down to 1.5 * IQR below the 1st quartile, and the upper
whisker extends up to 1.5 * IQR above the 3rd quartile. Any data points beyond the whisker's ends are
marked with dots, and identified as outliers. With the unmodified box plot, the whiskers extend all the
way to the most extreme data values—the maximum and minimum observations. My spreadsheet here
computes the "whisker's end" values for both cases, and I’ve provided two different spreadsheet templates
on the website to create the two kinds of box plots. For this course, you’ll be responsible only for creating
the unmodified box plots.
Modified Boxplot
0
10
20
30
Unmodified Boxplot (same data)
40
0
Annual Rainfall (inches)
10
20
30
40
Annual Rainfall (inches)
Let's take a look at the modified box plot, since we can use it to talk about both types of box plots.
We can see that there are no outliers on the lower end; no observed rainfall is more than 1.5 IQRs from the
1st quartile, so the lower tail stops at the lowest observation. The central box shows the 1 st quartile,
median, and 3rd quartile. Remember what this means. 25% of all observations represent rainfall below the
"left wall" of this box (about 8.97"). 25% of all observations lie between the left wall and the median line.
25% more lie between this median line and the "right wall" of the box (about 17.34"). Finally, the highest
25% of the rainfalls fall to the right of the "right wall" of the box. To get a rough idea of what the
histogram for this data would look like, you can imagine dumping the same amount of water in each of
these four “compartments”. The “water” between the 1 st quartile and the median would be higher than any
other “compartment”, indicating that the numbers are crunched together there more than anywhere else.
Conversely, the numbers from the third quartile to the maximum value are “spread out”—they’re not
packed in to their interval as densely.
What else? We see that there are two observations that fall above the end of the upper whisker. We
identify these as outliers—both correspond to more than 30 inches of rain.
And the unmodified box plot? How does it differ? Only in that the upper whisker extends to include both
of the pink outliers.
While I expect you to be able to create an unmodified box plot without needing my spreadsheet, it's
unlikely you'll be able to do it (without my help) in Excel. Excel doesn't support box plots, so I did a fair
amount of work to make it draw them, anyway. When you use my spreadsheet on other data, be sure to
change the axis name ("Annual Rainfall (inches)") so that it fits your problem.
5
M. The mean, from grouped data
[For the mean of ungrouped data, see J.]
Problem: In a random sample of 50 college students, 5 said that they sit “in the very front” of the
class and 21 said that they sit “toward the front”. The GPA of the students in the very front was
10.94 (on a 12 point scale) while for the students who sat “toward the front”, the average GPA was
9.38. What is the average GPA of all 26 of these students?
The answer has to come out exactly as if 5 students had GPAs of 10.94 (the average for the “front” group)
and 21 students have GPAs of 9.38 (the “near front” group). Computing this is easy: Average 26 numbers,
5 of which are 10.94 and 21 of which are 9.38. If you think for a moment, you’ll realize that the math is
going to look like [(5  10.94) + (21  9.38)]/(5 + 21). We can generalize this work to find the mean from
any set of grouped data.
Finding the Mean from Grouped Data in Excel
Excel (as well as a number of software packages) expects that, when you want to do statistics, you'll type in
every single data point. Sometimes, though, like in this problem, you don't really want to do that. You
want to enter the different values observed, and how many times each value was observed. Happily, Excel
can still easily compute the average of data presented in this way. Here's how you do it.
1. Enter your data in two columns, "value" and "frequency".
2. Compute the average of the data with the command
=SUMPRODUCT(valuerange, frequencyrange)/SUM(frequencyrange)
Here, valuerange refers to the cells containing the observed values (the numbers in the "value" column).
frequencyrange refers to the cells containing the number of times each value is observed (the numbers in
the "frequency" column).
You can also find the mean of grouped data from a relative frequency distribution. The formula is even
simpler:
=SUMPRODUCT(valuerange, relfreqrange)
where relfreqrange is the range of cells containing the relative frequencies of the observed values.
We'll use this here.
Location
front
toward front
mean
# of students Average GPA
5
10.94
21
9.38
9.68 =SUMPRODUCT(C2:C3,B2:B3)/SUM(B2:B3)
You could, of course, have typed the 26 numbers in separately, then used the =AVERAGE command.
6