Download Describing Distributions with Numbers.

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Degrees of freedom (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Misuse of statistics wikipedia , lookup

History of statistics wikipedia , lookup

Time series wikipedia , lookup

Transcript
1.2 Describing Distributions
with Numbers
Example. Table 1.3 (see TM-13) gives the ages of the presidents at
inauguration and Figure 1.10 (see TM-14) gives a histogram of the data.
Measuring Center: The Mean
Definition. If n observations are denoted by x1 , x2, . . . , xn, their mean
is
x1 + x2 + · · · + xn
n
or in more compact notation
x=
n
1
x=
xi .
n i=1
Example 1.6. The mean of the data in Table 1.3 (see TM-13) is 54.8
years.
Note. To compute the mean of a data set using the Sharp EL-546G,
do the following:
• Put the calculator in “statistics mode” by pressing MODE and
3.
• Press 0 to put the calculator in single-variable statistics mode
(ST0 appears in the display).
1
• Press 2ndF and CA to clear the statistics memory.
• Enter the data and press DATA (on the M+ key) after each
entry.
• To get the mean, press RCL and x (the 4 key).
See pages 36-40 of the calculator owner’s manual for more details.
Measuring Center: The Median
Definition. The median of a data set is the “middle value.” To find
the median:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center
observation in the ordered list. Find the location of the median by
counting (n + 1)/2 observations up from the bottom of the list.
3. If the number of observations n is even, the median M is the mean
of the two center observations in the ordered list. The location of
the median is again (n + 1)/2 from the bottom of the list.
Example 1.8. The median of this data set is 34:
20 25 25 27 28 31 33 34 36 37 44 50 59 85 86
The median of this data set is 18.5:
5 7 10 14 18 19 25 29 31 33
2
Comparing the Mean and the Medain
Note. The mean and median of a symmetric distribution are close
together. In a skewed distribution, the mean is farther out in the
long tail than is the median. This is because a few outliers can
changed the mean, but may have no effect on the median.
Measuring Spread: The Quartiles
Definition. The range of a data set is the difference between
the largest and smallest observations. The first quartile Q1 lies
one-quarter of the way up the list, the third quartile Q3 lies threequarters of the way up the list.
Note. To calculate the quartiles:
1. Arrange the observations in increasing order and locate the median M in the ordered list of observations.
2. The first quartile Q1 is the median of the observations whose
position in the ordered list is to the left of the location of the
overall median.
3. The third quartile Q2 is the median of the observations whose
position in the ordered list is to the right of the location of the
overall median.
Example 1.10. We saw above that the median of the data set:
20 25 25 27 28 31 33 34 36 37 44 50 59 85 86
3
is 34. The first quartile is the median of the 7 observations to the
left of the median, and so Q1 = 27. Similarly, Q3 = 50. For the
data:
5 7 10 14 18 19 25 29 31 33
we have Q1 = 10 and Q3 = 29.
The Five-Number Summary and Boxplots
Definition. The five-number summary of a data set consists of
the smallest observation, the first quartile, the median, the third
quartile, and the largest observation, written in order from smallest
to largest. In symbols the five-number summary is:
Minimum Q1
M
Q3
Maximum
A boxplot is a way to graphically represent the five-number summary.
Example. The data above is represented in a boxplot in Figure
1.11 (see TM-15).
Measuring Spread: The Standard Deviation
Definition. The variance s2 of a set of observations is the average
of the squares of the deviations of the observations from their
mean. In symbols, the variance of n observations x1 , x2, . . . , xn
is
s2 =
(x1 − x)2 + (x2 − x)2 + · · · + (xn − x)2
n−1
4
or more compactly,
s2 =
n
1 (xi − x)2 .
n − 1 i=1
The standard deviation s is the square root of the variance s2 :
s=
n
1 (xi − x)2 .
n − 1 i=1
(Some texts call these the “sample variance” and “sample standard deviation” - versus the population variance and standard deviation.)
Example 1.11. Consider the data set:
1792 1666 1362 1614 1460 1867 1439
The mean is x = 1600. We can calculate the variance as:
Observations
Deviations
Squared Deviations
xi
xi − x
(xi − x)2
1792
1792 − 1600 = 192
1922 = 36, 864
1666
1666 − 1600 = 66
662 = 4, 356
1362
1362 − 1600 = −238
(−238)2 = 56, 644
1614
1614 − 1600 = 14
142 = 196
1460
1460 − 1600 = −140
(−140)2 = 19, 600
1867
1867 − 1600 = 267
2672 = 71, 289
1439
1439 − 1600 = −161
(−161)2 = 25, 921
sum = 0
sum = 214,870
So the variance is
s2 =
n
1 1
(xi − x)2 = (214, 870) = 35, 811.67.
n − 1 i=1
6
5
The standard deviation is
s = 35, 811.67 = 189.24.
Note. Your Sharp EL-546G calculator can much more easily calculate variance and standard deviation. Do the following:
• Put the calculator in “statistics mode” by pressing MODE
and 3 .
• Press 0 to put the calculator in single-variable statistics mode
(ST0 appears in the display).
• Press 2ndF and CA to clear the statistics memory.
• Enter the data and press DATA (on the M+ key) after each
entry.
• To get the (sample) standard deviation, press RCL and sx
(the 5 key).
See pages 36-40 of the calculator owner’s manual for more details.
Note. Some properties of the standard deviation are:
• s measures spread about the mean and should be used only
when the mean is chosen as the measure of center.
• s = 0 only when there is no spread. This happens only when
all observations have the same value. Otherwise s > 0. As
the observations become more spread out about their mean, s
gets larger.
• s, like the mean x, is strongly influenced by extreme observa6
tions. A few outliers can make s very large.
Note. The five-number summary is usually better than the mean
and standard deviation for describing a skewed distribution. Use
x and s only for reasonably symmetric distributions.
7