MATH2560 C F03
Elementary Statistics I
Lecture 2: Describing Distributions with Numbers.
1 Outline.
⇒ mean;
⇒ median;
⇒ quartiles;
⇒ boxplots;
⇒ variance;
⇒ standard deviation;
⇒ linear transformation.
2 Description of a Distribution with Numbers
A numerical summary of a distribution should report its center and its
spread or variability, and a brief description should also include its shape (as shown by histograms and stemplots).
3 Measuring Center: the Mean
The mean x̄ describes the arithmetic average of the observations.
The Mean x̄
To find the mean of a set of observations, add their values and divide by the number
of observations. If the n observations are x1 , x2 , ..., xn , their mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
or, in more compact notation,
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
3.1 Examples.
1. The Babe’s mean number of home runs hit in a year is:
$$\bar{x} = \frac{1}{15}(54 + 59 + \cdots + 22) = 43.9.$$
2. Roger Maris’s mean, from the data of Lecture 1, is:
$$\bar{x} = \frac{1}{10}(8 + 13 + \cdots + 61) = \frac{261}{10} = 26.1.$$
Ruth’s superiority is evident from these averages: 43.9 > 26.1.
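The two averages above can be checked with a short sketch. The function name `mean` is our own choice, and the full data lists are the ones given in sorted order in the median examples of these notes (the order of the values does not affect the mean).

```python
def mean(observations):
    """Arithmetic mean: add the values and divide by their number n."""
    n = len(observations)
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(observations) / n


# Yearly home run totals, copied from the sorted lists in these notes.
ruth = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

print(round(mean(ruth), 1))  # 43.9
print(mean(maris))           # 26.1
```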
If a numerical description can resist the influence of extreme observations,
we say that it is a resistant measure. The mean cannot resist such influence,
so the mean is not a resistant measure of center.
4 Measuring Center: the Median
The median M describes the midpoint of the observations.
The Median M
The median M is the midpoint of a distribution, the number such that half the
observations are smaller and the other half are larger. To find the median of a
distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in
the ordered list. Find the location of the median by counting (n + 1)/2 observations
up from the bottom of the list.
3. If the number of observations n is even, the median M is the mean of the two
center observations in the ordered list. The location of the median is again (n + 1)/2
from the bottom of the list.
4.1 Examples.
1. Babe Ruth median:
1. Arrange the data in increasing order:
22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60.
2. The median is the middle 46, the eighth observation in the ordered list.
3. You can also find it using the recipe (n + 1)/2 = 16/2 = 8 to locate
the median in the list.
2. Roger Maris median:
1. Arrange the data in increasing order:
8, 13, 14, 16, 23, 26, 28, 33, 39, 61.
2. The number of observations is even: n = 10. Hence,
$$M = \frac{23 + 26}{2} = \frac{49}{2} = 24.5.$$
3. The recipe (n + 1)/2 = 11/2 = 5.5 for the position of the median in the
list means that the median is at location "five and one-half", that is, halfway
between the fifth and sixth observations.
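The recipe and both examples above can be sketched as follows. This is a minimal illustration, and the function name `median` is our own choice.

```python
def median(observations):
    """Median by the lecture's recipe: sort, then take the center value."""
    ordered = sorted(observations)  # step 1: arrange in increasing order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the center observation, at position (n + 1)/2 counting from 1
        return ordered[mid]
    # n even: the mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


ruth = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

print(median(ruth))   # 46
print(median(maris))  # 24.5
```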
The mean and median describe the center of a distribution in different
ways.
4.2 Mean versus Median
The median is a "middle value", whereas the mean is an "arithmetic average value".
For an exactly symmetric distribution, the mean and median are exactly the same.
In a skewed distribution, the mean is farther out in the long tail than is
the median.
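A small check of how differently the two measures react to a single extreme observation. As an illustration only, we compare Maris's ten seasons with the same list after dropping his record 61; removing that value is our own hypothetical modification, not part of the lecture. The sketch uses Python's standard `statistics` module.

```python
import statistics

maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]
# Hypothetical variant for illustration: the same data without the record 61.
without_record = [x for x in maris if x != 61]

mean_shift = statistics.mean(maris) - statistics.mean(without_record)
median_shift = statistics.median(maris) - statistics.median(without_record)

print(round(statistics.mean(maris), 1), statistics.median(maris))
print(round(statistics.mean(without_record), 1), statistics.median(without_record))
print(round(mean_shift, 1), median_shift)
```

The mean moves by almost 4 home runs when the extreme value is included, while the median moves by 1.5, which is consistent with the mean being pulled toward the long tail.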
5 Measuring Spread: The Quartiles
The quartiles are used to describe the spread of a distribution when the
median is used to describe its center.
The pth percentile of a distribution is the value such that p percent of the
observations fall at or below it.
The most commonly used percentiles other than the median (the 50th percentile)
are the quartiles.
The first quartile is the 25th percentile, and the third quartile is the 75th
percentile.
The first quartile Q1 has 1/4 of the observations below it.
The third quartile Q3 has 3/4 of the observations below it.
How to Calculate the Quartiles Q1 and Q3 ?
To calculate the quartiles:
1. Arrange the observations in increasing order and
locate the median M in the ordered list of observations.
2. The first quartile Q1 is the median of the
observations whose position in the ordered list is to
the left of the location of the overall median.
3. The third quartile Q3 is the median of the
observations whose position in the ordered list is to
the right of the location of the overall median.
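The three steps above can be sketched as follows. The helper names are our own. Note that this is the rule stated in these notes; statistical software often uses slightly different quartile conventions, so results may differ a little for small data sets.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(observations):
    """Return (Q1, M, Q3) using the lecture's rule."""
    ordered = sorted(observations)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    # Observations strictly to the left / right of the median's location.
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


ruth = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]
print(quartiles(ruth))  # (35, 46, 54)
```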
5.1 Example 1.15.
In Example 1.5 (shopping data) let us arrange the data in order after rounding to eliminate the cents. We obtain
3, 9, 9, 11, ..., 28, ||, 28, ..., 86, 86, 93.
There are n = 50 observations, so the position of the median is (50 + 1)/2 = 25.5,
midway between the 25th and 26th observations. This location is marked
by ||. To find the first quartile, consider the 25 observations falling to the left
of the location || of the median. The median of these 25 observations is the
thirteenth in order, or $19. This is the first quartile. The third quartile
is the thirteenth value above the location ||, or $45. Summarizing these results in compact
form, we obtain:
Q1 = $19
M = $28
Q3 = $45.
It is easy to find other percentiles. Since 0.95 × 50 = 47.5, we take the 95th
percentile of the 50 observations to be the 48th in the ordered list, namely
$86.
The median of Ruth’s 15 home run totals
22, 25, 34, ..., 46, 46, 46, ..., 59, 60
is 46. The first quartile is the median of the seven observations
falling to the left of this point in the list, Q1 = 35. Similarly, Q3 = 54.
6 Measuring Spread: the Interquartile Range
The interquartile range (IQR) is the difference between the quartiles:
IQR = Q3 − Q1 .
It is the spread of the center half of the data.
Example 1.15: Shopping Data: IQR = 45 − 19 = $26. The quartiles
and the IQR are not affected by changes in either tail of the distribution.
They are therefore resistant.
6.1 The 1.5 × IQR Criterion for Outliers
The 1.5 × IQR criterion flags observations more than 1.5 × IQR beyond
the quartiles as possible outliers. It calls an observation a suspected outlier
if it falls more than 1.5 × IQR above the third quartile or below the first
quartile.
Example 1.15. 1.5 × IQR = 1.5 × $26 = $39. Any values below
19 − 39 = −$20 or above 45 + 39 = $84 are flagged as possible outliers.
The flagged values $86 and $93 do not appear to be outliers in the sense of
deviations from the overall pattern of the distribution.
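The criterion can be sketched as follows, using the quartiles Q1 = 19 and Q3 = 45 from Example 1.15. The function names are our own, and because the notes list only part of the 50 shopping amounts, the check is run on just the values that are visible in the listing above.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def suspected_outliers(values, q1, q3):
    """Values falling more than 1.5 * IQR below Q1 or above Q3."""
    low, high = iqr_fences(q1, q3)
    return [v for v in values if v < low or v > high]


low, high = iqr_fences(19, 45)
print(low, high)  # -20.0 84.0

# Only the amounts visible in the elided listing of Example 1.15.
visible = [3, 9, 9, 11, 28, 28, 86, 86, 93]
print(suspected_outliers(visible, 19, 45))  # [86, 86, 93]
```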
7 The Five-Number Summary
The five-number summary (FNS) provides a quick overall description of
a distribution.
The FNS consists of the median, the two quartiles, and the smallest and largest individual observations (IO).
The median describes the center, and the quartiles and extremes (smallest
and largest IO) show the spread.
The Five-Number Summary
Minimum   Q1   M   Q3   Maximum
Example 1.15. The five-number summary is:
3, 19, 28, 45, 93.
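A minimal sketch of the five-number summary, applied to Ruth's home run totals because the notes give that data set in full. The function names are our own, and the quartiles follow the rule stated earlier in these notes.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(observations):
    """Return (minimum, Q1, M, Q3, maximum)."""
    ordered = sorted(observations)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (
        ordered[0],
        median_of_sorted(lower),
        median_of_sorted(ordered),
        median_of_sorted(upper),
        ordered[-1],
    )


ruth = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]
print(five_number_summary(ruth))  # (22, 35, 46, 54, 60)
```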
8 Boxplots.
Another visual representation of a distribution is the boxplot. Boxplots based
on the five-number summary are useful for comparing several distributions.
The box spans the quartiles and shows the spread of the central half of
the distribution.
The median is marked within the box.
Lines extend from the box to the extremes and show the full spread of
the data. (The points identified by the 1.5 × IQR criterion are often plotted
individually.)
Boxplot
A boxplot is a graph of the five-number summary,
with suspected outliers plotted individually.
1. A central box spans the quartiles.
2. A line in the box marks the median.
3. Observations more than 1.5 × IQR outside the central
box are plotted individually as possible outliers.
4. Lines extend from the box out to the smallest and largest
observations that are not suspected outliers.
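The four ingredients of a boxplot can be computed before anything is drawn. The sketch below returns them as a dictionary; the function and key names are our own, and the extra value 100 in the second call is a hypothetical observation added only to show how an outlier is separated from the whiskers.

```python
def median_of_sorted(ordered):
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def boxplot_elements(observations):
    """Box, median line, whisker ends and suspected outliers of a boxplot."""
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    q1 = median_of_sorted(ordered[:half])
    q3 = median_of_sorted(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    low = q1 - 1.5 * (q3 - q1)
    high = q3 + 1.5 * (q3 - q1)
    inside = [x for x in ordered if low <= x <= high]
    return {
        "box": (q1, q3),                       # 1. central box spans the quartiles
        "median": median_of_sorted(ordered),   # 2. line inside the box
        "outliers": [x for x in ordered if x < low or x > high],  # 3.
        "whiskers": (inside[0], inside[-1]),   # 4. extremes that are not outliers
    }


ruth = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]
print(boxplot_elements(ruth))
# Hypothetical extra season of 100 home runs, for illustration only.
print(boxplot_elements(ruth + [100]))
```

For Ruth's actual data no value is flagged, so the whiskers reach the minimum and maximum. With the hypothetical 100 added, that value is plotted individually and the upper whisker stops at 60.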
8.1 Comparing Distributions
Boxplots are most useful for comparing distributions. Consider the following
test results for calories and milligrams of sodium in a number of hot dogs (see
Table 1.8).
The five-number summaries of the distributions of calories
Type      Min.   Q1      M       Q3      Max.
Beef      111    140     152.5   178.5   190
Meat      107    138.5   153     180.5   195
Poultry   86     100.5   129     143.5   170
We can see that no observations fall more than 1.5 × IQR outside the quartiles. Figure 1.16 presents boxplots based on these calculations.
Figure 1.16 also illustrates why boxplots are generally inferior to stemplots and histograms as displays of a single distribution.
Let us make a stemplot of the calorie content of the 17 brands of meat
hot dogs:
Stemplot of the calorie content
10 | 7
11 |
12 |
13 | 5689
14 | 067
15 | 3
16 |
17 | 2359
18 | 2
19 | 015
There are two distinct clusters and one outlier in the lower tail. The
boxplot hid the clusters.
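The stemplot carries enough information to recover the 17 calorie values, so we can check the "Meat" row of the table and look at the gaps that the boxplot hides. This is a sketch under our own naming; the dictionary simply transcribes the stems and leaves shown above.

```python
# Stems (hundreds and tens) mapped to their leaves (units), from the stemplot.
stemplot = {
    10: "7", 11: "", 12: "", 13: "5689", 14: "067",
    15: "3", 16: "", 17: "2359", 18: "2", 19: "015",
}
calories = sorted(10 * stem + int(leaf)
                  for stem, leaves in stemplot.items() for leaf in leaves)


def median_of_sorted(ordered):
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


n = len(calories)
half = n // 2
lower = calories[:half]
upper = calories[half + 1:] if n % 2 == 1 else calories[half:]
summary = (calories[0], median_of_sorted(lower), median_of_sorted(calories),
           median_of_sorted(upper), calories[-1])
print(summary)  # (107, 138.5, 153, 180.5, 195)

# The two widest gaps between neighbouring values separate the low value
# from the first cluster and the first cluster from the second.
pairs = list(zip(calories, calories[1:]))
widest = sorted(pairs, key=lambda p: p[1] - p[0], reverse=True)[:2]
print(widest)  # [(107, 135), (153, 172)]
```

The summary matches the table, and with 1.5 × IQR = 63 the value 107 lies inside the cutoff 138.5 − 63 = 75.5, so it stands apart in the stemplot without being flagged by the criterion.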
9 Measuring Spread: the Standard Deviation
The variance $s^2$ and especially its square root, the standard deviation
s, are common measures of spread about the mean as center.
Variance
The variance $s^2$ of a set of observations is the average of the
squares of the deviations of the observations from their mean.
In symbols, the variance of n observations x1 , x2 , ..., xn is
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
or
$$s^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$
Note that this average uses the divisor n − 1 rather than n.
The idea of the variance is clear: it is the average of the squares of the
deviations (xi − x̄), i = 1, ..., n, of the observations xi from their mean x̄.
The standard deviation measures spread by looking at how far the
observations are from their mean.
The standard deviation s is zero when there is no spread.
It gets larger as the spread increases.
The Standard Deviation s
s is the square root of the variance $s^2$:
$$s = \sqrt{\frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$
Example 1.18. Metabolic rates of seven men (in calories):
1792, 1666, 1362, 1614, 1460, 1867, 1439.
Here x̄ = 1600 calories, the first deviation is x1 − x̄ = 1792 − 1600 = 192 calories,
and s = 189.24 calories.
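The numbers in Example 1.18 can be reproduced with a direct transcription of the definitions. The function names are our own; Python's standard `statistics.stdev` also uses the n − 1 divisor and serves as a cross-check.

```python
import math
import statistics


def variance(observations):
    """Sample variance: sum of squared deviations divided by n - 1."""
    n = len(observations)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(observations) / n
    return sum((x - xbar) ** 2 for x in observations) / (n - 1)


def standard_deviation(observations):
    """Square root of the sample variance."""
    return math.sqrt(variance(observations))


rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

print(sum(rates) / len(rates))               # 1600.0
print(round(standard_deviation(rates), 2))   # 189.24
print(round(statistics.stdev(rates), 2))     # 189.24
```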
9.1 Properties of Standard Deviation
Properties of s
1. s measures spread about the mean and should be used
only when the mean is chosen as the measure of center.
2. s = 0 only when there is no spread. This happens only when all
observations have the same value. Otherwise, s > 0. As the observations
become more spread out about their mean, s gets larger.
3. s, like the mean x̄, is not resistant.
A few outliers can make s very large.
9.2 Choosing Measures of Center and Spread
Choosing Summary
The five-number summary is usually better
than the mean and standard deviation for describing
a skewed distribution or a distribution with strong
outliers.
Use x̄ and s only for reasonably symmetric
distributions that are free of outliers.
10 Changing the Unit of Measurement
Linear Transformations
A linear transformation changes the original variable x
into the new variable $x_{\text{new}}$ given by an equation of the form
$$x_{\text{new}} = a + bx.$$
Adding the constant a shifts all values of x upward or downward
by the same amount. Such a shift changes the origin (zero point)
of the variable. Multiplying by the positive constant b changes
the size of the unit of measurement.
10.1 Example 1.20.
(a) Transforming kilometers into miles:
$$x_{\text{new}} = 0.62x.$$
So 10 km is 6.2 miles.
(b) Transforming Fahrenheit into Celsius:
$$x_{\text{new}} = \frac{5}{9}(x - 32) = -\frac{160}{9} + \frac{5}{9}x.$$
So 95°F is 35°C: $35 = x_{\text{new}} = -\frac{160}{9} + \frac{5}{9} \cdot 95$.
A linear transformation (LT) changes the origin if a ≠ 0 and changes the size of the unit of
measurement if b ≠ 1 (with b > 0).
LTs do not change the overall shape of a distribution.
An LT multiplies a measure of spread by b, and changes a percentile or
measure of center m into a + bm.
Effect of Linear Transformations
Apply the following rules to see the effect of an LT
on measures of center and spread:
1. Multiplying each observation by a positive number b
multiplies both measures of center (mean and median)
and measures of spread (interquartile range and standard deviation) by b.
2. Adding the same number a (positive or negative) to each observation
adds a to measures of center and to quartiles and other
percentiles but does not change measures of spread.
For example, if x has mean x̄, the transformed variable $x_{\text{new}}$ has mean
a + bx̄.
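These rules can be checked numerically. The sketch below applies the Fahrenheit to Celsius transformation (a = −160/9, b = 5/9) from Example 1.20 to a short list of temperatures; the five temperature values are hypothetical, chosen only for illustration, and the function name is our own.

```python
import math
import statistics


def linear_transform(observations, a, b):
    """Apply x_new = a + b*x to every observation."""
    return [a + b * x for x in observations]


a, b = -160 / 9, 5 / 9
fahrenheit = [50, 68, 77, 95, 104]  # hypothetical readings, for illustration
celsius = linear_transform(fahrenheit, a, b)

# Measures of center become a + b*m.
print(math.isclose(statistics.mean(celsius), a + b * statistics.mean(fahrenheit)))
print(math.isclose(statistics.median(celsius), a + b * statistics.median(fahrenheit)))
# Measures of spread are multiplied by b; the shift a has no effect on them.
print(math.isclose(statistics.stdev(celsius), b * statistics.stdev(fahrenheit)))
```

All three comparisons print `True`. The comparison uses `math.isclose` rather than `==` because the transformed values are floating-point numbers.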