Chapter 4 - Averages and Standard Deviation
PART II : DESCRIPTIVE STATISTICS
Dr. Joseph Brennan
Math 148, BU
Description of Distributions
Similar to chapter 3, we will only be handling variables that are
quantitative in nature.
To describe the distribution of a quantitative variable we should specify :
The overall shape of the distribution:
Number of modes,
Types of Symmetry,
Types of Skew.
Numerical descriptions of the distribution. These are measures of
Center,
Spread.
Measures of Center and Spread
We will consider 3 measures of the center of a distribution :
the mode,
the average (mean),
the median.
We will also discuss 2 measures of the spread of a distribution :
the standard deviation,
the interquartile range.
Notation
Suppose we have a data set which consists of n observations.
Denote observations by x1 , x2 , . . . , xn .
Here x1 is the first observation and xn is the n-th (last) observation.
The subscripts on the observations, xi , are just a way of keeping the n
observations distinct. They do not necessarily indicate order or any
other special facts about the data.
The Mode
The mode is the value that occurs most frequently in a given data set. A
mode can be visually determined from a histogram, as it will coincide with
a peak. There may be several modes!
The Mean (Average)
The Mean
The mean is the numerical center of data. It is the common average by
which you have been graded since childhood.
Typically, the mean of a set of data will be denoted by x̄. The mean of a
population (found through a census) is denoted µ.
The mean x̄ for a set of observations is determined by adding all values
together and dividing by the number of observations. Typically, the
number of observations will be denoted by n.
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i . \]
The Σ (capital Greek sigma) in the above formula is short for “add them all up”.
Example 1 (Student Height)
The heights (in inches) of 10 students are given below :
71, 70, 68, 69, 68, 65, 72, 69, 71, 62.
What is a mode height? There happen to be three: 68, 69, and 71.
What is the mean height?
\[ \bar{x} = \frac{71 + 70 + 68 + 69 + 68 + 65 + 72 + 69 + 71 + 62}{10} = 68.5 \text{ (inches)} \]
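As a quick check of Example 1, here is a minimal Python sketch using the standard library's statistics module. The variable names are illustrative and not part of the slides.

```python
from statistics import mean, multimode

heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]  # inches, from Example 1

# multimode returns every value tied for the highest frequency
modes = sorted(multimode(heights))
x_bar = mean(heights)

print(modes)  # [68, 69, 71]
print(x_bar)  # 68.5
```

Using multimode rather than mode matters here because the data set has several modes.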
Example 2 (Temperature)
A biological experiment takes place in an orchard. The outside
temperature (in degrees Fahrenheit) is taken every hour. The first 5
successive measurements,
63, 66, 69, 70, and 75,
were taken at 8, 9, 10, and 11 a.m. and 12 noon, respectively.
What is the mode temperature? There isn’t one: no value occurs more than once.
What is the mean temperature?
\[ \bar{x} = \frac{63 + 66 + 69 + 70 + 75}{5} = 68.6 \text{ (degrees Fahrenheit)} \]
Example 2 (Temperature)
Suppose that we recorded the last number wrong and accidentally
recorded a temperature 105 instead of 75. How will the average change?
\[ \bar{x}_{\text{error}} = \frac{63 + 66 + 69 + 70 + 105}{5} = 74.6 \]
We can track exactly how the mean found in error differs from the true mean:
\[ \bar{x}_{\text{error}} = \frac{63 + 66 + 69 + 70 + 75}{5} + \frac{30}{5} = \bar{x}_{\text{true}} + 6 \]
A single wrong temperature (an outlier!) has shifted the mean
temperature up by 6 degrees!
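The arithmetic above can be reproduced in a few lines of Python. This is only an illustrative sketch; the names temps_true and temps_error are assumptions introduced here.

```python
temps_true = [63, 66, 69, 70, 75]
temps_error = [63, 66, 69, 70, 105]  # last value misrecorded

n = len(temps_true)
mean_true = sum(temps_true) / n
mean_error = sum(temps_error) / n

# An error of 30 in a single observation moves the mean by 30 / n.
shift = (105 - 75) / n

print(mean_true, mean_error, shift)  # 68.6 74.6 6.0
```

The shift depends on the size of the error and on n: the same mistake in a data set of 500 readings would move the mean by only 0.06 degrees.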
Weakness of the Average
Example 2 illustrates an important weakness of the average.
The average is sensitive to the influence of extreme
observations.
Extreme observations may be outliers, but skewness alone is enough: in a
skewed distribution with no outliers, the mean is still pulled towards the
long tail. This will be discussed in detail later. In statistical language
we say that the average is not a robust measure of the center.
Properties of the Average
1. The average is always between the smallest and the largest number
   in the data set.
2. The average is the center of gravity of the histogram.
3. The average is not resistant (robust) to outliers or to skewness of
   a distribution. The average shifts towards the long tail of the
   distribution.
4. The average x̄ estimates the (unknown) population mean µ.
5. The average value x̄ is the best single-number predictor for a future
   value of a variable, in the sense that it minimizes the average squared
   prediction error over the observed data.
The Median
The median is the midpoint of a distribution. For a data set the median is
the number such that half of observations are smaller and the other half
are larger. Typically, the median will be denoted x̃.
To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest. Be
   sure to list all observations, even if the same values are repeated
   several times.
2. If the number of observations n is odd, the median x̃ is the center
   observation in the ordered list. The location of the median can be
   found by counting (n + 1)/2 observations up from the bottom of the
   list.
   If the number of observations n is even, the median x̃ is any number
   between the two center observations in the ordered list.
3. When n is even, we will usually take the median x̃ to be the average of
   the two center observations in the ordered list.
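The procedure above can be sketched as a short Python function. This is a minimal illustration, not a reference implementation; the standard library's statistics.median follows the same convention for even n.

```python
def median(values):
    """Median via the ordered-list procedure: the middle value for odd n,
    the average of the two middle values for even n."""
    ordered = sorted(values)  # step 1: sort, keeping repeated values
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the (n + 1)/2-th observation, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([63, 66, 69, 70, 75]))                      # 69
print(median([71, 70, 68, 69, 68, 65, 72, 69, 71, 62]))  # 69.0
print(median([63, 66, 69, 70, 105]))                     # 69
```

The last call replaces 75 by 105 in the temperature data and returns the same median, since only the middle of the ordered list matters.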
Example 1 (Height) and 2 (Temperature) Revisited
Example 1: The sample size for the height of students is n = 10, which is even.
The median (69) is the average of the two middle observations in the ordered list:
62 65 68 68 69 69 70 71 71 72
Example 2: The sample size for the temperature is n = 5, which is odd. The
median is the 3rd observation in the ordered list:
63 66 69 70 75
When the last temperature was recorded wrong (105 instead of 75), the median is again 69:
63 66 69 70 105
As we can see, this outlier does not change the median!
The median is resistant (robust) to the influence of extreme
observations.
Properties of the Median
1. The median is always between the smallest and the largest number in
   the data set.
2. The median is the value which divides the area of the histogram in
   half. The area of the histogram to the left of the median is equal to
   the area of the histogram to the right of the median.
3. The median is resistant (robust) to the influence of extreme
   observations and to the skewness of a distribution.
What Measure of Center is Applicable?
Different measures of center are appropriate in different situations.
Consider several examples:
1. A shoe store is interested in which shoe size is in greatest demand.
   This question is about the mode of the distribution of shoe sizes.
2. An economist studying household incomes is interested in the middle
   income value and wants to de-emphasize the impact of the
   few very high incomes that are typically present in such a data set.
   The economist is interested in the median income.
3. An instructor is asked what the average score on the test was.
   The instructor is interested in the mean grade.
What Measure of Center is Applicable?
It is important to note that there is no unique notion of center! A proper
measure of center should be chosen based on the study question
and on the available data.
Data whose distribution is roughly symmetric has a median, x̃, and
mean, x̄, close together.
Note that for a perfectly symmetric distribution the mean equals the
median. Real data sets, however, never have perfectly symmetric
distributions.
Data whose distribution is highly skewed (either left or right) has a
noticeably separate median and mean.
Note that for a highly skewed distribution the median is the preferred
center: unlike the mean, it is robust and is barely affected by outliers.
Center Measurements on a Histogram
Figure 5. Measures of center.
Measures of Spread
Example 3
Consider two sets of data :
Set A: 48, 49, 51, 53, 45, 47, 55, 50, 51, 51
Set B: 10, 65, 17, 89, 100, 40, 21, 99, 34, 25
Figure 6. Histograms for Sets A and B.
Both sets have the same mean of 50, but the spread of the distribution in
set B is much greater than it is in set A.
The Sample Standard Deviation
The sample standard deviation measures the spread by looking at how far
the observations are from their average x̄. Typically, the standard
deviation will be denoted by s.
The formula for the standard deviation is:
\[ s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
NOTE: There is an alternative way to compute s, which is more efficient
in some cases:
\[ s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \sqrt{\overline{x^2} - \bar{x}^2}, \]
where \overline{x^2} is the average of the squared data values.
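To see that the two formulas agree, here is a small Python sketch applied to Sets A and B from Example 3. The function names are made up for illustration. Note that both follow the slides in dividing by n; many textbooks and software defaults (for instance Python's statistics.stdev) divide by n − 1 instead, while statistics.pstdev divides by n.

```python
from math import sqrt

def sd_definition(xs):
    """Root of the mean squared deviation from the average (divides by n)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sqrt(sum((x - x_bar) ** 2 for x in xs) / n)

def sd_shortcut(xs):
    """Root of (mean of squares) minus (square of the mean)."""
    n = len(xs)
    x_bar = sum(xs) / n
    mean_of_squares = sum(x * x for x in xs) / n
    return sqrt(mean_of_squares - x_bar ** 2)

set_a = [48, 49, 51, 53, 45, 47, 55, 50, 51, 51]
set_b = [10, 65, 17, 89, 100, 40, 21, 99, 34, 25]

for name, data in (("A", set_a), ("B", set_b)):
    print(name, round(sd_definition(data), 2), round(sd_shortcut(data), 2))
# A 2.76 2.76
# B 33.4 33.4
```

Both sets have mean 50, yet the standard deviation of Set B is roughly twelve times that of Set A. In floating-point arithmetic the shortcut can lose precision when the mean is large relative to the spread, so the definition is usually the safer choice in code.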
Interpreting the Standard Deviation
\[ s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
1. Begin with xi − x̄:
   First compute the sample mean x̄.
   Second, for each data point xi, record the difference between xi and x̄.
   xi − x̄ measures how far a data point deviates from the mean.
   Deviations may be positive or negative. However, the sum of the
   deviations is always zero.
2. Proceed to (xi − x̄)²:
   Squaring the deviations makes them nonnegative, so they can no longer cancel.
3. Proceed to \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2:
   We now find the mean of the squared deviations. Do not confuse this
   mean with the sample mean x̄.
4. Finish by taking a square root, which undoes the squaring and returns
   to the original units.
Properties of the Sample Standard Deviation
1. s measures spread about the mean. Among the measures of center, the
   standard deviation is tied only to the mean, so use s when the mean is
   the chosen center.
2. s = 0 only when there is no spread. This happens only when all the
   observations are identical. Otherwise s is positive. As observations
   become more spread out about their mean, s gets larger.
3. s, like the average x̄, is not robust. Distributions with outliers and
   strongly skewed distributions have large standard deviations.
4. s has the same unit of measurement as the original observations.
5. For bell-shaped distributions, s can be interpreted as the typical
   deviation of an observation from the mean.
Example 2 (Temperature)
Let us compute the standard deviation for the original data set: 63, 66,
69, 70, 75.
Step 1. Compute the average x̄: x̄ = 68.6 (completed earlier).
Step 2. Compute the deviations: see column 2 below.
Step 3. Square the deviations: see column 3 below.

Observation   Deviation          Squared deviation
63            63-68.6 = -5.6     31.36
66            66-68.6 = -2.6      6.76
69            69-68.6 =  0.4      0.16
70            70-68.6 =  1.4      1.96
75            75-68.6 =  6.4     40.96
Sum                      0       81.2
Example 2 (Temperature)
Step 4. Average squared deviations:
\[ \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{5}(31.36 + 6.76 + 0.16 + 1.96 + 40.96) = 16.24 \]
Step 5. Take the square root of the averaged squared deviations:
\[ s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{16.24} \approx 4.03 \]
We have computed that the mean temperature is 68.6 degrees Fahrenheit,
while the standard deviation among the data points is 4.03 degrees
Fahrenheit.
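The five steps can be mirrored line by line in Python. This is an illustrative sketch with made-up variable names; it divides by n, as the slides do.

```python
from math import sqrt

temps = [63, 66, 69, 70, 75]  # degrees Fahrenheit, from Example 2
n = len(temps)

x_bar = sum(temps) / n                   # Step 1: the average
deviations = [x - x_bar for x in temps]  # Step 2: deviations from the average
squared = [d ** 2 for d in deviations]   # Step 3: squared deviations
mean_sq = sum(squared) / n               # Step 4: average squared deviation
s = sqrt(mean_sq)                        # Step 5: square root

print(round(mean_sq, 2), round(s, 2))  # 16.24 4.03
```

Rounding is applied only when printing, so floating-point noise in the intermediate lists does not affect the result.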
Example 2 (Temperature)
Now let us recompute the standard deviation for the case when the last
observation was recorded wrong (105 instead of 75). We have already
calculated the mean for this case, x̄error = 74.6. Computing:
Observation   Deviation             Squared deviation
63            63-74.6 = -11.6        134.56
66            66-74.6 =  -8.6         73.96
69            69-74.6 =  -5.6         31.36
70            70-74.6 =  -4.6         21.16
105           105-74.6 = 30.4        924.16
Sum                        0        1185.2
Examples
In this case, when the last observation is recorded wrong,
\[ s_{\text{error}} = \sqrt{\frac{1185.2}{5}} \approx 15.40, \]
which is almost 4 times greater than the standard deviation for the
original data set.
Remember: the standard deviation, like the mean, is NOT robust.
Example 5 What is the standard deviation for the data
5, 5, 5, 5, 5, 5, 5, 5, 5, 5 ?
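A short Python sketch can confirm the comparison and answer Example 5. It uses statistics.pstdev, which divides by n like the formula in these slides (statistics.stdev, which divides by n − 1, would give different numbers). The variable names are illustrative.

```python
from statistics import pstdev

s_true = pstdev([63, 66, 69, 70, 75])
s_error = pstdev([63, 66, 69, 70, 105])  # last value misrecorded
s_const = pstdev([5] * 10)               # Example 5: ten identical values

print(round(s_true, 2), round(s_error, 2), round(s_error / s_true, 2))  # 4.03 15.4 3.82
print(s_const == 0)  # True
```

For Example 5 every deviation from the mean is zero, so s = 0, in line with the property that s vanishes exactly when all observations are identical.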
Visual Intuition
There are physically intuitive interpretations of the mean, median, and mode.
Consider the outline of any density histogram. The peaks are
the modes and are the easiest to spot.
The median is the place where a vertical line splits the area under the
density histogram into two equal halves.
The mean is the center of gravity of the histogram, thought of as a physical mass.
Visual Intuition
To illustrate the point that the mean plays the role of the center of gravity,
and why it can differ from the median, consider two density histograms which
have similar shapes but are still different:
[Figure: two density histograms, each with its mean marked.]
Although the median stays the same, the mean moves to the right as the
second peak moves to the right!
Percentiles
The p-th percentile of the data is a value such that p percent of the
observations fall at or below it.
The median is the 50th percentile.
The most commonly used percentiles other than the median are the
quartiles.
The first quartile, Q1, is the 25th percentile, and the third quartile, Q3,
is the 75th percentile. The second quartile is the median
itself.
NOTE: About 50% of the observations are located between the quartiles Q1
and Q3.
Percentiles, and quartiles in particular, are useful numerical characteristics
of the data distribution. The quartiles are used to compute the
interquartile range.
The Interquartile Range
The interquartile range (IQR) is the distance between the first and third
quartiles:
IQR = Q3 − Q1.
To calculate the quartiles:
Arrange the observations in increasing order.
If the number of observations n is even, split the ordered data set
into 2 parts. Find the first quartile Q1 as the median of the first half
of the data set. Similarly, the third quartile Q3 is the median of the
second half of the original data set.
If the number of observations n is odd, split the data set in two
halves by excluding the central value (the median). After that find Q1
and Q3 as the medians of the corresponding halves.
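The quartile procedure above can be sketched in Python as follows. The helper names are hypothetical, and the function assumes at least two observations. Statistical software often uses other quartile conventions (for example, interpolation between observations), so library results may differ slightly from this median-of-halves method.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves; when n is
    odd the central value is excluded from both halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(upper)

heights = [71, 70, 68, 69, 68, 65, 72, 69, 71, 62]  # n even
temps = [63, 66, 69, 70, 75]                        # n odd

for data in (heights, temps):
    q1, q3 = quartiles(data)
    print(q1, q3, q3 - q1)
# 68 71 3
# 64.5 72.5 8.0
```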
Example 1 (Heights)
The heights (in inches) of 10 students are given below:
71, 70, 68, 69, 68, 65, 72, 69, 71, 62.
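Notice that we are dealing with an even number of observations, so the ordered list splits into two halves of five:
62 65 68 68 69 | 69 70 71 71 72
Q1 = 68 and Q3 = 71, so IQR = 71 − 68 = 3 inches.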
Example 2 (Temperature)
A biological experiment takes place in an orchard. The outside
temperature (in degrees Fahrenheit) is taken every hour for 5 hours:
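63, 66, 69, 70, and 75.
Notice that we are dealing with an odd number of observations, so the median (69) is excluded before splitting:
63 66 | 70 75
Q1 = (63 + 66)/2 = 64.5 and Q3 = (70 + 75)/2 = 72.5, so IQR = 72.5 − 64.5 = 8 degrees Fahrenheit.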
Summary of Center and Spread
For the distribution of a quantitative variable we now have multiple
ways to describe the center and spread:
Center:
Mean,
Median,
Mode.
Spread:
Standard Deviation,
Interquartile Range.
Linear Transformations
Suppose we have a set of numbers which we want to transform to another
set of numbers which will have different units of measurement. We will
consider only the linear transformations of variables, which have the
following form :
xnew = a + bx,    (1)
where xnew is the variable in new units, x is the old variable, and a and b
are numbers.
The key word is linear. Linear transformations graph as a straight line
with y-intercept a and slope b.
Examples of Linear Transformations
The distance in kilometers is translated into distance in miles using
the formula
M = 0.62K,    (2)
where M is the distance in miles, and K is the distance in kilometers.
For instance, a 10-kilometer race covers 6.2 miles.
The temperature in degrees Fahrenheit is translated into temperature
in degrees Celsius as
\[ C = \frac{5}{9}(F - 32) = -\frac{160}{9} + \frac{5}{9}F, \qquad (3) \]
where C is the temperature in degrees Celsius, and F is the
temperature in degrees Fahrenheit. For instance, 95◦ F translates into
35◦ C while −40◦ F translates into −40◦ C.
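Both conversions fit the form xnew = a + bx. The following Python sketch, with function names chosen only for illustration, writes them out and checks the sample values quoted above.

```python
def km_to_miles(k):
    # Transformation (2): a = 0, b = 0.62
    return 0.62 * k

def fahrenheit_to_celsius(f):
    # Transformation (3): a = -160/9, b = 5/9
    return 5 / 9 * (f - 32)

def linear(x, a, b):
    # General linear transformation (1)
    return a + b * x

print(round(km_to_miles(10), 2))             # 6.2
print(round(fahrenheit_to_celsius(95), 2))   # 35.0
print(round(fahrenheit_to_celsius(-40), 2))  # -40.0
print(round(linear(95, -160 / 9, 5 / 9), 2)) # 35.0
```

The last line confirms that the factored form (5/9)(F − 32) and the intercept-slope form −160/9 + (5/9)F give the same result.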
The Effect of a Linear Transformation.
How do the numerical measures of center and spread change after a
linear transformation?
We will separately consider 2 special cases of linear transformation:
Data Shifts: a special case of the transformation (1) when b = 1:
xnew = x + a,
which corresponds to adding a constant a to every observation.
Scale Changes: a special case of the transformation (1) when a = 0, b ≠ 0:
xnew = bx,
which corresponds to multiplying each observation by a constant b (positive
or negative). Transformation (2) is an example of a scale transformation.
Effects of a Shift Transformation
If we act upon a data set by a shift transformation:
xnew = x + a,
the measures of center and spread change as follows:

Mean:                  x̄new = x̄ + a
Median:                x̃new = x̃ + a
1st Quartile:          Q1,new = Q1 + a
3rd Quartile:          Q3,new = Q3 + a
Standard Deviation:    snew = s
Interquartile Range:   IQRnew = IQR
Example (Football team)
Wrong Scale: Every player on a high school football team was weighed
using the same scale. It was later discovered that the scale read 10 lb
under, so we need to add 10 lb to every weight. What would happen to
each of the following measurements?

Characteristic        Original   After Adjustment
Average               230 lb     240 lb
Median                240 lb     250 lb
Q1                    200 lb     210 lb
Q3                    280 lb     290 lb
Standard Deviation    50 lb      50 lb
IQR                   80 lb      80 lb
Effects of a Scale Transformation
If we act upon a data set by a scale transformation:
xnew = bx,
the measures of center and spread change as follows:

Mean:                  x̄new = bx̄
Median:                x̃new = bx̃
1st Quartile:          Q1,new = bQ1
3rd Quartile:          Q3,new = bQ3
Standard Deviation:    snew = |b| · s
Interquartile Range:   IQRnew = |b| · IQR

where | · | denotes absolute value. The quartile formulas are stated for b > 0;
if b < 0 the order of the data reverses, so Q1,new = bQ3 and Q3,new = bQ1.
Example 7 (Football team)
Now suppose we found out that we are supposed to report weights and all
the summary measures in kilograms, not in pounds! Recall that 1 lb ≈
0.4536 kilograms. What would happen to each of the following
measurements?

Characteristic        Original   After Adjustment (rounded)
Mean                  230 lb     104 kg
Median                240 lb     109 kg
Q1                    200 lb     91 kg
Q3                    280 lb     127 kg
Standard Deviation    50 lb      23 kg
IQR                   80 lb      36 kg
Effects of a General Linear Transformation
If we act upon a data set by any linear transformation:
xnew = bx + a,
the measures of center and spread change as follows:

Mean:                  x̄new = bx̄ + a
Median:                x̃new = bx̃ + a
1st Quartile:          Q1,new = bQ1 + a
3rd Quartile:          Q3,new = bQ3 + a
Standard Deviation:    snew = |b| · s
Interquartile Range:   IQRnew = |b| · IQR

where | · | denotes absolute value. The quartile formulas are stated for b > 0;
if b < 0 the order of the data reverses, so Q1,new = bQ3 + a and Q3,new = bQ1 + a.
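These rules can be checked numerically. The sketch below converts the temperature data of Example 2 from Fahrenheit to Celsius (a = −160/9, b = 5/9) and compares summaries computed before and after. It relies on statistics.pstdev, which divides by n as in these slides; the variable names are illustrative.

```python
from math import isclose
from statistics import mean, median, pstdev

temps_f = [63, 66, 69, 70, 75]
a, b = -160 / 9, 5 / 9
temps_c = [a + b * x for x in temps_f]

# Center measures follow the transformation itself.
mean_ok = isclose(mean(temps_c), a + b * mean(temps_f))
median_ok = isclose(median(temps_c), a + b * median(temps_f))

# Spread ignores the shift a and scales by |b|.
sd_ok = isclose(pstdev(temps_c), abs(b) * pstdev(temps_f))

# With a negative multiplier the spread still scales by |b|, never by b.
flipped = [-2 * x for x in temps_f]
sd_flip_ok = isclose(pstdev(flipped), 2 * pstdev(temps_f))

print(mean_ok, median_ok, sd_ok, sd_flip_ok)  # True True True True
print(round(mean(temps_c), 2), round(pstdev(temps_c), 2))  # 20.33 2.24
```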
The Empirical Rule
As a general rule, if the data distribution is unimodal, roughly symmetric,
and bell-shaped, then:
Approximately 68% of the observations fall within one standard
deviation of the average, i.e., approximately 68% of data values are
between x̄ − s and x̄ + s.
Approximately 95% of the observations fall within 2 standard
deviations of the average, i.e., approximately 95% of data values are
between x̄ − 2s and x̄ + 2s.
Approximately 99.7% of the observations fall within 3 standard
deviations of the average, i.e., approximately 99.7% of data values
(almost all the observations) are between x̄ − 3s and x̄ + 3s.
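The three percentages come from the normal curve. As a sketch, assuming an exactly normal (Gaussian) distribution, which is an idealization of "bell-shaped" that the slides do not spell out, Python's statistics.NormalDist reproduces them:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Share of the distribution within k standard deviations of the mean
shares = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(k, round(100 * share, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

Real data are never exactly normal, so observed percentages will only be close to these values.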
The Empirical Rule
Figure 1. Empirical Rule Illustration.
WARNING: The Empirical Rule Strikes Back
Don’t throw caution (and, perhaps, lightsabers) to the wind while using
the empirical rule! Keep in mind:
The Empirical Rule does NOT give the exact percentages of
observations within one, two, or three standard deviations, just
approximate percentages.
The Empirical Rule works well only for symmetric bell-shaped
histograms.
The Empirical Rule will not be far off for lightly skewed distributions,
but it can be very wrong for moderately or heavily skewed
distributions.
Example (HANES5 study)
HANES5 (The Health and Nutrition Examination Survey) was a study
done in 2003-2004 recording the height (in inches) of women (see p. 58).
The mean height was x̄ = 63.5 inches and the standard deviation s = 3
inches.
The histogram (found on the next slide) is approximately symmetric and
bell-shaped, so the Empirical Rule should hold.
Example (HANES5 study)
The shaded region corresponds to the
women with height within 1 standard
deviation of the average:
x̄ ± s = 63.5 ± 3 = (60.5, 66.5)
The true area of the shaded region is
72%, which is fairly close to 68%.

The shaded region corresponds to the
women with height within 2 standard
deviations of the average:
x̄ ± 2s = 63.5 ± 2 · 3 = (57.5, 69.5)
The true area of the shaded region is
97%, which is fairly close to 95%.