1.2: Describing Distributions with Numbers
A. Measures of Location (Center)
• Mean
• Median
B. Measures of Spread (Variability)
• Quartiles (Quantiles)
• Variance and Standard deviation
Measures of Location
1. Mean (Average)
How to find the mean (average):
1) Add the values together
2) Divide the total by the number of observations
• Example: Test Scores : 56, 65, 54, 55, 57, 54, 61, 62, 60, 55, 57,
56, 57, 61, 62, 60, 49, 66, 59, 80
Step 1 : 56 + 65 + 54 + …… + 59 + 80 = 1186
Step 2 : 1186 / 20 = 59.3
Mean
To find the mean $\bar{x}$ of a set of observations, add their values and
divide by the number of observations. If the n observations are
$x_1, x_2, x_3, \ldots, x_n$, their mean is :
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$
Or, in more compact notation:
$$\bar{x} = \frac{1}{n} \sum x_i$$
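The two steps for the mean can be checked with a short Python sketch. The variable names are illustrative only; the data are the test scores from the example above.

```python
# Test scores from the example above.
scores = [56, 65, 54, 55, 57, 54, 61, 62, 60, 55,
          57, 56, 57, 61, 62, 60, 49, 66, 59, 80]

total = sum(scores)          # Step 1: add the values together
mean = total / len(scores)   # Step 2: divide by the number of observations

print(total, mean)
```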
2. Median
How to find the median M :
1) Arrange the observations in order from smallest to largest.
2) If the number of observations is odd, then the median is
located at the center of the list. So, if there are n observations,
then the median is located in spot (n + 1) / 2
3) If the number of observations is even, then the median is
the average of the two terms in the middle spots. These are
located in spots (n / 2) and (n / 2) + 1
Median
Example of finding a Median :
List 1 : 2, 4, 6, 3, 5, 2, 6, 8, 10, 11, 1
Step 1: Order the list :
1, 2, 2, 3, 4, 5, 6, 6, 8, 10, 11
Step 2 : Find the middle term : (n + 1) / 2 = (11 + 1) / 2 = 6
The 6th value in the ordered list is 5, so the median is M = 5.
Median
Example of finding a Median :
List : 2, 4, 6, 3, 5, 2, 6, 8, 10, 11, 1, 12
Step 1: Order the list :
1, 2, 2, 3, 4, 5, 6, 6, 8, 10, 11, 12
Step 2 : Find the two middle spots :
n / 2 = 12 / 2 = 6
(n / 2) + 1 = (12 / 2) + 1 = 7
Step 3 : Average the sixth and seventh terms (5 and 6) :
Median = (5 + 6) / 2 = 5.5
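The odd and even cases of the median rule can be sketched as one small function. This is only an illustration of the steps above; the function name `median` is our own choice, and positions are converted from the 1-based "spots" used in the slides to Python's 0-based indexes.

```python
def median(values):
    """Median by the rule above: middle spot if n is odd,
    average of the two middle spots if n is even."""
    ordered = sorted(values)              # Step 1: order the list
    n = len(ordered)
    if n % 2 == 1:
        # Spot (n + 1) / 2, counted from 1, is index (n + 1) // 2 - 1.
        return ordered[(n + 1) // 2 - 1]
    # Spots n / 2 and (n / 2) + 1, counted from 1.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

odd_list = [2, 4, 6, 3, 5, 2, 6, 8, 10, 11, 1]
even_list = [2, 4, 6, 3, 5, 2, 6, 8, 10, 11, 1, 12]
print(median(odd_list), median(even_list))
```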
In The Presence Of Outliers
Q: Do outliers affect the Mean and Median?
Consider the list of numbers from 1 through 9 :
1, 2, 3, 4, 5, 6, 7, 8, 9
The Mean is : 5
The Median is : 5
What if we put the number 100 at the end of the list :
1, 2, 3, 4, 5, 6, 7, 8, 9, 100
The Mean is : 14.5
The Median is : 5.5
A: Outliers affect the Mean much more than the Median !
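A quick way to see this is to compute both measures before and after adding the outlier. The sketch below uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

without_outlier = list(range(1, 10))      # 1 through 9
with_outlier = without_outlier + [100]    # add the outlier 100

# The mean jumps; the median barely moves.
print(mean(without_outlier), median(without_outlier))
print(mean(with_outlier), median(with_outlier))
```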
Distributions
The mean is the point at which a histogram balances. For
symmetric distributions the mean and median will be nearly
the same.
However, since the mean is influenced by outliers, for
skewed distributions the mean will be pulled in the
direction of the long tail while the median will be resistant
to the outliers and remain in nearly the same place.
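[Figure: two skewed density curves. Skewed Right: the median M lies to the left of the mean. Skewed Left: the mean lies to the left of the median M.]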
Describing Spread
The Five Number Summary :
1) The Median
2) First Quartile : 25% of the observations lie
below the First Quartile
3) Third Quartile : 75% of the observations lie
below the third quartile
4) Lowest Individual Observation (Minimum)
5) Highest Individual Observation (Maximum)
Quartiles
Calculating the Quartiles :
1) Arrange the observations in increasing order and locate the Median
M in the ordered list of observations.
2) The First Quartile Q1 is the median of the observations whose
position in the ordered list is to the left of the location of the
overall median.
3) The Third Quartile Q3 is the median of the observations whose
position in the ordered list is to the right of the location of the
overall median.
Quartiles
Example of calculating First Quartile :
List of quiz scores: 10, 8, 9, 4, 6, 6, 8, 9, 2, 7
1) Order the list: 2, 4, 6, 6, 7, 8, 8, 9, 9, 10
Find the median: (7 + 8) / 2 = 7.5
2) Find all the observations whose position in the list is to the
left of the median : 2, 4, 6, 6, 7
Find the median of these values : Q1 = 6
Quartiles
Example of calculating Third Quartile :
List of quiz scores: 10, 8, 9, 4, 6, 6, 8, 9, 2, 7, 11
1) Order the list: 2, 4, 6, 6, 7, 8, 8, 9, 9, 10, 11
Find the median: 8
2) Find all the observations whose position in the list is to the
right of the median : 8, 9, 9, 10, 11
Find the median of these values : Q3 = 9
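The quartile rule above (median of the lower half, median of the upper half) can be sketched in Python as follows. The function name `five_number_summary` is our own. Library routines for quantiles often use other conventions and may give slightly different quartiles, so this sketch implements the slide's rule directly.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # positions left of the overall median
    if n % 2 == 1:
        upper = ordered[half + 1:]    # odd n: skip the middle observation
    else:
        upper = ordered[half:]        # even n: the upper half starts at half
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

quiz_10 = [10, 8, 9, 4, 6, 6, 8, 9, 2, 7]
quiz_11 = [10, 8, 9, 4, 6, 6, 8, 9, 2, 7, 11]
print(five_number_summary(quiz_10))
print(five_number_summary(quiz_11))
```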
Interquartile Range
The interquartile range , IQR, is the distance between the first
quartile and the third quartile.
Determining Outliers
Call an observation a suspected outlier if it falls more than
1.5 * IQR above the third quartile or below the first quartile.
Example : Imagine we have a bunch of test scores with Q1 = 50 and
Q3 = 80.
The IQR = 80 - 50 = 30
So, 1.5 * IQR = 1.5 * 30 = 45
This means that if there are any scores above Q3 + 45 = 125
or any scores below Q1 - 45 = 5, then these scores are suspected outliers.
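The 1.5 * IQR rule can be sketched as a pair of small helpers. The function names and the sample scores in the last line are hypothetical and used only for illustration; Q1 = 50 and Q3 = 80 come from the example above.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def suspected_outliers(values, q1, q3):
    """List the observations that fall outside the cutoffs."""
    low, high = outlier_fences(q1, q3)
    return [x for x in values if x < low or x > high]

print(outlier_fences(50, 80))
# Hypothetical scores, just to exercise the rule.
print(suspected_outliers([3, 50, 64, 80, 130], 50, 80))
```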
Boxplot
Example: Low = 47, High = 98, Median = 77, Q1 = 65, Q3 = 85
• A Boxplot is a graph of the five number summary. A central box
spans the quartiles, with a line marking the median. Whiskers
extend out from the box to the extremes.
[Figure: vertical boxplot marking the Lowest Observation (47), Q1 (65), Median (77), Q3 (85), and Highest Observation (98)]
Describing Spread
2. The Standard Deviation
• Variance: The variance of a set of observations is an “average” of the
squared deviations of the observations from the mean.
• Note: You divide by (n - 1) instead of n.
• Standard Deviation: The SD is the square root of the variance.
Describing Spread
The Standard Deviation
Example : Test Scores : 65, 77, 83, 80, 95
1) Find the average : 80
2) Find the deviations from the mean, and their squares
Obs    Deviation from Mean    Deviation Squared
65     -15                    225
77     -3                     9
83     3                      9
80     0                      0
95     15                     225
Describing Spread
The Standard Deviation
3) Determine the mean of the squares, dividing by n - 1 :
Variance = (225 + 9 + 9 + 0 + 225) / (5 - 1) = 468 / 4 = 117
4) Determine the Standard Deviation:
Standard Deviation = √117 = 10.8
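The four steps of this example translate almost line for line into Python. This is a sketch with illustrative variable names; the standard library's `statistics.variance` and `statistics.stdev` also divide by n - 1 and are used here only as a cross-check.

```python
import math
import statistics

scores = [65, 77, 83, 80, 95]
n = len(scores)

mean = sum(scores) / n                     # 1) the average
deviations = [x - mean for x in scores]    # 2) deviations from the mean
squares = [d ** 2 for d in deviations]     #    and their squares
variance = sum(squares) / (n - 1)          # 3) divide by n - 1, not n
sd = math.sqrt(variance)                   # 4) square root of the variance

print(mean, variance, round(sd, 1))
# Cross-check against the standard library (sample variance and SD).
print(statistics.variance(scores), statistics.stdev(scores))
```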
More Fancy Notation
The variance $s^2$ of a set of observations is the average of the squares
of the deviations of the observations from their mean. In symbols,
the variance of n observations $x_1, x_2, \ldots, x_n$ is :
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
or, in more compact notation :
$$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2$$
The standard deviation s is the square root of the variance $s^2$ :
$$s = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}$$
Another Example of Standard Deviation
Consider the following years in our past :
1792, 1666, 1362, 1614, 1460, 1867, 1439
Find the standard deviation of these years.
The Mean = 1600
$x_i$     $x_i - \bar{x}$     $(x_i - \bar{x})^2$
1792      192                 36864
1666      66                  4356
1362      -238                56644
1614      14                  196
1460      -140                19600
1867      267                 71289
1439      -161                25921
Sum of the squared deviations = 214870
$$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 = \frac{1}{6}(214870) = 35811.67$$
s = 189.2
Why Do We Square The Deviations ?
1) The sum of the squared deviations of any set of observations from their
mean is the smallest that the sum of squared deviations from any number
can possibly be.
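A short derivation shows why this holds. For any number c, write each deviation from c as a deviation from the mean plus the constant $\bar{x} - c$:

$$\sum (x_i - c)^2 = \sum (x_i - \bar{x})^2 + 2(\bar{x} - c)\sum (x_i - \bar{x}) + n(\bar{x} - c)^2$$

Since the deviations from the mean sum to zero, the middle term vanishes:

$$\sum (x_i - c)^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - c)^2 \ge \sum (x_i - \bar{x})^2$$

with equality only when $c = \bar{x}$.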
Why use the Standard Deviation and not the Variance ?
1) The standard deviation is the natural measure of spread for an important
class of symmetric unimodal distributions called the normal distributions.
2) The standard deviation, together with the mean, completely specifies a
normal distribution.
3) The variance uses squared deviations, which gives a different unit
from the original data.
Why use n - 1 ?
1) The sum of the deviations is *always* zero. So, if we know n-1 of the
deviations, then the last deviation can be calculated. So, only n-1 of the
deviations can vary freely. These are called degrees of freedom.
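This can be checked numerically. The sketch below reuses the test scores 65, 77, 83, 80, 95 from the earlier standard deviation example; the variable names are illustrative.

```python
scores = [65, 77, 83, 80, 95]
mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]

# The deviations always sum to zero...
print(sum(deviations))

# ...so the last one is determined by the other n - 1.
known = deviations[:-1]
recovered_last = -sum(known)
print(recovered_last, deviations[-1])
```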
Properties of Standard Deviations
1) The standard deviation measures spread about the mean and should be
used only when the mean is chosen as the measure of center.
2) s = 0 only when there is no spread. This happens only when all
observations have the same value. Otherwise, s > 0. As the observations
get more spread out from the mean, then s gets larger.
3) s, like the mean, is not resistant. A few outliers can make s very large.
Which Measure To Use ?
Q: When is the mean better than median? When is the five number
summary better than the standard deviation?
Rules Of Thumb
A1: If outliers appear, or if your distribution is skewed, then the mean
could be affected, so use the median and the five number summary.
A2: If the distribution is reasonably symmetric and is free of outliers,
then the mean and standard deviation should be used.
Changing Units
Consider the following values : 30, 40, 50, 60, 70
The mean is 50 and the standard deviation is 15.8
What happens to these if we take every score, multiply it by 2 and add 10 ?
We get these values : 70, 90, 110, 130, 150
The mean is 110 and the standard deviation is 31.6
Changing Units
Old values : 30, 40, 50, 60, 70
mean = 50 and s = 15.8
What happens to these if we take every score, multiply it by 2 and add 10 ?
New values : 70, 90, 110, 130, 150
mean = 110 and s = 31.6
Linear Transformations
A linear transformation changes the original variable x into the new
variable $x_{new}$ given by an equation of the form :
$$x_{new} = bx + a$$
Note: The constant a shifts all values of x either up or down by the value
a. The constant b changes the size of the unit of the distribution.
Effects of Linear Transformations
1) To get the new spread, multiply the old spread by |b|.
2) To get the new mean, multiply the old mean by b and add the
constant a.
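Both effects can be verified on the values 30, 40, 50, 60, 70 with b = 2 and a = 10. This is a sketch using Python's standard `statistics` module, whose `stdev` divides by n - 1 as in these notes; the variable names are illustrative.

```python
import statistics

old = [30, 40, 50, 60, 70]
a, b = 10, 2
new = [b * x + a for x in old]            # x_new = bx + a

old_mean, old_sd = statistics.mean(old), statistics.stdev(old)
new_mean, new_sd = statistics.mean(new), statistics.stdev(new)

print(new)
print(old_mean, round(old_sd, 1))
print(new_mean, round(new_sd, 1))
# new mean = b * old mean + a ; new spread = |b| * old spread
print(b * old_mean + a, round(abs(b) * old_sd, 1))
```

Note that the constant a moves the mean but has no effect on the standard deviation.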
1.3: The Normal Distributions
Density Curves
A density curve is a curve that :
1) is always on or above the horizontal axis, and
2) has area exactly 1 underneath it.
A density curve describes the overall pattern of a
distribution. The area under the curve and above any range
of values is the relative frequency of all observations that
fall in that range.
Density Curves
[Figure: Normal and Skewed Curves, with the Median and Mean marked on each density curve]
Why are Normal Distributions important in stats?
1) Normal distributions are good descriptions for some
distributions of real data.
2) Normal distributions are good approximations to the results of many kinds
of chance outcomes.
3) Many statistical inference procedures based on normal
distributions work well for other roughly symmetric
distributions.
The 68 - 95 - 99.7 Rule
In the normal distribution with mean μ and standard deviation σ :
• 68 % of the observations fall within σ of the mean μ
• 95 % of the observations fall within 2σ of the mean μ
• 99.7 % of the observations fall within 3σ of the mean μ
Normal Curve Example
John collected data on the heights of women ages 18 to 24.
He found that the distribution was roughly normal, with a mean
of 64.5 inches and a standard deviation of 2.5 inches.
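Applying the 68 - 95 - 99.7 rule to John's data gives the ranges μ ± σ, μ ± 2σ and μ ± 3σ. The sketch below computes those ranges and, as a check, the exact normal areas using `statistics.NormalDist` from Python's standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)   # heights in inches

within = {}
for k in (1, 2, 3):
    low = heights.mean - k * heights.stdev
    high = heights.mean + k * heights.stdev
    share = heights.cdf(high) - heights.cdf(low)   # area between low and high
    within[k] = (low, high, share)
    print(f"within {k} SD: {low} to {high} inches, share = {share:.4f}")
```

The exact areas are slightly different from the rounded figures 68, 95 and 99.7, which is why the rule is stated as an approximation.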
Standardizing Observations
If x is an observation from a roughly symmetric distribution that has
mean μ and standard deviation σ, then the standard value of x is :
$$z = \frac{x - \mu}{\sigma}$$
Note: A standardized score is often called a z-score.
Example : Women’s IQ’s have a symmetric distribution with a
mean of 97 and a standard deviation of 6.
What is the standard score for a woman with an IQ of 106 ?
z = (106 - 97) / 6 = 9 / 6 = 1.5
Standardizing Observations
Example : Men’s IQ’s have a roughly symmetric distribution with a
mean of 72 and a standard deviation of 8.
What is the standard score for a man with an IQ of 66 ?
z = (66 - 72) / 8 = -6 / 8 = -0.75
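The standardizing formula is a one-line function. The sketch below applies it to the two IQ examples; the function name `z_score` is our own.

```python
def z_score(x, mean, sd):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

print(z_score(106, 97, 6))   # woman with an IQ of 106
print(z_score(66, 72, 8))    # man with an IQ of 66
```

A positive z-score means the observation is above the mean, a negative one means it is below.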
The Standard Normal Distribution
Example (continued) : The man with an IQ of 66 has standard score
z = (66 - 72) / 8 = -0.75
Q: What percentage of people have a score below 66 ?
The Standard Normal Table
Table A is a table of areas under the standard normal curve. The
table entry for each z value is the area under the curve to the left of z
The Standard Normal Table
Example : Imagine we have done an experiment, and we want to find
what percentage of people fell under a score, namely x.
We then find that the z-score for the value x is -1.10.
Table A gives an area of .1357 to the left of z = -1.10, so about 13.57% of
people fell under the score x.
The Standard Normal Table
Example : The Graduate Record Examinations (GRE) are widely
used to help predict the performance of applicants to graduate schools.
The range of possible scores on a GRE is 200 to 900. The psychology
department at a university finds the scores of its applicants on the
quantitative GRE are approximately normal with mean μ = 544 and
standard deviation σ = 103. Answer the following :
1) Find the percentage of people who scored 700 or higher on the test.
2) Find the percentage of people who scored below 500 on the test.
3) Find the percentage of people who scored between 500 and 800
on the test.
1) Find the percentage of people who scored 700 or higher on the test.
Find the percentage to the right of the 700 marker.
Find the z-score : z = (700 - 544) / 103 = 156 / 103 = 1.51
[Figure: normal curve with area .9345 to the left of z = 1.51 and area .0655 to the right]
P(X > 700) = P(Z > 1.51) = 1 - P(Z < 1.51) = 1 - .9345 = .0655
2) Find the percentage of people who scored below 500 on the test.
Find the percentage to the left of 500
Find the z-score : z = (500 - 544) / 103 = -44 / 103 = -0.43
[Figure: normal curve with area 0.3336 to the left of z = -0.43]
Answer : P(X < 500) = P(Z < -0.43) = 0.3336
3) Find the percentage of people who scored between 500 and 800
on the test.
Find the percentage between 500 and 800
Find the first z-score : z = (500 - 544) / 103 = -44 / 103 = -0.43
Find the second z-score : z = (800 - 544) / 103 = 256 / 103 = 2.49
[Figure: normal curve with area 0.9936 to the left of z = 2.49 and area 0.3336 to the left of z = -0.43]
Area = .9936 - .3336 = 0.66
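The three table lookups can be reproduced in Python with `statistics.NormalDist` (standard library, Python 3.8 or later), whose `cdf` method returns the area to the left of a value, just like Table A. This is a sketch: the helper name `table_z` is our own, and it rounds each z-score to two decimals only to mimic reading a printed table.

```python
from statistics import NormalDist

std_normal = NormalDist()    # standard normal: mean 0, SD 1
mu, sigma = 544, 103         # quantitative GRE scores from the example

def table_z(x):
    """z-score rounded to two decimals, as when using a printed table."""
    return round((x - mu) / sigma, 2)

p_above_700 = 1 - std_normal.cdf(table_z(700))         # 1) 700 or higher
p_below_500 = std_normal.cdf(table_z(500))             # 2) below 500
p_between = std_normal.cdf(table_z(800)) - p_below_500  # 3) 500 to 800

print(table_z(700), table_z(500), table_z(800))
print(round(p_above_700, 4), round(p_below_500, 4), round(p_between, 4))
```

Skipping the rounding step and calling `NormalDist(544, 103).cdf(x)` directly gives slightly different answers, because the z-scores are then not rounded to two decimals.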
Example : The Soup Nazi charges, on the average, $4.50
for a cup of soup, and if you’re lucky, some bread, with a
standard deviation of $0.45.
What is the probability that our check will be more
than $5.00 ?
Z = (5.00 - 4.50) / 0.45 = 1.11
[Figure: normal curve centered at 4.50 with area 0.8665 to the left of 5.00 and area 0.1335 to the right]
P(X > 5) = P(Z > 1.11) = 1 - 0.8665 = 0.1335, or 13.35 %
“Backward” Normal Calculations
• We could find the observed value (x) of a given
proportion in N(μ, σ) by unstandardizing the z-score.
1) State the problem
2) Draw a picture
3) Use the normal table to find the proportion closest
to the one you need
4) Read off the z-value
5) Unstandardize : x = μ + zσ
Example
Find the value of z such that the probability of being
less than z is .10.
1. z: P(Z < z) = .10
2. [Figure: standard normal curve (mean 0, σ = 1) with area .10 shaded to the left of z]
3. In the body of the normal table, find the closest
value to .10. Once found, determine the z value.
Closest is .1003, so z = -1.28, since P(Z < -1.28) = .1003
Example
Find the value of z such that the probability of being
greater than z is .33.
1. z: P(Z > z) = .33
2. z: P(Z < z) = 1 - .33 = .67
[Figure: standard normal curve (mean 0, σ = 1) with area .67 to the left of z and area .33 to the right]
3. In the body of the normal table, find the closest value
to .67. Once found, determine the z value.
The table contains .6700, so z = .44
P(Z > .44) = .33
Example
X = time Americans stir sugar into their iced tea
X ~ N(12.3, 3.1) seconds
(1) Find the percent of Americans who spend between 20 and
22 seconds stirring sugar into their iced tea.
i.e. P(20 < X < 22)
Find P(20 < X < 22) = P((20 - 12.3) / 3.1 < Z < (22 - 12.3) / 3.1)
= P(2.48 < Z < 3.13)
= P(Z < 3.13) - P(Z < 2.48)
= .9991 - .9934
= .0057
Example
X = time Americans stir sugar into their iced tea
X ~ N(12.3, 3.1)
(2) About 18.4% of Americans spend more than how many
seconds stirring sugar into their iced tea?
i.e. Find the value of X such that the probability of being
greater than this value is .184.
(1) z: P(Z > z) = .184
(2) z: P(Z < z) = 1 - .184 = .816
(3) From the normal table, z = 0.90
(4) So x = μ + zσ = 12.3 + 0.90(3.1) = 12.3 + 2.79 = 15.09
About 18.4% of Americans stir for more than 15.09 seconds.
Example
X = IQ scores
X ~ N(112, 9)
Find the IQ score that places you in the top 2%
of all scores.
1. z: P(Z > z) = .02
2. z: P(Z < z) = 1 - .02 = .98
3. From the normal table, z = 2.05
4. x = μ + zσ = 112 + 2.05 (9) = 130.45
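The "backward" calculations in the iced tea and IQ examples can be sketched with the `inv_cdf` method of `statistics.NormalDist` (standard library, Python 3.8 or later), which returns the z-value with a given area to its left. The helper name `value_with_area_below` is our own. Because no table rounding is involved, the results differ slightly from the table-based answers.

```python
from statistics import NormalDist

def value_with_area_below(p, mu, sigma):
    """Find x such that a proportion p of N(mu, sigma) lies below x."""
    z = NormalDist().inv_cdf(p)   # steps 3 and 4: look up z for area p
    return mu + z * sigma         # step 5: unstandardize, x = mu + z * sigma

# Iced tea: 18.4% stir longer, so 81.6% lie below the cutoff.
stir_cutoff = value_with_area_below(1 - 0.184, 12.3, 3.1)
# IQ: top 2%, so 98% lie below the cutoff.
iq_cutoff = value_with_area_below(1 - 0.02, 112, 9)

print(round(stir_cutoff, 2), round(iq_cutoff, 2))
```

The table answers (15.09 and 130.45) use z rounded to two decimals, so small differences in the second decimal place are expected.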
Exercise
The distribution of SAT Math scores is approximately
normally distributed with mean 500 and standard
deviation 100.
1. In what range do the middle 95% of all SAT Math
scores lie?
2. What proportion of SAT Math scores are
between 450 and 650?
3. If high school students having SAT Math scores
in the top 10% of all scores are eligible for a
certain scholarship, what is the lowest score a
person eligible for the scholarship can have?