Stat 1793
Section 3-3 Measures of Variation (variability, dispersion, spread)
How do the values in a data set vary amongst themselves? A data set with values
closer to each other has lower measures of variation than a set with values spread
farther apart. Measures of central tendency alone do not give the full picture.
Examples:
• A producer of time bombs aims for small variability: it would not be good for
  30-minute fuses to range from 10 to 50 minutes.
• A teacher interested in distinguishing between strong students and weak students
  aims to design assessments with large variability in results: it would not
  be helpful if all students scored exactly the same mark.
• Two basketball players may have the same average points scored per game but
  different ranges. If one player ranged from 0 to 52 points while the other
  ranged from 30 to 36 points, that says something about their consistency.
We are going to examine 4 measures of variation, with a fifth left to a later section:
1. Range
2. Variance
3. Standard Deviation
4. Coefficient of Variation
5. Interquartile Range (in a later section)
Range
• The difference between the greatest value and the least value in the data set.
• It is the simplest, most easily calculated measure of variation and gives
  some impression of variation.
• It can be misleading (especially for large sets of data) because it depends
  entirely on the two extreme values and is insensitive to the values in
  between.
• One use of the range is to evaluate samples with very few items. Some
  quality control techniques involve taking periodic small samples and
  basing further action on the range found in several such samples.
• Cholesterol data: Range = 221 - 199 = 22
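As a minimal sketch of the points above, the following Python snippet computes the range and shows how two data sets can share the same range while differing in between. Only the two cholesterol extremes (199 and 221) come from these notes; the two small samples and the function name `data_range` are made up for illustration.

```python
def data_range(values):
    """Range = greatest value minus least value."""
    return max(values) - min(values)

# Only the two extremes of the cholesterol data are given in the notes.
print(data_range([199, 221]))  # 221 - 199 = 22

# Hypothetical samples: same extremes, very different values in between.
clustered = [10, 29, 30, 30, 31, 50]
spread_out = [10, 15, 25, 35, 45, 50]
print(data_range(clustered), data_range(spread_out))  # 40 40
```

Both hypothetical samples report a range of 40, even though the first is tightly clustered around 30. That is the sense in which the range is insensitive to the values between the extremes.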
Variance
• The average of the squared differences of all the values from the mean.
• The population variance is denoted \( \sigma^2 \) and is found by the formula

  \[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]

  where N is the population size and \( \mu \) is the population mean.
• The sample variance is denoted \( s^2 \) and is found by the formula

  \[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]

  where n is the sample size and \( \bar{X} \) is the sample mean.
• Why do we divide by n - 1 instead of n? It can be shown that dividing by n
  yields values of the sample variance that tend to underestimate the
  population variance. To get larger values of the sample variance we
  decrease the denominator to n - 1. This is also related to degrees of freedom:
  once the sample mean is fixed, only n - 1 of the values in a sample can be
  freely assigned.
• Why do we use the squared distances instead of the actual distances? One
  reason is that the sum of the actual (signed) distances from the mean will
  always be zero.
• There is also a quicker computational formula for the sample variance. It is:

  \[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - \frac{\left( \sum_{i=1}^{n} X_i \right)^2}{n}}{n - 1} \]
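Returning to the question of why we divide by n - 1: the claim can be checked exactly on a tiny case. The sketch below assumes a made-up population of three values and lists every possible sample of size 2 drawn with replacement; the population, the helper name `variance`, and the sampling scheme are illustrative assumptions, not part of the course material.

```python
from fractions import Fraction
from itertools import product

def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / divisor

# Hypothetical population, stored as exact fractions to avoid rounding.
population = [Fraction(2), Fraction(4), Fraction(9)]
sigma2 = variance(population, len(population))  # population variance

# Every possible sample of size n = 2, drawn with replacement.
n = 2
samples = list(product(population, repeat=n))

avg_dividing_by_n = sum(variance(s, n) for s in samples) / len(samples)
avg_dividing_by_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(sigma2)                     # 26/3
print(avg_dividing_by_n)          # 13/3, too small on average
print(avg_dividing_by_n_minus_1)  # 26/3, matches the population variance
```

Averaged over all nine samples, dividing by n gives only half the population variance here, while dividing by n - 1 recovers it exactly.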
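Example
Consider the sample: 4, 2, 3, 5, 6. Calculate the sample variance.

\[ \bar{X} = \frac{20}{5} = 4 \]

          X_i    X_i^2    X_i - \bar{X}    (X_i - \bar{X})^2
           4      16           0                  0
           2       4          -2                  4
           3       9          -1                  1
           5      25           1                  1
           6      36           2                  4
Total     20      90           0                 10

Using the definition:

\[ s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{10}{4} = 2.5 \]

Using the computational formula:

\[ s^2 = \frac{\sum X_i^2 - \frac{(\sum X_i)^2}{n}}{n - 1} = \frac{90 - \frac{20^2}{5}}{4} = \frac{10}{4} = 2.5 \]

It appears that the short cut version saves no time at all, but be assured it does!
Particularly with large data sets or with means with a lot of decimals!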
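The two routes to the sample variance can be sketched in Python as follows. The function names `variance_definition` and `variance_shortcut` are made up for illustration; the data are the sample 4, 2, 3, 5, 6 from the example.

```python
def variance_definition(xs):
    """s^2 from the definition: squared deviations from the mean over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_shortcut(xs):
    """s^2 from the computational formula: (sum X^2 - (sum X)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

sample = [4, 2, 3, 5, 6]
print(variance_definition(sample))  # 2.5
print(variance_shortcut(sample))    # 2.5
```

The shortcut needs only the running totals of X and X squared, so it never has to subtract a long decimal mean from every value.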
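Example
Consider the sample: 5.2, 4.2, 3.1, 3.6, 4.7, 4.8, 4.1. Calculate the sample variance.

\[ \bar{X} = \frac{29.7}{7} = 4.242857 \]

          X_i     X_i^2     X_i - \bar{X}    (X_i - \bar{X})^2
          5.2     27.04        .95714             .91612
          4.2     17.64       -.04286             .00184
          3.1      9.61      -1.14286            1.3061
          3.6     12.96       -.64286             .41327
          4.7     22.09        .45714             .20898
          4.8     23.04        .55714             .3104
          4.1     16.81       -.14286             .02041
Total    29.7    129.19       -.00002            3.1771

(The deviations total -.00002 rather than 0 because of rounding error.)

Using the definition:

\[ s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{3.1771}{6} = .5295 \]

Using the computational formula:

\[ s^2 = \frac{\sum X_i^2 - \frac{(\sum X_i)^2}{n}}{n - 1} = \frac{129.19 - \frac{29.7^2}{7}}{6} = .52952381 \]

Much quicker!!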
Standard Deviation
• Standard deviation is the square root of the variance.
• Why is it necessary to take the square root? Because the variance uses
  squared distances, its units are the squares of the units of the original
  data. Taking the square root of the variance puts the result back into
  the same units as the data set.
• Remember that standard deviation and variance are never negative
  quantities! (They are zero only when every value in the data set is the same.)
• To roughly estimate standard deviation, one rule of thumb is to use
  Range/4. The rule rests on the idea that most values lie within 2 standard
  deviations of the mean, so the range spans about 4 standard deviations.
  Accordingly, an observation is said to be “unusual” if it falls
  farther away from the mean than 2 standard deviations.
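A minimal sketch of these points in Python, using the seven-value sample from the second variance example. It relies on the standard library's `statistics.variance` (which divides by n - 1) and `math.sqrt`; the variable names are illustrative.

```python
import math
import statistics

# Sample from the second variance example.
sample = [5.2, 4.2, 3.1, 3.6, 4.7, 4.8, 4.1]

s2 = statistics.variance(sample)          # sample variance (divides by n - 1)
s = math.sqrt(s2)                         # standard deviation, in the units of the data
rough = (max(sample) - min(sample)) / 4   # Range/4 rule of thumb

print(round(s2, 4))     # 0.5295
print(round(s, 4))      # 0.7277
print(round(rough, 3))  # 0.525

# Flag values more than 2 standard deviations from the mean as "unusual".
mean = statistics.mean(sample)
unusual = [x for x in sample if abs(x - mean) > 2 * s]
print(unusual)          # [] for this sample
```

Range/4 gives 0.525 against an actual standard deviation of about 0.728, which is a reminder that the rule is only a rough estimate, especially for small samples.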
3-2 Exercises page 104 – 107 # 2,3,4,8,17
Uses of the Variance and Standard Deviation
• Used to determine the spread of the data. If the variance is large, the data are more
  dispersed. This information is useful in comparing two or more data sets to
  determine which is more variable.
• Used to determine the consistency of a variable. For example, in the
  manufacture of fittings, like nuts and bolts, the variation in the diameters must be
  small, or the parts will not fit together.
• Used to determine the number of data values that fall within a specified interval in
  a distribution (Empirical Rule and Chebyshev’s Theorem).
• Used often in statistical inference, later in the course.
P.L. Chebyshev (1821 – 1894), a Russian mathematician, specified the proportions of the spread
or dispersion of a variable in terms of the standard deviation.
Chebyshev’s Theorem
The proportion of values from a data set that will fall within k standard deviations of the
mean will be at least \( 1 - \frac{1}{k^2} \), where k is a number greater than 1.
For example, when k = 2, \( 1 - \frac{1}{k^2} = \frac{3}{4} \), so at least 75 % will fall within 2 standard
deviations of the mean.
When k = 3, \( 1 - \frac{1}{k^2} = \frac{8}{9} \), so at least 88.9 % will fall within 3 standard deviations of the
mean.
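The theorem can be checked against actual data with a short Python sketch. The data set below is hypothetical and deliberately lopsided, the function names are made up, and the standard deviation is computed with `statistics.pstdev`, which assumes the data set is treated as the whole population.

```python
import statistics

def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

def proportion_within(data, k):
    """Observed proportion of values within k standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # assumption: data set treated as the population
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)

print(chebyshev_bound(2))            # 0.75
print(round(chebyshev_bound(3), 3))  # 0.889

# Hypothetical, far-from-bell-shaped data set.
data = [1, 1, 1, 2, 2, 3, 4, 20]
for k in (1.5, 2, 3):
    # The observed proportion never drops below the guaranteed minimum.
    assert proportion_within(data, k) >= chebyshev_bound(k)
```

For this data set 7 of the 8 values (87.5 %) lie within 2 standard deviations of the mean, comfortably above the guaranteed 75 %. The theorem gives a floor, not an estimate.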
Examples
1. The mean price of houses in a certain neighbourhood is $50 000 and the
   standard deviation is $10 000. Find the price range in which at least 75 % of
   the houses will fall.
   Answer: $30 000 to $70 000
   (75 % corresponds to k = 2, so the range is 50 000 ± 2 × 10 000.)
2. A survey of local companies found that the mean amount of travel allowance
   for executives is $0.25 per mile with a standard deviation of $0.02. What is
   the minimum percentage of data values that fall between $0.20 and $0.30?
   Answer: at least 84 %
   The interval (0.20, 0.30) is 0.25 ± 0.05
   0.05 = k(0.02)
   k = 2.5
   \( 1 - \frac{1}{2.5^2} = 1 - .16 = .84 \), so at least 84 % will fall between those limits.
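Both examples can be reproduced with a short Python sketch. The function names `chebyshev_interval` and `chebyshev_min_proportion` are made up for illustration, and the second function assumes the interval is symmetric about the mean, as it is in the travel allowance question.

```python
import math

def chebyshev_interval(mean, sd, proportion):
    """Interval mean ± k*sd that holds at least `proportion` of the values."""
    k = 1 / math.sqrt(1 - proportion)  # solve 1 - 1/k^2 = proportion for k
    return mean - k * sd, mean + k * sd

def chebyshev_min_proportion(mean, sd, low, high):
    """Minimum proportion inside [low, high], assumed symmetric about the mean."""
    k = (high - mean) / sd
    return 1 - 1 / k**2

# Example 1: house prices, at least 75 %.
low, high = chebyshev_interval(50_000, 10_000, 0.75)
print(low, high)  # 30000.0 70000.0

# Example 2: travel allowance between $0.20 and $0.30.
print(round(chebyshev_min_proportion(0.25, 0.02, 0.20, 0.30), 2))  # 0.84
```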
Chebyshev’s Theorem applies to any distribution regardless of its shape. When a
distribution is bell-shaped (or normal), we can apply the following rule.
Empirical Rule
For data sets having an approximately bell-shaped distribution,
• About 68 % of all values fall within 1 standard deviation of the mean
• About 95 % of all values fall within 2 standard deviations of the mean
• About 99.7 % of all values fall within 3 standard deviations of the mean
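The percentages in the Empirical Rule can be recovered from the normal curve itself. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) and assumes an exactly normal distribution; real data that are only approximately bell-shaped will match less closely.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of a normal distribution within k standard deviations of the mean.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} SD: {p:.1%}")  # 68.3%, 95.4%, 99.7%
```

For k = 2 and k = 3 these figures are well above the Chebyshev minimums of 75 % and 88.9 %, which is the payoff for knowing the shape of the distribution.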
Do Exercise 33, 32 on page 109
Comparison of variation for 2 or more data sets is fine as long as the units
measured are the same for each data set. But what if someone wishes to compare
variation between two variables with different units of measurement? For example,
number of cars sold and sales commissions? For this we can use coefficients of variation.
The coefficient of variation is the standard deviation divided by the mean, and
then multiplied by 100. This is expressed as a percentage.
Example
The mean of the number of sales of cars over a three-month period is 87 and the standard
deviation is 5. The mean of sales commissions is $5225 and the standard deviation is
$773. Compare the variations of the two.
Solution:
The coefficients of variation are
\[ CV = \frac{5}{87} \cdot 100\% = 5.7\% \] (number of car sales)
\[ CV = \frac{773}{5225} \cdot 100\% = 14.8\% \] (sales commissions)
Commissions are more variable than car sales.
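The same comparison as a minimal Python sketch; the function name `coefficient_of_variation` is made up for illustration, and the figures are those given in the example.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_cars = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)

print(round(cv_cars, 1), round(cv_commissions, 1))  # 5.7 14.8
```

Because the units cancel in the division, the two percentages can be compared directly even though one variable is a count of cars and the other is in dollars.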