3.2 Measure of Dispersion:
Example 2:
Suppose two factories produce batteries. From each factory, 10 batteries are
drawn and tested for their lifetime (in hours). These lifetimes are:
Factory 1: 10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1
Factory 2: 16, 5, 7, 14, 6, 15, 3, 13, 9, 12.
The mean lifetime is 10 hours for both factories. However, a look at the data shows
that the batteries produced by factory 1 are much more reliable than those produced
by factory 2. This implies that the mean alone is not enough: other measures are
required to describe the “dispersion” or “variation” of the data.
(I) Range:
range=(largest value of the data)-(smallest value of the data).
Example 2 (continued):
Range of lifetime data for factory 1=10.1-9.9=0.2
Range of lifetime data for factory 2=16-3=13
Thus, the range of battery lifetimes for factory 1 is much smaller than the one for
factory 2.
Note: the range is seldom used as the only measure of dispersion. The
range is highly influenced by an extremely large or an extremely small
data value.
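The range calculation for the two factories can be checked with a short Python sketch. The variable and function names here are illustrative choices, not part of the original notes; the result for factory 1 is rounded because 10.1 and 9.9 are not stored exactly as binary floating-point numbers.

```python
# Battery lifetimes (in hours) from Example 2.
factory_1 = [10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1]
factory_2 = [16, 5, 7, 14, 6, 15, 3, 13, 9, 12]


def data_range(values):
    """Range = (largest value of the data) - (smallest value of the data)."""
    return max(values) - min(values)


# Round to suppress floating-point noise in 10.1 - 9.9.
print(round(data_range(factory_1), 4))  # 0.2
print(data_range(factory_2))            # 13
```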
(II) Interquartile Range:
The interquartile range is the difference between the third and the first quartiles. That is,
IQR = Q3 - Q1.
Example 2 (continued):
The first and third quartiles are 9.9 and 10.1 for the data from factory 1,
and 6 and 14 for the data from factory 2. Therefore,
IQR (factory 1)=10.1-9.9=0.2
IQR (factory 2)=14-6=8.
Thus, the interquartile range of battery lifetimes for factory 1 is much smaller than
the one for factory 2.
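The quartiles above can be reproduced with a small Python sketch. It assumes the position rule used in the worked solutions of these notes: compute i = n × p / 100; if i is a whole number, average the i-th and (i+1)-th ordered values, otherwise round i up and take that ordered value. The function name `percentile` is a hypothetical helper, and `p` is assumed to be an integer percentage.

```python
factory_1 = [10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1]
factory_2 = [16, 5, 7, 14, 6, 15, 3, 13, 9, 12]


def percentile(values, p):
    """p-th percentile (0 < p < 100) by the position rule i = n * p / 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    data = sorted(values)
    # Integer arithmetic avoids floating-point trouble when checking
    # whether i = n * p / 100 is a whole number (p assumed to be an integer).
    whole, remainder = divmod(len(data) * p, 100)
    if remainder == 0:
        # i is a whole number: average the i-th and (i+1)-th ordered values.
        return (data[whole - 1] + data[whole]) / 2
    # Otherwise round i up; the 1-based position ceil(i) is index `whole`.
    return data[whole]


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    return percentile(values, 75) - percentile(values, 25)


print(percentile(factory_1, 25), percentile(factory_1, 75))  # 9.9 10.1
print(percentile(factory_2, 25), percentile(factory_2, 75))  # 6 14
print(round(iqr(factory_1), 4), iqr(factory_2))              # 0.2 8
```

Library routines such as `statistics.quantiles` use interpolation rules and can return slightly different quartiles for small samples, so the rule is written out explicitly here.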
(III) Variance and Standard Deviation:
Population deviation about the mean:
$$y_i - \mu, \quad i = 1, 2, \ldots, N$$
Sample deviation about the mean:
$$x_i - \bar{x}, \quad i = 1, 2, \ldots, n$$
Intuitively, the population deviation and the sample deviation measure how far each
data value is from the “center” of the data. The population variance and the sample
variance are then built from the sums of the squared deviations,
$$\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}$$
and
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1},$$
respectively. The population standard deviation and sample standard deviation are the
square roots of the population variance and sample variance:
$$\sigma = \sqrt{\sigma^2}$$
and
$$s = \sqrt{s^2},$$
respectively.
A large sample variance or sample standard deviation implies that the data are
“dispersed”, that is, highly varied.
Note:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = \sum_{i=1}^{n} x_i - n \cdot \frac{\sum_{i=1}^{n} x_i}{n} = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} x_i = 0$$
Example 2 (continued):
$$s^2(\text{factory 1}) = \frac{(10.1-10)^2 + (9.9-10)^2 + \cdots + (10.1-10)^2}{10-1} = 0.0111$$
$$s^2(\text{factory 2}) = \frac{(16-10)^2 + (5-10)^2 + \cdots + (12-10)^2}{10-1} = 21.1111$$
Thus, the sample variance of battery lifetimes for factory 2 is 1900 times the
one for factory 1.
The sample standard deviations for the data from factories 1 and 2 are
$$\sqrt{0.01111} = 0.1054 \quad \text{and} \quad \sqrt{21.1111} = 4.5946,$$
respectively.
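As a check on the arithmetic, the sample variance and standard deviation can be computed directly from the definition. This is a minimal sketch; `sample_variance` is an illustrative name, and Python's standard `statistics.variance` is used only as a cross-check.

```python
import math
import statistics

factory_1 = [10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1]
factory_2 = [16, 5, 7, 14, 6, 15, 3, 13, 9, 12]


def sample_variance(values):
    """Sum of squared deviations about the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


for name, data in (("factory 1", factory_1), ("factory 2", factory_2)):
    s2 = sample_variance(data)
    s = math.sqrt(s2)
    # Cross-check against the standard library implementation.
    assert math.isclose(s2, statistics.variance(data))
    print(name, round(s2, 4), round(s, 4))
```

Rounded to four decimals, the standard deviation for factory 2 prints as 4.5947; the value 4.5946 quoted in the text is the same number truncated rather than rounded.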
(IV) Coefficient of Variation:
The coefficient of variation is another useful statistic for measuring the dispersion of
the data. The coefficient of variation is
$$C.V. = \frac{s}{\bar{x}} \times 100$$
The coefficient of variation is invariant with respect to the scale of the data. On the
other hand, the standard deviation is not scale-invariant. The following example
demonstrates the property.
Example 2 (continued):
In the battery data from factory 1, suppose the lifetimes are measured in minutes rather
than hours. Then the data are 606, 594, 606, 594, 594, 606, 594, 606, 594, 606.
The standard deviation becomes 6.3245, which is 60 times the value 0.1054
obtained from the original data measured in hours. However, whether the data are
measured in hours or in minutes, the coefficient of variation is
$$C.V. = \frac{0.1054}{10} \times 100 = \frac{6.3245}{600} \times 100 = 1.054.$$
Note: since the coefficient of variation is scale-invariant, it is very useful
for comparing the dispersion of different data sets. For example, in the
previous battery data, if the lifetimes of the batteries from factory 1 and
factory 2 are measured in minutes and hours, respectively, the standard
deviation for factory 1, 6.3245, would be larger than the one for factory 2, 4.5946.
However, the coefficient of variation for factory 1, 1.054, is still much
smaller than the one for factory 2, 45.946.
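The scale-invariance property can be illustrated with a brief Python sketch using the factory 1 data in hours and in minutes. The function name `coefficient_of_variation` is an illustrative choice; `statistics.stdev` and `statistics.mean` are the standard-library sample standard deviation and mean.

```python
import statistics

hours = [10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1]
minutes = [606, 594, 606, 594, 594, 606, 594, 606, 594, 606]


def coefficient_of_variation(values):
    """C.V. = (sample standard deviation / sample mean) * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# The standard deviation scales with the unit of measurement ...
ratio = statistics.stdev(minutes) / statistics.stdev(hours)
print(round(ratio, 6))  # 60.0

# ... but the coefficient of variation does not.
print(round(coefficient_of_variation(hours), 3))    # 1.054
print(round(coefficient_of_variation(minutes), 3))  # 1.054
```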
Example 3:
The amount of time (in minutes) that a sample of students spends watching television
per day is given below.
40 25 35 30 20 40 30 40 10 30 20 10 5 20 20
(a) Compute the mean
(b) The standard deviation.
(c) The coefficient of variation.
(d) The 40th percentile.
(e) The mode.
(f) The interquartile range.
(g) Construct a frequency distribution, a cumulative frequency distribution and a
relative frequency distribution. Let the first class be 1-10.
[solution:]
(a)
$$\bar{x} = \frac{\sum_{i=1}^{15} x_i}{15} = \frac{40 + 25 + \cdots + 5 + 20}{15} = 25$$
(b)
$$s = \sqrt{\frac{\sum_{i=1}^{15} (x_i - \bar{x})^2}{15 - 1}} = \sqrt{\frac{(40-25)^2 + (25-25)^2 + \cdots + (5-25)^2 + (20-25)^2}{14}} = 11.339$$
(c)
$$C.V. = \frac{s}{\bar{x}} \times 100 = \frac{11.339}{25} \times 100 = 45.356.$$
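(d)
1. The data in ascending order are
5 10 10 20 20 20 20 25 30 30 30 35 40 40 40
2. The position index is
$$i = 15 \times \frac{40}{100} = 6.$$
Since $i$ is an integer, the 40th percentile is the average of the 6th and 7th
ordered values. Thus,
$$\frac{20 + 20}{2} = 20$$
is the 40th percentile.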
(e) The mode is 20, which occurs four times.
(f) Since $Q_1 = 20$ and $Q_3 = 35$,
$$IQR = Q_3 - Q_1 = 35 - 20 = 15.$$
(g)
Class    Frequency    Cumulative Frequency    Relative Frequency
1-10     3            3                       3/15 = 0.2
11-20    4            7                       4/15
21-30    4            11                      4/15
31-40    4            15                      4/15

Online Exercise:
Exercise 3.2.1
Exercise 3.2.2
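The frequency, cumulative frequency and relative frequency distributions in part (g) of Example 3 can be tabulated with a short Python sketch. The class limits 1-10, 11-20, 21-30 and 31-40 follow the solution; the variable names are illustrative, and `fractions.Fraction` is used so that relative frequencies stay exact (3/15 is shown in lowest terms as 1/5).

```python
from fractions import Fraction

times = [40, 25, 35, 30, 20, 40, 30, 40, 10, 30, 20, 10, 5, 20, 20]
classes = [(1, 10), (11, 20), (21, 30), (31, 40)]

rows = []
cumulative = 0
for low, high in classes:
    # Count the observations falling in the class, limits included.
    frequency = sum(low <= x <= high for x in times)
    cumulative += frequency
    relative = Fraction(frequency, len(times))
    rows.append((f"{low}-{high}", frequency, cumulative, relative))

print(f"{'Class':<8}{'Freq':<6}{'Cum':<6}{'Rel'}")
for label, frequency, cum, relative in rows:
    print(f"{label:<8}{frequency:<6}{cum:<6}{relative}")
```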