Download 8a The Variance MEI

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia, lookup

Bootstrapping (statistics) wikipedia, lookup

Misuse of statistics wikipedia, lookup

Taylor's law wikipedia, lookup

Resampling (statistics) wikipedia, lookup

Student's t-test wikipedia, lookup

Transcript
“Teach A Level Maths”
Statistics 1
The Sample Variance
© Christine Crisp
The Sample Variance
Statistics 1
MEI/OCR
"Certain images and/or photos on this presentation are the copyrighted property of JupiterImages and are being used with
permission under license. These images and/or photos may not be copied or downloaded without permission from JupiterImages"
The Sample Variance
Can you find the medians and means for the following 3
data sets?
Median Mean,
Set A
Set B
Set C
1
1
1
2
1
5
3
1
5
4
4
5
5
5
5
6
6
5
7
9
5
8
9
5
9
9
9
5
5
5
x
5
5
5
Although the medians and means are the same, the data
sets are not really alike.
The spread or variability of the numbers is quite different.
How can we measure the spread within the data sets?
ANS: The range and inter-quartile range both measure
spread but neither uses all the data items.
The Sample Variance
Median Mean,
Set A
Set B
Set C
1
1
1
2
1
5
3
1
5
4
4
5
5
5
5
6
6
5
7
9
5
8
9
5
9
9
9
5
5
5
x
5
5
5
If you had to invent a method of measuring spread that
used all the data items, what could you do?
One thing we could do is find out how far each item is
from the mean and add up these differences. e.g.
Set A: x
x x
1 2 3 4
4 3 2 1
 ( x  x) 
5
0
6
1
7
2
8
3
9
4
x5
4 3 . . . + 3 + 4 = 0
Data sets B and C give the same result. The negative and
positive values have cancelled each other out.
The Sample Variance
To avoid the effect of the negative values we can either
•
•
ignore the negative signs, or
square each difference ( since the squares
will all be positive ).
Squaring is more convenient for developing theory, so, e.g.
Set A: x
xx
( x  x)2
1 2 3 4
4 3 2 1
16 9
4
1
5
0
0
6
1
1
7
2
4
2
(
x

x
)
 60

Let’s do this calculation for all 3 data sets:
8
3
9
9
4
16
The Sample Variance
Mean, x
Set A: x
Set B: x
Set C: x
Set A:
1
1
1
2
(
x

x
)
 60

2
1
5
3
1
5
4
4
5
Set B:
5
5
5
6
6
5
7
9
5
2
(
x

x
)
 98

8
9
5
9
9
9
5
5
5
Set C:
2
(
x

x
)
 32

The larger value for set B shows greater variability.
Set C has least variability.
Can you see a snag with this measurement?
ANS: The calculated value increases if we have more data,
so comparing data sets with different numbers of items
would not be possible.
To allow for this, we need to take n, the number of items,
into account.
The Sample Variance
There are 2 formulae that can be used,
msd 
or
s2 
2
(
x

x
)

n
2
(
x

x
)

n1
the mean square deviation.
the sample variance.
In many books you will find the word variance used for
the 1st of these formulae and you may have used it at
GCSE.
However, our data is nearly always a sample from a large
unknown set of data ( the population ) and we take
samples to find out about the population. The 1st formula
does not give the best estimate of the variance of the
population so is not used.
The Sample Variance
So, there are 2 quantities and their square roots that we
need to be clear about
msd 
and
rmsd 
Also
s2 
and
s
2
(
x

x
)

n
2
(
x

x
)

n
2
(
x

x
)

n1
2
(
x

x
)

n1
the mean square deviation,
the root mean square
deviation.
the sample variance,
the sample standard
deviation.
The Sample Variance
This all seems very complicated but help is at hand.
Both the quantities, rmsd and s are given by your
calculator.
e.g. Find the root mean square deviation, rmsd, and
the sample standard deviation, s, for the following data:
x
7
9
14
Use the Statistics function on your calculator and
enter the data. Select the list of calculations.
You will be able to find the following:
2  94392 . . . and 3  60555 . . .
Ignore the calculator notation.
The rmsd is smaller than s ( because we are dividing by
a larger number ). Correct to 3 s.f. we have
rmsd 
2
(
x

x
)

n
 2  94 ,
s
2
(
x

x
)

n1
 3  61
The Sample Variance
So, for the data
x
7
9
14
we have
rmsd 
2
(
x

x
)

n
 2  94 ,
2
(
x

x
)

s
n1
 3  61
Squaring these gives
msd 
 ( x  x)
n
2
 8  67,
( mean square deviation )
The part of the formula,
s2 
2
(
x

x
)

n1
 13
( sample variance )
2
(
x

x
)
, is in your formulae

booklet ( see correlation and regression ), labelled Sxx.
An expanded form of the expression is also given. All you
have to do is divide by the correct quantity.
The Sample Variance
e.g.1 For the following sample data, find
(a) the root mean square deviation, rmsd,
(b) the mean square deviation, msd,
(c) the sample standard deviation, s, and
(d) the sample variance s2.
x
12
15
14
9
Answer: Using the calculator functions,
(a)
(c)
rmsd  2  29 ( 3 s. f . ) (b) msd  5  25
s  2  65 ( 3 s. f . )
(d)
s2  7
The Sample Variance
e.g.2 The following summary data are given for a sample
of size 5:
Find
(a)
(b)
(c)
(d)
n  5,
the
the
the
the
2
(
x

x
)
 24

mean square deviation, msd,
root mean square deviation, rmsd ,
sample variance, s2
sample standard deviation, s , and,
Solution: Using the formulae book, S xx 
(a)
(b)
(c)
(d)
S xx
24
msd =

 48
5
n
rmsd = 4  8  2  19 ( 3 s. f . )
S xx
24
2
2
s 
 s 
6
n1
4
s  6  2  45 ( 3 s . f . )
2
(
x

x
)

SUMMARY
The Sample Variance
 The mean square deviation, msd, and sample variance,
both measure the spread or variability in the data.
 If we have raw data we use the statistical functions on
the calculator to find the rmsd or sample standard
deviation.
The sample standard deviation is the larger of these
quantities.
 To find the msd or sample variance, we square the
relevant quantity given by the calculator:
sample variance  s2
msd = (rmsd)2
 For summary data, we use the formulae book, choosing
2
the appropriate form:
S xx   ( x  x )   x 
2
2
 x 
n
Then, we divide by n for the msd or (n – 1) for s2.
The Sample Variance
Frequency Data
The formula for the variance can be easily adapted
to find the variance of frequency data.
 x 
2
S xx   ( x  x )   x 
2
2
becomes
n
 xf 
2
S xx   ( xf  x )   x f 
2
2
f
We only use the formulae if we are given summary
data. With raw data we enter the data into the
calculator and use the statistical functions to get
the answers directly.
The Sample Variance
e.g.1 Find the mean and sample standard deviation of
the following data:
x
Frequency, f
1
3
2
5
5
8
10
4
Solution:
Using the calculator functions, the mean, m = 4  65
sample standard deviation,
s  3  17 ( 3 s. f . )
Although we don’t need the formula for this question, let’s
check we have the correct value by using the formula:
The Sample Variance
e.g.1 Find the mean and sample standard deviation of
the following data:
x
Frequency, f
1
3
2
5
Solution:
5
8
10
4
 xf 
2
S xx   x f 
2
f
2
(
1

3

.
.
.

10

4
)
So, S xx  1 2  3  . . .  10 2  4 
20
 190  55
190  55
S xx
2

 10  029  s  3  17 ( 3 s. f . )
s 
19
n1
The Sample Variance
e.g.2 Find the sample standard deviation of the following
lengths:
Length (cm)
1-9
Frequency, f
2
10-14 15-19 20-29
7
Solution: We need the class mid-values
12
9
The Sample Variance
e.g.2 Find the sample standard deviation of the following
lengths:
Length (cm)
1-9
10-14 15-19 20-29
x
5
12
17
24·5
Frequency, f
2
7
12
9
Solution: We need the class mid-values
We can now enter the values of x and f on our
calculators.
Standard deviation, s = 5  77 ( 3 s. f . )
The Sample Variance
e.g.3 Find the mean and sample variance of 20 values of
x given the following:
 x  82 and
2
x
  370
Solution:
Since we only have summary data, we must use the formulae
sample mean,
x

x
n
 x 
82
 x   41
20
2
S xx   x 2 
 S xx
n
(82) 2
 370 
20
 33  8
S xx
sample variance, s 
 1  78 ( 3 s. f . )
n1
2
The Sample Variance
SUMMARY
 To find the root mean square deviation, rmsd, or the
sample standard deviation, s, using the calculator
functions,
•
•
the values of x ( and f ) are entered and
checked,
the table of calculations gives both values,
•
the larger value is the sample standard deviation, s
•
the variance is the square of the standard
deviation.
The Sample Variance
Exercise
Find the mean, sample standard deviation and sample
variance for each of the following samples, using calculator
functions where appropriate.
1.
2.
x
f
1
7
Time ( mins )
f
2
9
3
14
1-5
7
3. 10 observations where
4
12
6-10
9
5
8
11-15 16-20 21-25
14
12
8
 x  432 and
 x  18912
2
The Sample Variance
1.
x
f
1
7
Answer:
2
9
mean,
3
14
4
12
5
8
x  31
standard deviation, s = 1  28 ( 3 s. f . )
2
variance,
s
 64calculator value
N.B. To find s we need to use the 1full
for s, not the answer to 3 s.f.
2.
Time ( mins )
1-5
6-10 11-15 16-20 21-25
2
x
3
8
13
18
23
f
7
9
14
12
8
Answer:
mean, x  13 5
standard deviation, s = 6  41
( 3 s.f. )
2
variance, s  41  1 ( 3 s. f . )
The Sample Variance
3. 10 observations where  x  432 and
Solution:
x

x
 
mean,
 x 
x  43  2
n
2
S xx   x 2 
variance,
n
S xx
s 
n1
2
 x  18912
2


Standard deviation, s 
S xx
(432) 2
 18912
10
 249  6
s 2  27  7 (3 s.f. )
27  7
 5  27 ( 3 s.f. )
The Sample Variance
Outliers
We’ve already seen that an outlier is a data item that
lies well away from the other data. It may be a genuine
observation or an error in the data.
e.g. 1 Consider the following data:
10 12 14 17 19 21 81
With this data set, we would immediately suspect an
error. The value 81 was likely to have been 18. If
so, there would be a large effect on the mean and
standard deviation although the median would not be
affected and there would be little effect on the IQR.
The presence of possible outliers is an argument in
favour of using median and IQR as measures of data.
The Sample Variance
In an earlier section, we met a method of identifying
outliers using a measure of 1·5  IQR above or below the
median.
A 2nd method used to identify outliers is to find points that
are further than 2 standard deviations from the mean.
e.g. 2. Consider the following sample:
10 12 14 17 18 19 21 22 24 33
The sample mean and sample standard deviation are :
mean, x  19
standard deviation, s = 6  62 ( 3 s. f . )
So,
2 s  13  2
and
x  13  2  32  2
The point 33 is more than 2 standard deviations above
the mean so, using this measure, it is an outlier.
The following slides contain repeats of
information on earlier slides, shown without
colour, so that they can be printed and
photocopied.
For most purposes the slides can be printed
as “Handouts” with up to 6 slides per sheet.
The Sample Variance
There are 2 formulae that can be used to measure spread:
msd 
or
s2 
2
(
x

x
)

n
2
(
x

x
)

n1
the mean square deviation.
the sample variance,
In many books you will find the word variance used for
the 1st of these formulae and you may have used it at
GCSE.
However, our data is nearly always a sample from a large
unknown set of data ( the population ) and we take the
sample to find out about the population. The 1st formula
does not give the best estimate of the variance of the
population so is not used.
The Sample Variance
So, there are 2 quantities and their square roots that we
need to be clear about
msd 
and
rmsd 
Also
s2 
and
s
2
(
x

x
)

n
2
(
x

x
)

n
2
(
x

x
)

n1
2
(
x

x
)

n1
the mean square deviation
the root mean square
deviation.
the sample variance,
the sample standard
deviation.
The Sample Variance
e.g. Find the root mean square deviation, rmsd, and
the sample standard deviation, s, for the following data:
x
7
9
14
Use the Statistics function on your calculator and
enter the data. Select the list of calculations.
You will be able to find the following:
2  94392 . . .
3  60555 . . .
Ignore the calculator notation.
The rmsd is smaller than s ( because we are dividing by
a larger number ). Correct to 3 s.f. we have
rmsd 
2
(
x

x
)

n
 2  94 ,
s
2
(
x

x
)

n1
 3  61
The Sample Variance
Squaring these gives
msd 
2
(
x

x
)

n
 8  67,
( mean square deviation )
s2 
2
(
x

x
)

n1
 13
( variance )
Using the formulae:
If summary data are given, you will need to use the
formulae instead of the calculator functions.
The part of the formula,
2
(
x

x
)
, is in your formulae

booklet ( see correlation and regression ), labelled Sxx.
An expanded form of the expression is also given. All you
have to do is divide by the correct quantity, n or n  1.
SUMMARY
The Sample Variance
 The mean square deviation, msd, and sample variance,
both measure the spread or variability in the data.
 If we have raw data we use the stats functions on the
calculator to find the rmsd or sample standard deviation.
The sample standard deviation is the larger of these
quantities.
 To find the msd or sample variance, we square the
relevant quantity given by the calculator:
sample variance  s2
msd = (rmsd)2
 For summary data, we use the formulae book, choosing
2
the appropriate form:
S xx   ( x  x )   x 
2
2
 x 
n
Then, we divide by n for the msd or (n – 1) for s2.
The Sample Variance
e.g.1 For the following sample data, find
(a) the root mean square deviation, rmsd,
(b) the mean square deviation, msd,
(c) the sample standard deviation, s, and
(d) the sample variance s2.
x
12
15
14
9
Answer: Using the calculator functions,
(a)
(c)
rmsd  2  29 ( 3 s. f . ) (b) msd  5  25
s  2  65 ( 3 s. f . )
(d)
s2  7
The Sample Variance
e.g.2 Given the following summary of data for a sample of
size 5, find
(a) the mean square deviation, msd,
(b) the root mean square deviation, rmsd ,
(c) the sample variance s2
(d) the sample standard deviation, s , and,
n  5,
2
(
x

x
)
 24

Solution: Using the formulae book, S xx 
(a)
(b)
(c)
(d)
S xx
24
msd =

 48
5
n
rmsd = 4  8  2  19 ( 3 s. f . )
S xx
24
2
2
s 
 s 
6
n1
4
s  6  2  45 ( 3 s . f . )
2
(
x

x
)

Frequency Data
The Sample Variance
The formula for the variance can be easily adapted
to find the variance of frequency data.
 x 
2
S xx   ( x  x )   x 
2
2
n
becomes
 xf 
2
S xx   ( xf  x )   x f 
2
2
f
As before, we only use the formulae if we are given
summary data.
The Sample Variance
e.g.1 Find the mean and sample standard deviation of
the following data:
x
Frequency, f
1
3
2
5
5
8
Solution:
10
4
 xf 
2
S xx   ( xf  x )   x f 
2
2
f
1  3  . . .  10  4
So, S xx  1  3  . . .  10  4 
20
 190  55
S xx
190  55
2
s 

 10  039  s  3  17 ( 3 s. f . )
n1
19
2
2
The Sample Variance
e.g.2 Find the sample standard deviation of the following
lengths:
Length (cm)
1-9
Frequency, f
2
10-14 15-19 20-29
7
12
9
Solution: We need the class mid-values
x
5
12
17
24·5
Frequency, f
2
7
12
9
We can now enter the values of x and f on our
calculators.
Standard deviation, s = 5  77 ( 3 s. f . )
The Sample Variance
SUMMARY
 To find the root mean square deviation, rmsd, or the
sample standard deviation, s, using the calculator
functions,
•
•
the values of x ( and f ) are entered and
checked,
the table of calculations gives both values,
•
the larger value is the sample standard deviation,
s, and this is the value that is most often used by
statisticians,
•
the variance is the square of the standard
deviation.
The Sample Variance
Outliers
We’ve already seen that an outlier is a data item that
lies well away from the other data. It may be a genuine
observation or an error in the data.
e.g. 1 Consider the following data:
10 12 14 17 19 21 81
With this data set, we would immediately suspect an
error. The value 81 was likely to have been 18. If
so, there would be a large effect on the mean and
standard deviation although the median would not be
affected and there would be little effect on the IQR.
The presence of possible outliers is an argument in
favour of using median and IQR as measures of data.
The Sample Variance
In an earlier section, we met a method of identifying
outliers using a measure of 1·5  IQR above or below the
median.
A 2nd method used to identify outliers is to find points that
are further than 2 standard deviations from the mean.
e.g. 2. Consider the following sample:
10 12 14 17 18 19 21 22 24 33
The sample mean and sample standard deviation are :
mean, x  19
standard deviation, s = 6  62 ( 3 s. f . )
So,
2 s  13  2
and
x  13  2  32  2
The point 33 is more than 2 standard deviations above
the mean so, using this measure, it is an outlier.