Module I3 Sessions 4&5
Practical 1: Numerical summaries for quantitative data
Introduction
Some students may be concerned that their fear of maths formulae will prevent them
from understanding certain statistical ideas. This used to be a valid concern, because data
had to be processed by hand, and this required an understanding of the formulae.
Now you will use a spreadsheet or a statistical package for the calculations. So courses can
be more practical and no longer need to include formulae.
BUT, if you can understand and use the simplest formulae, it will give you more
confidence in using the computer for your analyses. And you may be surprised at how
much you already know!
1. The mean
Here is the entry in the SIAC Glossary for the mean.
The mean is a measure of the “middle”, sometimes called the “average”.
It is often given the symbol $\bar{x}$.
To calculate the mean, sum all the data values and divide by the number of
them. For example with the 7 values 12, 15, 11, 18, 13, 14, 18, the mean
is:
$\bar{x} = (12 + 15 + 11 + 18 + 13 + 14 + 18) / 7 = 14.4$
As a formula, the mean is given by $\bar{x} = \sum x / n$,
where $\sum$ is short for “the sum of”, “x” signifies that each value is taken
in turn, and n is the number of observations.
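If you would like to see the same formula outside a spreadsheet, here is a minimal Python sketch. It is only an illustration, not part of the course materials; the variable names `values`, `total`, `n` and `mean` are our own choices.

```python
# The 7 values from the glossary example above.
values = [12, 15, 11, 18, 13, 14, 18]

total = sum(values)   # the sum of the data values (sigma x)
n = len(values)       # the number of observations
mean = total / n      # the mean: sum divided by n

print(round(mean, 1))  # 14.4, as in the glossary entry
```

Each line matches one piece of the formula: the sum, the count, and the division.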
To check you are comfortable with this formula:

Add the 7 numbers above by hand (i.e. without even using a calculator!) to get the
total, i.e. Σx and then divide by 7 to give the mean as shown above.
Ans: Σx = ______ hence mean = ______

Go into Excel and enter the 7 numbers into a column. Then (with the cell beneath
as the active one) click on the Σ symbol in the toolbar to give the sum, see below.
In another cell, divide this total by n=7 to give the mean.
SADC Course in Statistics

Instead of clicking on Σ, press the small arrow next to it, to give alternative
functions. Use this to give the mean directly, as well as the count, minimum and
maximum.
[Figure: The Σ function in Excel]
[Figure: More simple statistical functions in Excel]
2. More practice with the mean
No. | Question | Answer
1 | Calculate $\bar{x} = \sum x / n$ for the first n positive whole numbers, when n = 6. (Hint: 1 is the first positive number…) | ______
2 | Now (with the calculation performed above in mind) calculate the mean of the numbers 121, 122, 123, 124, 125, 126 without using a calculator. | ______
3 | How did you do it? (See some options below) | ______
a) The new numbers are 120 more than 1, 2, … 6, so I just added 120 to the $\bar{x}$
calculated in the first question.
b) I subtracted 120 from each number to give 1, 2, … 6. So I must add 120 back to the
mean of those numbers.
c) I checked from the formula that $\bar{x} = \sum (120 + x) / n$
= ((120 + 1) + (120 + 2) + …)/6
= 120 + (1 + 2 + 3 + 4 + 5 + 6)/6
= 120 + original mean
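The reasoning in option c) holds for any constant, not just 120. The short Python sketch below checks it on the 7 values from Section 1, using a shift of 100 that we have made up purely for illustration; the helper name `mean` is also our own.

```python
def mean(xs):
    """Mean of a list: the sum of the values divided by how many there are."""
    return sum(xs) / len(xs)

values = [12, 15, 11, 18, 13, 14, 18]   # the 7 values from Section 1
shift = 100                             # hypothetical constant, chosen for illustration
shifted = [x + shift for x in values]   # every value moved up by the same amount

difference = mean(shifted) - mean(values)
print(round(difference, 6))             # the mean moves up by the shift
```

Adding the same amount to every value adds exactly that amount to the mean.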
3. The median
Here is the entry in the SIAC glossary about the median. Read it and complete the
questions below.
The median is the "middle value" of a list. If the list has an odd number of
entries, the median is the middle entry after sorting the list into increasing
order. If the list has an even number of entries, the median is halfway
between the two middle numbers after sorting.
For example, with the same 7 values shown for the mean above, the sorted data are as
follows:
11 12 13 14 15 18 18.
The median is therefore the 4th value in the sorted list, i.e. 14.
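The rule in the glossary entry (sort, then take the middle value or the halfway point between the two middle values) can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `median` and the small even-length list are our own.

```python
def median(xs):
    """Middle value of the sorted list; halfway between the two middle values if n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                       # odd number of entries
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2   # even number of entries

values = [12, 15, 11, 18, 13, 14, 18]
print(median(values))          # 14, the 4th value in the sorted list
print(median([4, 1, 3, 2]))    # 2.5, halfway between 2 and 3 (made-up example)
```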
No. | Question | Answer
1 | Calculate the median of the n = 6 numbers above, i.e. the median of 121, 122, 123, 124, 125, 126. | ______
2 | Change the largest value, 126, to 246 by adding 120 to it. Does the median change? If so, by how much? | ______
3 | Does that make the mean change? | ______
4 | If you think the mean changes, then work out the new value, if possible without doing the calculation again. | ______
5 | How did you work out the mean? (See the table of alternatives below) | ______
a) I did the calculation again using my hand calculator.
b) I did the calculation again using Excel or a statistics package.
c) I noticed that adding 120 to the last number simply adds 120 to the total, and that’s 20
per number. So it must add 20 to the mean.
d) I thought it would add 20 to the mean, but wasn’t sure. So I checked using my
calculator.
e) I used my calculator/Excel/Stats Package and then noticed it added 20 to the previous
value. Then I understood the question better!
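The same experiment can be tried on the 7 values from Section 1. In the sketch below the last value, 18, is increased by 70 to give 88; that change is our own made-up example, chosen so that it works out at 10 per value. The functions come from Python's standard `statistics` module.

```python
import statistics

values = [12, 15, 11, 18, 13, 14, 18]
# Hypothetical change: add 70 to the last value, so 18 becomes 88.
changed = values[:-1] + [values[-1] + 70]

mean_change = statistics.mean(changed) - statistics.mean(values)
median_change = statistics.median(changed) - statistics.median(values)

print(round(mean_change, 6))   # 70 spread over 7 values
print(median_change)           # the middle value of the sorted list has not moved
```

One extreme value pulls the mean but leaves the middle of the sorted list where it was.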
4. A different measure of spread
The mean deviation is sometimes given by the formula
mean deviation $= \sum |x - \bar{x}| / (n - 1)$,
(Here the $|\;|$ symbol is called the “modulus” or “absolute value”, so throw
away the minus sign for negative values.)
The mean deviation is, as is shown by the formula, an average (mean)
difference from the mean.
Calculate the mean deviation for the same 6 values used above.
121 122 123 124 125 126
Ans: _______.
(Note: some statistics packages, e.g. Instat, have an option to calculate the mean deviation.)
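As an illustration of the formula above, here is a minimal Python sketch applied to the 7 values from Section 1 (not the 6 values of the exercise, which are left for you). The function name `mean_deviation` is our own, and the sketch assumes at least 2 observations so that (n-1) is not zero.

```python
def mean_deviation(xs):
    """Sum of absolute deviations from the mean, divided by (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(abs(x - mean) for x in xs) / (n - 1)

values = [12, 15, 11, 18, 13, 14, 18]
print(round(mean_deviation(values), 2))   # 2.57
```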
5. Dividing by (n-1)
Why divide by (n-1) in the formula above?
Here are some possible answers. Tick all those that help to explain the reason to you:
a) You start with n pieces of information. You use one piece of information to calculate
the mean, (which you need first). So you have only (n-1) pieces of information left to
calculate the mean deviation. The number of pieces of information to calculate the spread,
is called the “degrees of freedom”.
b) If you only have one observation, e.g. 121, you can give the mean, (trivially) but cannot
calculate any spread. If you have 2 observations, then the only information about spread is
the difference. So you have one piece of information less to calculate the spread.
c) It would be simpler to divide by n, but statisticians always like to complicate things.
d) If you divide by n, then $\sum |x - \bar{x}| / n$ is like the formula $\sum x / n$ for the
mean. So I can see why it is called the “mean deviation”. It is still roughly the same
formula when we divide by n-1.
6. The variance and standard deviation of the data
The variance is a measure of variability, and is often denoted by $s^2$.
The variance, $s^2$, is given by the formula
$s^2 = \sum (x - \bar{x})^2 / (n - 1)$
a. Calculate the variance for the same values, i.e.
121, 122, 123, 124, 125, 126.
Ans: _______
So the variance is roughly the mean value of $(x - \bar{x})^2$.
The standard deviation (s.d.) is a commonly used summary measure of
variation or spread of a set of data. It is a “typical” distance from the mean.
Usually, about 70% of the observations are closer than 1 standard deviation
from the mean and most (about 95%) are within 2 s.d. of the mean.
The standard deviation is a symmetrical measure of spread, and hence is less
useful and more difficult to interpret for data sets that are skew. It is also
sensitive to (i.e. its value can be greatly changed by) the presence of outliers
in the data.
With the data values
12 15 11 18 13 14 18
the variance, $s^2$, was calculated as 7.62, so the standard deviation:
s = √7.62 = 2.8.
The mean, $\bar{x}$, was 14.4, so ($\bar{x}$ - s) = (14.4 - 2.8) = 11.6
and ($\bar{x}$ + s) = 17.2
So 4 of the 7 observations are within one standard deviation of the mean,
while the other three are outside.
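The figures in this worked example can be checked with a short Python sketch. It is only an illustration; the variable names are our own, and the variance is computed directly from the formula with the (n-1) divisor.

```python
import math

values = [12, 15, 11, 18, 13, 14, 18]
n = len(values)
mean = sum(values) / n

# Variance: sum of squared deviations from the mean, divided by (n - 1).
variance = sum((x - mean) ** 2 for x in values) / (n - 1)
sd = math.sqrt(variance)

# Observations lying within one standard deviation of the mean.
within = [x for x in values if mean - sd <= x <= mean + sd]

print(round(variance, 2), round(sd, 1), len(within))   # 7.62 2.8 4
```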
b. Simply put, the standard deviation is s, which is the √ (square root) of the
variance.
Calculate the standard deviation, s.
Ans: ______
c. Is the standard deviation larger than the mean deviation?
Ans: Yes/No
d. Do you think this will almost always be the case?
a) Yes.
b) No, sometimes it will be larger, and sometimes smaller.
c) I am not sure – it would be good to check with more examples.
d) Yes, because when you square (in calculating the variance) you give the big deviations
even more importance.
7. Using Excel for these formulae
a. Go into Excel and follow these instructions.
No. | Instruction | Comment | Answer
1 | Type n in cell A1, and x in cell B1 | Naming the variables |
2 | Enter the numbers 1, 2, … 7 below n. Cells (A2:A8) | |
3 | Enter 12, 15, 11, 18, 13, 14, 18 below x. Cells (B2:B8) | |
4 | In cells (A10:A13) type Sum, (n-1), mean, stdev | |
5 | In B10 use the Σ function or type =SUM(B2:B8) | This is Σx | = ______
6 | In B11 use the ▼ by the Σ, or type =COUNT(B2:B8)-1 | This is (n-1) | = ______
7 | In B12 use the ▼ again, or type =AVERAGE(B2:B8) | This is $\bar{x}$ | = ______
8 | In B13 use fx and then STDEV or type =STDEV(B2:B8) | The standard deviation | = ______
9 | Decrease the number of decimals to 2 in B12 and B13 | Tidying |
Now, to reinforce the formulae, the standard deviation is to be calculated from first
principles. This has the bonus that you will calculate the mean deviation at the same time.
The Excel function for the mean deviation is called AVEDEV, but it divides by n instead of
(n-1), so we prefer to calculate it ourselves.
b. Follow the instructions below:
No. | Instruction | Comment | Answer
1 | Type (x-mean) in cell D1, |x-mean| in E1 and (x-mean)² in F1 | Naming the variables |
2 | Enter the formula =(B2-B$12) in D2. Copy down to D8 | Note the $ to fix B12 |
3 | Enter the formula =ABS(D2) into E2. Copy down to E8 | The absolute deviations |
4 | In E10 use the Σ function or type =SUM(E2:E8) | Sum of these deviations |
5 | In E11 calculate =E10/B11 | Mean deviation |
6 | Cut the decimals back to 2 so the deviations and results are clearer | Also do this below |
7 | In D11 type mdev to remind you what is calculated | |
8 | Enter the formula =D2*D2 into F2. Copy down to F8 | The deviations squared |
9 | In F10 use the Σ function or type =SUM(F2:F8) | |
10 | In F11 calculate =F10/B11 | The variance (roughly the mean of the squared deviations) |
11 | In F13 calculate =SQRT(F11) | The standard deviation |
12 | Compare F13 with B13 to check they are the same | (Hint: if they are not, check all your values are in the right cells) |
This exercise has been to reinforce the use of the formulae. They can help you to
understand the summary statistics. It has also shown that once you know a formula you
can use it in Excel from first principles to construct a new summary. In earlier sessions
you saw that you could construct a new graph – a (jittered) dot plot if you understood the
concept. This is the parallel idea for summary statistics.
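The same first-principles layout can be mirrored outside Excel. The Python sketch below is an illustration under our own naming; the comments show which worksheet cells each line corresponds to. Python's `statistics.stdev` divides by (n-1), so it plays the role that STDEV plays in B13.

```python
import math
import statistics

x = [12, 15, 11, 18, 13, 14, 18]                  # column B (B2:B8)
n_minus_1 = len(x) - 1                            # B11
mean = sum(x) / len(x)                            # B12

deviations = [xi - mean for xi in x]              # column D: (x - mean)
abs_deviations = [abs(d) for d in deviations]     # column E: |x - mean|
squared = [d * d for d in deviations]             # column F: (x - mean) squared

mean_deviation = sum(abs_deviations) / n_minus_1  # E11
variance = sum(squared) / n_minus_1               # F11
sd = math.sqrt(variance)                          # F13

# The first-principles value and the built-in function should agree.
print(round(sd, 2), round(statistics.stdev(x), 2))
```

If the two printed numbers differ, the most likely cause is the same as in the spreadsheet: a value in the wrong place.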
8. The coefficient of variation
The coefficient of variation, sometimes called the cv, is a summary statistic
that Excel does not have a function for, but it is easy to calculate.
The formula is cv = 100*stdev/mean
Its appeal is that it is dimensionless: you do not need to know the units of
measurement for it to have meaning.
It measures the variation in a set of data (stdev), as a fraction of the mean and
expressed as a percentage.
a. Type cv in cell A14 and, in B14, calculate the cv for the data in B2:B8.
No summary statistic should be used unless it helps you and the reader to interpret the
data. The cv is an overused summary statistic and is sometimes used when it is not a
sensible summary to calculate.
b. Looking at the formula, what are the situations when the cv would not be
sensible? (See also the hint below.)
Hint: Copy all the values from column B (i.e. B1 to B14) into column H, so you can play
without affecting the main data. Then change the last value of 18 to -18. You find that the
cv is now 132%.
There is nothing intrinsically wrong with a cv of over 100%, but calculating the cv with
negative values (even zeros) is starting to look odd.
c. You can make it a nightmare, by replacing the 14 by -44 (say), so the mean is
now 1. What is the cv now?
d. And even more exciting would be to use -51! Why is that exciting?
The use of the coefficient of variation (cv) is also discussed in the presentation.
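The figure quoted in the hint for part b can be reproduced with a short Python sketch of the formula cv = 100*stdev/mean. The function name `cv` is our own, and `statistics.stdev` is used as the stand-in for Excel's STDEV because both divide by (n-1).

```python
import statistics

def cv(xs):
    """Coefficient of variation as a percentage: 100 * stdev / mean."""
    return 100 * statistics.stdev(xs) / statistics.mean(xs)

# The data from the hint: the last value of 18 changed to -18.
hint_values = [12, 15, 11, 18, 13, 14, -18]
print(round(cv(hint_values)))   # 132
```

You can try parts c and d by editing the list in the same way as you would edit column H.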