CBP Oct 2012.Computing Projects Descriptive Stats Notes
Descriptive Statistics
Introduction
Today we discussed a couple of scenarios where quantitative methods, and in particular
statistics, can help us answer a research question.
The first scenario involved an initial situation, an intervention, and a final, possibly
changed situation. For example, a machine has been making “round things” for several
years, then it is serviced (the intervention). Following the service, 50 round things are
made and compared with the round things made before the service, to see if there is
any change.
[Figure: sketch of the round things before and after the intervention.]
What differences can we see? First the similarities! The machine is producing round
things of about the same size as before the intervention. But now there is less variation in
the size of the round things, the machine has apparently become more ‘accurate’. So we
suspect that the service has improved the machine, but there is the chance that this is not
the case; we may have accidentally chosen 50 good round things by chance. Before the
intervention, the machine had produced millions of round things, and its population of
round things was well-known. But our 50 round things are a sample, and it is not clear
that this sample is representative of the new population of round things. As stated, we
could have been really unlucky, and have accidentally chosen 50 ‘good’ round things by
chance.
Here’s another example. The performance of pupils in secondary physics exams is well
known; these exams have been running for tens of years with thousands of pupils per
year. So their scores form a population. Some bright young teacher introduces some new
learning strategy or a piece of educational software, perhaps an educational game. To test
the effectiveness of the intervention, she records the scores of 30 pupils following the
intervention. Here are the results.
Before the intervention: 6, 7, 10, 8, 4, 5, 7, 6, 7, 8, 10, 4, …
After the intervention: 9, 8, 7, 8, 8, 10, 7, 9, 6, 8, 7, 9, …
Again, we ask the question: has there been any effect due to the teacher’s intervention?
At face value there has, since the test scores have apparently, on average, increased. But
again, the 30 pupils she tested may have been brighter than the average pupil tested in the
past; she may just have happened to select some brighter pupils by chance. In other words,
the sample may not be representative of the population, and any conclusion that the
intervention had a positive effect may be flawed.
The work we are doing over the next couple of sessions addresses this issue, and presents
an approach where we can ‘become confident’ that an intervention has had an effect, so
we can say ‘I am confident that the chance the observed effect was real is 95%’, or
equally, that the chance the observed effect occurred by chance is 5%. That’s the best
statistics can do: it cannot prove that the observed effect was real, only increase our
confidence in the reality!
As an aside at this point, we asked ourselves how we actually made the comparison
between the ‘before’ and ‘after’ situations noted above. It is impossible to compare each
‘after’ number with each ‘before’, since the latter may involve millions of numbers. We
agreed that we could describe a population, or a sample using just two numbers instead of
millions. These were
(i) The average or ‘mean’ size or score
(ii) Some measure of the ‘spread’ of observed sizes around the mean.
We briefly discussed a second scenario, where we wished to compare two populations.
For example, scores on a physics test for the population of girls and for the population of
boys. Here’s the situation (note I’m not saying which group is boys or girls!)
First group: 7, 6, 8, 4, 6, 5, 7, 3, 8, 4, 6, 3, 8, 6, 4, 5, 6 (mean? spread?)
compare!
Second group: 7, 9, 7, 8, 6, 9, 7, 7, 8, 6, 9, 7, 6, 9, 9, 7, 8, 9 (mean? spread?)
So the scores recorded above are for two samples of, say, 30 boys and 30 girls. We
assume that these samples are representative of the two populations of girls and boys of
this age studying physics (although there may be other factors, such as state or private
education; that’s another issue). Let’s look at the above sample scores. One sample is
apparently performing better than the other. So can we conclude that one gender is
performing better? Unfortunately no, since we may have been unlucky in our choice of
pupils: we may have chosen good students from one gender sample and bad students
from the other gender sample by chance. Again, we shall see in the next two sessions
how to understand the effects of chance, and how to become confident to a degree of
95% that the observed difference is real, i.e. that there is only a 5% chance that the
difference occurred by pure chance.
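To make the "two numbers" idea concrete for the two groups of physics scores, here is a minimal Python sketch. It uses only the scores actually printed in these notes (17 and 18 values, not the full 30), and it takes the spread to be the standard deviation, which the notes only introduce further on. The names `first_group`, `second_group` and `describe` are made up for this illustration.

```python
import statistics

# Scores exactly as printed in the notes (17 and 18 values; the text says "say 30").
first_group = [7, 6, 8, 4, 6, 5, 7, 3, 8, 4, 6, 3, 8, 6, 4, 5, 6]
second_group = [7, 9, 7, 8, 6, 9, 7, 7, 8, 6, 9, 7, 6, 9, 9, 7, 8, 9]


def describe(scores):
    """Return (mean, spread), with spread taken as the standard deviation."""
    return statistics.mean(scores), statistics.pstdev(scores)


first_mean, first_spread = describe(first_group)
second_mean, second_spread = describe(second_group)

print(f"first group:  mean {first_mean:.2f}, spread {first_spread:.2f}")
print(f"second group: mean {second_mean:.2f}, spread {second_spread:.2f}")
```

The second group has the higher mean and the smaller spread, but as argued above, that alone does not tell us whether the difference is real or just down to the pupils we happened to pick.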
Measures of Average and Spread.
To make these ideas a little more concrete, we discussed the notions of ‘average’ and
‘spread’ using shoe-size data collected from a sample of UW boys and girls. The data for
the girls is summarised in the stem-and-leaf plot below:
  7   8   4   4   1
      x
  x   x
  x   x
  x   x
  x   x   x   x
  x   x   x   x
  x   x   x   x
  x   x   x   x   x
  4   5   6   7   8
The numbers in the bottom row are shoe-sizes, the x’s represent cases (people asked) and
the numbers on the top row are the totals for each shoe size. This plot shows the
frequency distribution for the sample taken.
One measure of ‘average’ is the mean, which is the sum of values divided by the number
of values. So for the above, we calculate
(7×4 + 8×5 + 4×6 + 4×7 + 1×8) / 24 = 128 / 24 ≈ 5.3
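The same calculation can be sketched in a few lines of Python, working directly from the totals in the plot. The dictionary name `counts` is just an illustrative choice.

```python
# Shoe size -> number of girls with that size, read off the plot above.
counts = {4: 7, 5: 8, 6: 4, 7: 4, 8: 1}

n = sum(counts.values())                                # number of data values
total = sum(size * k for size, k in counts.items())     # sum of all the values
mean = total / n

print(n, total, round(mean, 1))
```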
Another measure of ‘average’ is the median which is the value in the middle when all the
values are arranged in order. So we have
4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 6 6 6 6 7 7 7 7 8
We have 24 data values, so the middle falls between the 12th and 13th values. Both of
these are size 5, so the median is 5. The median separates the data distribution into
halves, so half the data points lie at or below the median and half at or above. (With an
even number of data points, as here, the median is taken midway between the two middle
values.) So the average is a mean of 5.3 and a median of 5. These values are close; we
shall return to this later. Now what about the spread of values, how can we characterise
this in simple terms? One way is to identify the sizes corresponding to the 25th and 75th
‘percentiles’. The 25th percentile marks off the lower 25% of the data values. Since we
have 24 values, 25% (i.e. ¼) of this is 6. The 6th data value has size 4, so the lowest 25%
of the data values lie at or below this size. The 75th percentile lies at ¾ of 24, which is
18. The 18th data value has shoe-size 6, so the highest 25% of the data values lie at or
above this size. The meaning of these percentiles becomes a little clearer when we look at
the ‘box-plots’ produced by SPSS below.
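Here is a small Python sketch of the median and the two percentiles for the girls' shoe sizes. The helper `kth_value` is hypothetical: it simply applies the counting rule used in these notes (take the value at position fraction × N in the sorted list). Statistics packages often interpolate between neighbouring values, so they may report slightly different percentiles for the same data.

```python
import statistics

# Shoe size -> number of girls, expanded into the sorted list of 24 values.
counts = {4: 7, 5: 8, 6: 4, 7: 4, 8: 1}
sizes = sorted(size for size, k in counts.items() for _ in range(k))


def kth_value(sorted_values, fraction):
    """Value at position fraction * N (counting from 1), as done by hand in the notes."""
    position = int(fraction * len(sorted_values))
    return sorted_values[position - 1]


median = statistics.median(sizes)      # midway between the 12th and 13th values
lower_quarter = kth_value(sizes, 0.25)  # 6th value
upper_quarter = kth_value(sizes, 0.75)  # 18th value

print(median, lower_quarter, upper_quarter)
```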
A much more fundamental measure of spread was introduced, known as standard
deviation. This is important, since it is based upon theoretical considerations (no, we’re
not going there, just trust me!). The idea here is to consider the deviation of each data
value from the mean value. Here’s a first attempt at a calculation. In this diagram, and all
subsequent discussions, we use the symbol x to refer to the raw data value, and the
symbol μ (Greek letter ‘m’, pronounced ‘mu’) to refer to the mean. Let’s pretend μ = 5
just to simplify the arithmetic. For each data value, we calculate the deviation x − μ and
enter it in a table
x     (x − μ)
4     -1
4     -1
…
5      0
…
6      1
…
7      2
8      3
We want to define a number which indicates the average deviation from the mean. So we
could try to add up all the deviations, and divide by the number of deviations. But this
won’t work, since some deviations are positive and some negative, so these will cancel
and we shall lose information. So we calculate the ‘deviation squared’ which removes the
negatives, like this:
x     (x − μ)    (x − μ)²
4     -1         1
4     -1         1
…
5      0         0
…
6      1         1
…
7      2         4
8      3         9
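To see the cancelling for ourselves, here is a short Python sketch using the girls' shoe sizes. Note one assumption that differs from the table: it uses the actual mean (128/24) rather than the rounded value of 5, because the deviations only cancel exactly around the true mean. Fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

# Shoe size -> number of girls with that size.
counts = {4: 7, 5: 8, 6: 4, 7: 4, 8: 1}

n = sum(counts.values())
mean = Fraction(sum(size * k for size, k in counts.items()), n)  # 128/24 = 16/3

# Adding up the raw deviations: positives and negatives cancel completely.
sum_of_deviations = sum(k * (size - mean) for size, k in counts.items())

# Squaring first removes the signs, so nothing cancels.
sum_of_squares = sum(k * (size - mean) ** 2 for size, k in counts.items())

print(sum_of_deviations, sum_of_squares)
```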
Then we sum the squares of the deviations for all rows in the table, then divide by the
number of rows (data values) and then take the ‘square root’, to reverse the squaring. The
formula for doing this is
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
Fortunately, SPSS will do this for us, but I wanted you to understand what the calculation
is doing and why.
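For those who prefer to see the recipe as code rather than trust SPSS, here is a minimal Python sketch of the steps just described (square the deviations, sum, divide by N, take the square root), applied to the girls' shoe sizes. It divides by N as in the formula above, which matches `statistics.pstdev` in Python's standard library; whether SPSS reports this or the N − 1 version is not stated in these notes.

```python
import math
import statistics

# Shoe size -> number of girls, expanded into the full list of 24 values.
counts = {4: 7, 5: 8, 6: 4, 7: 4, 8: 1}
sizes = [size for size, k in counts.items() for _ in range(k)]

n = len(sizes)
mu = sum(sizes) / n                                   # the mean
squared_deviations = [(x - mu) ** 2 for x in sizes]   # (x - mu)^2 for every row
sigma = math.sqrt(sum(squared_deviations) / n)        # divide by N, then square root

print(round(mu, 2), round(sigma, 2))
print(math.isclose(sigma, statistics.pstdev(sizes)))  # same answer from the library
```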
An Aside: Peas and Mushrooms
Distributions are all around us. We considered the size distributions of a can of peas and
a box of mushrooms. Photos are provided below. Clearly the pea sizes are smaller than
the mushrooms, so the average size of peas is smaller than that of mushrooms. It appears as
though the variation in mushroom size is smaller than that in pea size. This is the opposite of
what we found in class. The samples observed in class came from the Tesco population,
while the pictures below are of a Lidl population. Note that we have split the peas
into a large population on the left and a smaller sample on the right. The question is: is
the distribution of sizes in the sample representative of the population? What do you think?
How did I do on my OOP Exam?
Here we discussed a situation where a student had a score of 76 on his OOP exam. Was
this a good result? Of course we need to know what the maximum score was; well, it was
100. So how did the student perform? 76/100 sounds good! But that depends. It depends
on how other students scored; in other words we must take the mean score into
consideration. Say the mean score was 70. So how did the student perform? He got 6
points above average! Sounds good. But that’s not enough information: if many students
got 10 points above the mean, then he did not do so well. So we need to know the spread
of grades around the mean, which brings us back to the standard deviation (sd).
Let’s consider two cases for this student:
(1) Raw score x = 76. Mean score for class μ = 70 and standard deviation (sd) σ = 3.
Then we see that his score above average (76 – 70 = 6) is 2 sds. We have seen above that
most scores are contained within 2 sds of the mean. So few scores exist beyond this. So
this student is in a good position, since fewer students have scored more than he has. We
saw this on a distribution plot of the grades shown below.
[Figure: probability density and cumulative probability of the exam scores (0 to 100), with the mean marked and the area up to 76 shaded and labelled 97.72%.]
The blue curve shows the distribution of scores for the given mean and sd for the test.
The shaded area shows that 97.72% of the students attained a score of 76 or below. In
other words 2.28% of the students scored higher. So this student can feel proud.
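The percentages can be checked with a short Python sketch, under the assumption (not yet justified at this point in the notes) that the scores follow a normal curve with mean 70 and sd 3. It uses `NormalDist` from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Assumed model: exam scores are normally distributed with mean 70 and sd 3.
scores = NormalDist(mu=70, sigma=3)

sds_above_mean = (76 - 70) / 3     # how far 76 is above the mean, in sds
at_or_below = scores.cdf(76)       # fraction of students scoring 76 or less
above = 1 - at_or_below            # fraction of students scoring more than 76

print(sds_above_mean)
print(f"{at_or_below:.2%} at or below, {above:.2%} above")
```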
Let’s consider a slightly different scenario. The same student scored 76 when the mean
was 70 for the class but the class sd was 12. Now his score, relative to the mean is (76 –
70) = 6, but this is 6/12 sds or ½ sd. What does this mean? Well here’s the distribution
for this case:
[Figure: probability density and cumulative probability of the exam scores (0 to 140) for the wider distribution, with the mean marked and the area up to 76 shaded.]
Same score, same mean, but a wider distribution. For a normal curve, half an sd above the
mean means that about 69.15% of students got this grade or less, so about 30.85% of
students scored higher. In this case, the student would not be so proud.
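To see how strongly the spread changes the picture, here is a small Python sketch that keeps the score (76) and the mean (70) fixed and varies only the sd. It assumes the scores follow an exactly normal curve; the function name `percent_at_or_below` is made up for this illustration.

```python
from statistics import NormalDist


def percent_at_or_below(score, mean, sd):
    """Percentage of a normal population with the given mean and sd scoring <= score."""
    return 100 * NormalDist(mu=mean, sigma=sd).cdf(score)


# Same score and same mean; only the spread of the class changes.
for sd in (3, 12):
    below = percent_at_or_below(76, 70, sd)
    print(f"sd = {sd:2d}: {below:.2f}% at or below, {100 - below:.2f}% above")
```

The narrower the distribution, the more impressive those same 6 points above the mean become.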
The message is clear. It is a combination of mean and standard deviation which is
fundamental in our understanding the meaning of an individual data value in a
distribution, and there is something rather interesting about two standard deviations
above (and below) the mean!
Max Takes Networks and Web Design modules.
Max is a (not-so) hypothetical student who took two exams, in Nets and Web. Here are his
scores: Nets: raw score x = 60, where the class had μ = 50, σ = 10. Web: raw score x =
56, where the class had μ = 48, σ = 4. So how did Max perform relatively on these
exams? Here’s a couple of ‘simple-minded’ approaches:
(1) Very Naïve: Max scored 60 on Nets and 56 on Web so he did better on Nets. No-way!
(2) A bit better: Max scored (60 – 50) = 10 above average on Nets, but (56 – 48) = 8
above average on Web. So he did better on Nets. Mm… not really.
Now let’s try and be reasonable:
(3) Let’s look at the standard deviations .. perhaps we can find the magic 2?
His score on Nets relative to the mean was 10, but the sd was 10, so he scored 1 on
the scale of sds. His score on Web relative to the mean of 48 was 8, but the sd was 4,
so he scored 2 on the scale of sds. Clearly, therefore, Max is a Web person, since only
2.28% of the students outperformed him.
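The three approaches can be laid side by side in a few lines of Python. This is only a sketch of the arithmetic above; the function name `sds_above_mean` is a hypothetical helper, not something from the module.

```python
def sds_above_mean(raw, mean, sd):
    """Score relative to the mean, measured on the scale of sds."""
    return (raw - mean) / sd


# (raw score, class mean, class sd) for each of Max's exams.
exams = {"Nets": (60, 50, 10), "Web": (56, 48, 4)}

for name, (raw, mean, sd) in exams.items():
    print(f"{name}: raw {raw}, above mean {raw - mean}, sds {sds_above_mean(raw, mean, sd)}")

nets_sds = sds_above_mean(*exams["Nets"])
web_sds = sds_above_mean(*exams["Web"])
```

Raw scores and points above the mean both favour Nets; only the sd scale shows that the Web result is the stronger one.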
Aside: The Normal Distribution and Max’s Exams.
OK I’ve been a little bit sneaky in my presentation here; I’ve painted over many cracks,
just to get us moving along in understanding, and have sneaked in one major concept,
without identifying it in detail. It’s concerning the ‘two-sd’ principle. It’s now time to
come-clean about this. We did not discuss this material in class, but one student did raise
the issue. It’s something we shall look at next week in detail, but here’s a hint.
There is one particular frequency distribution which is important since it has a clear
mathematical description, and is amenable to mathematical analysis. Also, this
distribution does appear in actual sample measurements, so it is very useful to us. It is
called the normal distribution. Sketched below, it is a symmetrical distribution (with zero
skew) where the mean and the median are equal. In fact, this is the “unit normal”
distribution where the mean is 0 and the sd is 1.
[Figure: probability density and cumulative probability of the unit normal distribution, plotted from -6 to 6 sds, with the mean marked and the area up to 2 shaded and labelled 97.72%.]
The scale at the bottom is in fact the number of standard deviations, so that in the above
example, the x data value is set to 2 sd’s and we see the magic 97.72%!
We can now perhaps better understand our discussion of Max’s results. For the Web exam,
we said (above):
“His score on the Web relative to the mean of 48 was 8, but the sd was 4, so he scored 2
on the scale of sds”
What we were in fact doing was to transform his grade to this unit normal distribution.
First, in calculating his score “relative to the mean of 48” we subtracted the mean from
his score, ie (56 – 48) = 8. We can write this down as a formula
$$(x - \mu)$$
Second, in saying “he scored 2 on the scale of sds” we were dividing the 8 by the sd (=4)
giving us the magic 2 (sds). So the above formula becomes
$$z = \frac{x - \mu}{\sigma}$$
where the z is called the z-score. This is relative to the unit normal distribution. Using this
formula to calculate Max’s scores we find for Networks
$$z = \frac{60 - 50}{10} = 1$$
and for Web
$$z = \frac{56 - 48}{4} = 2$$
This of course assumes that the distributions in both classes were normal. But we shall
see important cases when we can expect this to be the case. Next week.
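As a small preview, here is a Python sketch of what "transforming to the unit normal" buys us. Under the assumption just stated (both classes' scores are normally distributed), looking up Max's raw score on his class's own curve gives the same percentage as looking up his z-score on the unit normal curve. The dictionary `results` is only there so the numbers can be inspected afterwards.

```python
import math
from statistics import NormalDist

unit_normal = NormalDist()  # mean 0, sd 1

# (raw score x, class mean mu, class sd sigma) for each exam.
exams = {"Networks": (60, 50, 10), "Web": (56, 48, 4)}

results = {}
for name, (x, mu, sigma) in exams.items():
    z = (x - mu) / sigma                              # the z-score
    on_class_curve = NormalDist(mu, sigma).cdf(x)     # raw score on the class's own curve
    on_unit_curve = unit_normal.cdf(z)                # z-score on the unit normal curve
    results[name] = (z, on_class_curve, on_unit_curve)
    print(f"{name}: z = {z}, {on_unit_curve:.2%} of the class at or below")

# Both routes give the same answer for each exam.
print(all(math.isclose(a, b) for _, a, b in results.values()))
```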