CBP Oct 2012. Computing Projects Descriptive Stats Notes

Descriptive Statistics

Introduction

Today we discussed a couple of scenarios where quantitative methods, and in particular statistics, can help us answer a research question. The first scenario involved a beginning situation, an intervention, and a final, possibly changed, situation. For example, a machine has been making “round things” for several years, then it is serviced (the intervention). Following the service, 50 round things are made and compared with the round things made before the service, to see if there is any change. Here’s a sketch of the round things before and after the service:

[Sketch: round things before the intervention and after the intervention]

What differences can we see? First the similarities! The machine is producing round things of about the same size as before the intervention. But now there is less variation in the size of the round things; the machine has apparently become more ‘accurate’. So we suspect that the service has improved the machine, but there is the chance that this is not the case; we may have chosen 50 good round things by chance. Before the intervention, the machine had produced millions of round things, and its population of round things was well known. But our 50 round things are a sample, and it is not clear that this sample is representative of the new population of round things. As stated, we could have been really unlucky, and have chosen 50 ‘good’ round things by chance.

Here’s another example. The performance of pupils in secondary physics exams is well known; these exams have been running for tens of years with thousands of pupils per year. So their scores form a population. Some bright young teacher introduces a new learning strategy or a piece of educational software, perhaps an educational game. To test the effectiveness of the intervention, she records the scores of 30 pupils following the intervention. Here are the results.

Before the intervention: 6, 7, 10, 8, 4, 5, 7, 6, 7, 8, 10, 4 ….
After the intervention: 9, 8, 7, 8, 8, 10, 7, 9, 6, 8, 7, 9 ….

Again, we ask the question: has there been any effect due to the teacher’s intervention? At face value there has, since the test scores have apparently, on average, increased. But again, the 30 pupils she tested may have been brighter than the average pupil tested in the past; she may just have happened to select some brighter pupils by chance. In that case the sample would not be representative of the population, and any conclusion that the intervention had a positive effect may be flawed.

The work we are doing over the next couple of sessions addresses this issue, and presents an approach where we can ‘become confident’ that an intervention has had an effect, so that we can say ‘I am confident that the chance that the observed effect was real is 95%’ or, equally, that the chance that the observed effect occurred by chance is 5%. That’s the best statistics can do: it cannot prove that the observed effect was real, only increase our confidence that it is!

As an aside at this point, we asked ourselves how we actually made the comparison between the ‘before’ and ‘after’ situations noted above. It is impossible to compare each ‘after’ number with each ‘before’ number, since the latter may involve millions of numbers. We agreed that we could describe a population, or a sample, using just two numbers instead of millions. These were:
(i) The average or ‘mean’ size or score.
(ii) Some measure of the ‘spread’ of observed sizes or scores around the mean.

We briefly discussed a second scenario, where we wished to compare two populations. For example, scores on a physics test for the population of girls and for the population of boys. Here’s the situation (note I’m not saying which group is boys or girls!)

First sample: 7, 6, 8, 4, 6, 5, 7, 3, 8, 4, 6, 3, 8, 6, 4, 5, 6 (mean? spread?)
compare!
Second sample: 7, 9, 7, 8, 6, 9, 7, 7, 8, 6, 9, 7, 6, 9, 9, 7, 8, 9 (mean? spread?)
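The “mean?” and “spread?” questions in the sketch can be answered for the values that are actually listed. What follows is a minimal Python sketch, not the method used in class. It assumes only the 17 and 18 scores printed above (not the full samples), and it uses the population standard deviation from Python’s statistics module as one possible measure of spread, a quantity these notes define further on. The names first_sample, second_sample and describe are made up for the illustration.

```python
from statistics import mean, pstdev

# Scores as printed in the notes (assumption: these listed values only)
first_sample = [7, 6, 8, 4, 6, 5, 7, 3, 8, 4, 6, 3, 8, 6, 4, 5, 6]
second_sample = [7, 9, 7, 8, 6, 9, 7, 7, 8, 6, 9, 7, 6, 9, 9, 7, 8, 9]

def describe(scores):
    """Return (mean, population standard deviation) for a list of scores."""
    return mean(scores), pstdev(scores)

first_mean, first_spread = describe(first_sample)
second_mean, second_spread = describe(second_sample)

print(f"first sample:  mean {first_mean:.2f}, spread {first_spread:.2f}")
print(f"second sample: mean {second_mean:.2f}, spread {second_spread:.2f}")
```

On these listed values the second sample has the higher mean (about 7.7 against about 5.6) and the smaller spread. That alone does not show that the difference between the populations is real.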
So the scores recorded above are two samples of, say, 30 boys and 30 girls. We assume that these samples are representative of the two populations of girls and boys of this age studying physics. (There may be other factors, such as state or private education, but that’s another issue.) Let’s look at the above sample scores. One sample is apparently performing better than the other. So can we conclude that one gender is performing better? Unfortunately no, since we may have been unlucky in our choice of pupils: we may have chosen good students from one gender sample and bad students from the other gender sample by chance. Again, we shall see in the next two sessions how to understand the effects of chance, and how to become confident to a degree of 95% that the observed difference is real, i.e. that there is only a 5% chance that the difference occurred by pure chance.

Measures of Average and Spread.

To make these ideas a little more concrete, we discussed the notions of ‘average’ and ‘spread’ using shoe-size data collected from a sample of UW boys and girls. The data for the girls is summarised in the branch-and-leaf plot below:

Totals:     7  8  4  4  1
               x
            x  x
            x  x
            x  x
            x  x  x  x
            x  x  x  x
            x  x  x  x
            x  x  x  x  x
Shoe size:  4  5  6  7  8

The numbers in the bottom row are shoe-sizes, the x’s represent cases (people asked) and the numbers on the top row are the totals for each shoe size. This plot shows the frequency distribution for the sample taken.

One measure of ‘average’ is the mean, which is the sum of the values divided by the number of values. So for the above, we calculate

\[ \text{mean} = \frac{(7 \times 4) + (8 \times 5) + (4 \times 6) + (4 \times 7) + (1 \times 8)}{24} = \frac{128}{24} \approx 5.3 \]

Another measure of ‘average’ is the median, which is the value in the middle when all the values are arranged in order. So we have

4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 6 6 6 6 7 7 7 7 8

We have 24 data values, so the middle falls between the 12th and 13th values. Both of these are size 5, so the median is 5.
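The mean and median just calculated can be reproduced from the frequency table. This is a minimal Python sketch, assuming the counts read off the plot (7, 8, 4, 4 and 1 for sizes 4 to 8); the variable names are illustrative only.

```python
from statistics import mean, median

# Frequency table from the plot: shoe size -> number of girls (24 in total)
frequencies = {4: 7, 5: 8, 6: 4, 7: 4, 8: 1}

# Expand the table into the ordered list of individual shoe sizes
sizes = [size for size, count in sorted(frequencies.items()) for _ in range(count)]

sample_mean = mean(sizes)      # sum of values divided by number of values
sample_median = median(sizes)  # middle value of the ordered list

print(f"n = {len(sizes)}, mean = {sample_mean:.1f}, median = {sample_median}")
```

With an even number of values, statistics.median averages the two middle values (the 12th and 13th here); both are size 5, so the result agrees with the hand calculation.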
The median separates the data distribution into halves, so half the data points are under the median and half are above. (The only complication is where we have an even number of data points.) So the average is a mean of 5.3 and a median of 5. These values are close; we shall return to this later.

Now what about the spread of values, how can we characterise this in simple terms? One way is to identify the sizes corresponding to the 25th and 75th ‘percentiles’. The 25th percentile marks off the lower 25% of the data values. Since we have 24 values, 25% (i.e. ¼) of this is 6. Now the 6th data value has size 4, so 25% of the data values lie at or below this. The 75th percentile lies at ¾ of 24, which is 18. Now the 18th data value has shoe-size 6, so 25% of the data values are size 6 or above. The meaning of these percentiles becomes a little clearer when we look at the ‘box-plots’ produced by SPSS.

[Figure: SPSS box-plots]

A much more fundamental measure of spread was introduced, known as the standard deviation. This is important, since it is based upon theoretical considerations (no, we’re not going there, just trust me!). The idea here is to consider the deviation of each data value from the mean value. Here’s a first attempt at a calculation. In this table, and in all subsequent discussions, we use the symbol x to refer to the raw data value, and the symbol μ (Greek letter ‘m’, pronounced ‘miu’) to refer to the mean. Let’s pretend μ = 5 just to simplify the arithmetic. For each data value, we calculate the deviation (x − μ) and enter it in a table:

x          4    4   …   5   …   6   …   7   8
(x − μ)   −1   −1   …   0   …   1   …   2   3

We want to define a number which indicates the average deviation from the mean. So we could try to add up all the deviations, and divide by the number of deviations. But this won’t work, since some deviations are positive and some negative, so these will cancel and we shall lose information.
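The cancellation can be checked directly on the girls’ shoe-size sample (7 of size 4, 8 of size 5, 4 of size 6, 4 of size 7 and 1 of size 8). This is a minimal Python sketch under one stated assumption: it uses the exact mean 128/24 rather than the rounded value 5 used in the table, because the deviations cancel exactly only around the true mean. Fractions are used to avoid rounding error; the variable names are illustrative.

```python
from fractions import Fraction

# Shoe-size frequencies from the girls' sample: size -> count
frequencies = {4: 7, 5: 8, 6: 4, 7: 4, 8: 1}
sizes = [size for size, count in sorted(frequencies.items()) for _ in range(count)]

# Exact mean (assumption: the true mean 128/24, not the rounded 5)
mu = Fraction(sum(sizes), len(sizes))

# Deviation of each value from the mean; negatives and positives cancel
deviations = [x - mu for x in sizes]
total_deviation = sum(deviations)

print(f"mean = {mu}, sum of deviations = {total_deviation}")
```

The sum of the deviations is zero whatever the spread of the data, so an average built from it says nothing about spread.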
So we calculate the ‘deviation squared’, which removes the negatives, like this:

x          4    4   …   5   …   6   …   7   8
(x − μ)   −1   −1   …   0   …   1   …   2   3
(x − μ)²   1    1   …   0   …   1   …   4   9

Then we sum the squares of the deviations for all rows in the table, divide by the number of rows (data values), and then take the square root, to reverse the squaring. The formula for doing this is

\[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \]

where σ is the standard deviation and N is the number of data values. Fortunately, SPSS will do this for us, but I wanted you to understand what the calculation is doing and why.

An Aside: Peas and Mushrooms

Distributions are all around us. We considered the size distributions of a can of peas and a box of mushrooms. Photos are provided below.

[Photos: peas and mushrooms]

Clearly the peas are smaller than the mushrooms, so the average size of the peas is smaller than that of the mushrooms. It appears as though the variation in mushroom size is smaller than the variation in pea size. This is the opposite of what we found in class. The samples observed in class came from the Tesco population, while the pictures are of a Lidl population. Note that in the photo on the left we have split the peas into a large population (on the left) and a smaller sample (on the right). The question is, is the distribution of sizes in the sample representative of the population? What do you think?

How did I do on my OOP Exam?

Here we discussed a situation where a student had a score of 76 on his OOP exam. Was this a good result? Of course we need to know what the maximum score was; well, it was 100. So how did the student perform? 76/100 sounds good! But that depends. It depends on how other students scored; in other words, we must take the mean score into consideration. Say the mean score was 70. So how did the student perform? He got 6 points above average! Sounds good. But that’s not enough information: if many students got 10 points above the mean, then he did not do so well.
So we need to know the spread of grades around the mean, which brings us back to the standard deviation (sd). Let’s consider two cases for this student:

(1) Raw score x = 76. The mean score for the class was 70 and the standard deviation (sd) was 3. Then we see that his score above average (76 – 70 = 6) is 2 sds. We have seen above that most scores are contained within 2 sds of the mean, so few scores exist beyond this. So this student is in a good position, since few students have scored more than he has. We saw this on a distribution plot of the grades, shown below.

[Plot: probability density and cumulative probability against score (0 to 100), with the mean marked and 97.72% of the area shaded]

The blue curve shows the distribution of scores for the given mean and sd for the test. The shaded area shows that 97.72% of the students attained a score of 76 or below. In other words, 2.28% of the students scored higher. So this student can feel proud.

(2) Let’s consider a slightly different scenario. The same student scored 76 when the mean for the class was 70, but the class sd was 12. Now his score relative to the mean is (76 – 70) = 6, but this is 6/12 sds, or ½ sd. What does this mean? Well, here’s the distribution for this case:

[Plot: probability density and cumulative probability against score (0 to 140), with the mean marked and 71.23% of the area shaded]

Same score, same mean, but a wider distribution. In this case we find that 71.23% of students got this grade or less, so 28.77% of students scored higher. In this case, the student would not be so proud.

The message is clear. It is the combination of mean and standard deviation which is fundamental to our understanding of the meaning of an individual data value in a distribution, and there is something rather interesting about two standard deviations above (and below) the mean!

Max Takes Networks and Web Design modules.
Max is a (not-so) hypothetical student who took two exams, in Nets and Web. Here are his scores:

Nets: raw score x = 60, where the class had μ = 50, σ = 10.
Web: raw score x = 56, where the class had μ = 48, σ = 4.

So how did Max perform relatively on these exams? Here are a couple of ‘simple-minded’ approaches:

(1) Very naïve: Max scored 60 on Nets and 56 on Web, so he did better on Nets. No way!
(2) A bit better: Max scored (60 – 50) = 10 above average on Nets, but (56 – 48) = 8 above average on Web. So he did better on Nets. Mm… not really.

Now let’s try and be reasonable:

(3) Let’s look at the standard deviations; perhaps we can find the magic 2? His score on Nets relative to the mean was 10, but the sd was 10, so he scored 1 on the scale of sds. His score on Web relative to the mean of 48 was 8, but the sd was 4, so he scored 2 on the scale of sds. Clearly, therefore, Max is a Web person, since only 2.28% of the students outperformed him.

Aside: The Normal Distribution and Max’s Exams.

OK, I’ve been a little bit sneaky in my presentation here; I’ve painted over many cracks, just to get us moving along in understanding, and have sneaked in one major concept without identifying it in detail. It concerns the ‘two-sd’ principle. It’s now time to come clean about this. We did not discuss this material in class, but one student did raise the issue. It’s something we shall look at next week in detail, but here’s a hint.

There is one particular frequency distribution which is important since it has a clear mathematical description, and is amenable to mathematical analysis. Also, this distribution does appear in actual sample measurements, so it is very useful to us. It is called the normal distribution. Sketched below, it is a symmetrical distribution (with zero skew) where the mean and the median are equal. In fact, this is the “unit normal” distribution, where the mean is 0 and the sd is 1.
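The percentages quoted for the OOP exam can be checked against the normal curve. This is a minimal Python sketch, assuming the exam scores follow a normal distribution exactly; normal_cdf is a hypothetical helper written for this illustration (it is not an SPSS function), built on the error function math.erf from the standard library.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Fraction of a normal population (mean mu, sd sigma) lying at or below x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# OOP exam, case 1: score 76, mean 70, sd 3 (2 sds above the mean)
share_narrow = normal_cdf(76, mu=70, sigma=3)

# OOP exam, case 2: score 76, mean 70, sd 12 (half an sd above the mean)
share_wide = normal_cdf(76, mu=70, sigma=12)

print(f"sd 3:  {share_narrow:.2%} at or below 76")
print(f"sd 12: {share_wide:.2%} at or below 76")
```

The first case gives about 97.7%, matching the shaded area quoted for that plot. The second gives about 69%, a little below the 71.23% read from the plot in these notes. In both versions roughly three students in ten scored higher than 76.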
[Plot: unit normal distribution; probability density and cumulative probability against number of standard deviations (−6 to 6), with the mean marked and 97.72% of the area shaded]

The scale at the bottom is in fact the number of standard deviations, so that in the above example the x data value is set to 2 sds and we see the magic 97.72%! We can perhaps now understand our discussion of Max’s results. We said for the Web exam (above):

“His score on the Web relative to the mean of 48 was 8, but the sd was 4, so he scored 2 on the scale of sds”

What we were in fact doing was transforming his grade to this unit normal distribution. First, in calculating his score “relative to the mean of 48” we subtracted the mean from his score, i.e. (56 – 48) = 8. We can write this down as a formula:

\[ (x - \mu) \]

Second, in saying “he scored 2 on the scale of sds” we were dividing the 8 by the sd (= 4), giving us the magic 2 (sds). So the above formula becomes

\[ z = \frac{x - \mu}{\sigma} \]

where z is called the z-score. This is relative to the unit normal distribution. Using this formula to calculate Max’s scores we find for Networks

\[ z = \frac{60 - 50}{10} = 1 \]

and for Web

\[ z = \frac{56 - 48}{4} = 2 \]

This of course assumes that the distributions in both classes were normal. But we shall see important cases when we can expect this to be the case. Next week.
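Max’s two z-scores, and the share of each class at or below him, can be reproduced in a few lines. This is a minimal Python sketch, assuming (as the notes do) that both classes are normally distributed; z_score is an illustrative helper name, and NormalDist is the normal-distribution class in Python’s statistics module (Python 3.8 or later).

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above the mean."""
    return (x - mu) / sigma

# Max's results as given in the notes
z_nets = z_score(60, mu=50, sigma=10)
z_web = z_score(56, mu=48, sigma=4)

# Unit normal distribution: mean 0, sd 1
unit_normal = NormalDist(mu=0, sigma=1)
below_nets = unit_normal.cdf(z_nets)  # share of the class at or below Max
below_web = unit_normal.cdf(z_web)

print(f"Nets: z = {z_nets}, {below_nets:.2%} at or below")
print(f"Web:  z = {z_web}, {below_web:.2%} at or below")
```

Under the normality assumption, about 84% of the Nets class and about 97.7% of the Web class scored at or below Max, which is the comparison that approach (3) makes.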