Data Analysis using Excel

A great deal of information can be obtained by using the statistical features available in a program such as Excel. The "Data Analysis" facility of Excel is available in the drop-down menu under Tools (usually near the bottom of the list). If it is not available there, you will have to add it in by clicking on "Add-Ins" and ticking "Analysis ToolPak"; this will make "Data Analysis" available in the drop-down menu.

Suppose we copy data as follows into Excel (this is made-up data but might represent the number of insects found on 7 rose bushes in particular conditions):

                    1   2   3   4   5   6   7
Number of Insects   5   6   7   5   7   9   4

To get a variety of data analysis material:
- From Tools, select Data Analysis.
- Click on Descriptive Statistics.
- If you had not first highlighted the data, highlight it as the Input Range. (Note: highlight only the column of actual values.)
- If you include the heading with the data, tick the Labels box.
- The Output Range should be the cell where you want to display the results, e.g. A10.
- Click on Summary Statistics.

You should get something like the following. (You may need to make column A wider. This can be done by double-clicking on the line between A and B at the top of the spreadsheet.)

Number of Insects
Mean                 6.142857143
Standard Error       0.633530224
Median               6
Mode                 5
Standard Deviation   1.67616342
Sample Variance      2.80952381
Kurtosis             0.051881643
Skewness             0.582443916
Range                5
Minimum              4
Maximum              9
Sum                  43
Count                7

The mean, median and mode are, as described, measures of centre. Range, Minimum and Maximum give information on the spread of the values. Sum gives the sum of the 7 values and Count gives the number of data values.

Note: Check a few simple values carefully. It is easy to get wrong information from Excel, which is one of the reasons you need to think carefully about what you put into a statistics package. It is easy to include the label by mistake and get the following output:

5
Mean                 6.333333333
Standard Error       0.714920353
Median               6.5
Mode                 7
Standard Deviation   1.751190072
Sample Variance      3.066666667
Kurtosis             0.014177694
Skewness             0.248278366
Range                5
Minimum              4
Maximum              9
Sum                  38
Count                6

Here the first data item has been counted as a label, and the Count and the heading give an indication that this is a problem. However, if you just trust the computer, you will be performing analysis with wrong values.

Standard Error gives a measure of how much the sample mean may vary from sample to sample. See the notes under "Standard error" below.
Standard Deviation and Sample Variance both give a measure of the spread of the data values. Sample Variance is just the square of the standard deviation. See the notes below.
Kurtosis is a measure of how peaked the distribution is, and is not needed for this course.
Skewness is a measure of whether there is a long tail of data on one side or other of the centre, and is not needed for this course.
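If you want to check these summary values outside Excel, the following is a minimal sketch using Python's built-in statistics module (this is only one way to do it; the variable names are purely illustrative).

import math
import statistics

insects = [5, 6, 7, 5, 7, 9, 4]            # number of insects on the 7 rose bushes

count = len(insects)                        # Count = 7
mean = statistics.mean(insects)             # Mean = 6.1428...
median = statistics.median(insects)         # Median = 6
mode = statistics.mode(insects)             # Mode = 5 (first of the equally common values, Python 3.8+)
sd = statistics.stdev(insects)              # sample Standard Deviation = 1.676...
variance = statistics.variance(insects)     # Sample Variance = 2.8095...
std_error = sd / math.sqrt(count)           # Standard Error = 0.6335...

print(count, mean, median, mode, sd, variance, std_error)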
Standard error

Standard error is the standard deviation of the sample means. Where we have had 7 values and produced a sample mean of 6.14, if we took a different sample of 7 we would not expect to get the same sample mean of 6.14 again. It would be slightly different. If we kept taking samples of 7 from the same population, we would find that if we did a histogram of all the means, they would fit a Normal distribution. They would be centred at the same place the population mean is centred, but they would not be as spread out as the original population.

Because we are averaging 7 values, the effect of any extreme value tends to be cancelled out by the other values, and we are likely to get a much narrower range of values. For example, finding a rose bush with 9 insects would not be unexpected, but finding 7 bushes with a mean number of insects of 9 (that is, finding 63 insects in total on the 7 rose bushes) would be unexpected. The bigger the sample size we take, the smaller this deviation is likely to be. In fact the standard deviation of all the sample means is s/√n, where s is the population standard deviation (well, our estimate of it, since we really are trying to predict this) and n is the sample size. Therefore the standard error found in the Descriptive Statistics output could have been calculated by dividing the standard deviation by the square root of the count: 1.676/√7 = 0.633.

In real life we do not take a lot of samples of size 7. We take one sample and from this we calculate the sample mean and the sample standard deviation. Using this information with the sample size, we make predictions about the whole population. Suppose our sample came from a market garden with a large number of rose bushes, from which we randomly selected 7 to count insects. We found the sample mean was 6.14 and the sample standard deviation was 1.676. We use this to predict that the mean number of insects per rose bush for the whole population of rose bushes is 6.14, and that the population standard deviation is 1.676. If any sample of 7 rose bushes is taken, we predict that the mean of this sample would be 6.14. If a large number of samples of size 7 were taken, we think the means would fit a Normal distribution with mean 6.14 and standard deviation 0.633.

(Figure: the distribution of insects per rose bush, alongside the much narrower distribution of the mean number of insects per rose bush for samples of 7 rose bushes.)

Confidence Intervals

Often we want to predict what the mean value of a particular population is. The confidence interval gives a range of values within which we expect the true mean number of insects present in the set conditions to lie. In Excel this can be found using a similar process to that described for Descriptive Statistics, but as well as ticking Summary Statistics, tick Confidence Level for Mean. This will add a bottom line to the previous summary:

Number of Insects
Mean                      6.142857143
Standard Error            0.633530224
:                         :
Confidence Level(95.0%)   1.550193746

Using this, we can predict that 95% of the time the mean number of insects is between 6.14 - 1.55 and 6.14 + 1.55. That is, the mean number of insects on rose bushes is between 4.59 and 7.69.

It is possible with other analysis to find the confidence interval for the data values themselves, or for proportions, or for the difference between the means of two populations.
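To see the standard error idea in action, here is a minimal simulation sketch, assuming NumPy is available; the Normal population with mean 6.14 and standard deviation 1.676 is an assumption made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
population_mean, population_sd, n = 6.14, 1.676, 7

# Draw 10,000 samples of 7 bushes each and record every sample mean.
sample_means = rng.normal(population_mean, population_sd, size=(10_000, n)).mean(axis=1)

print(sample_means.std(ddof=1))       # close to the standard error...
print(population_sd / np.sqrt(n))     # ...which is s/sqrt(n) = 0.633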
Understanding how this calculation is made when the sample size is less than 30 requires statistics at university level; however, a brief explanation is given here.

When we have a large enough sample size (about 30 or more), we use the Normal distribution to create the predicted range. The Normal tables provided in statistics books give the probability of a value being between 0 and a given number. For example, the number 2.3 gives a table value of 0.4893. This means that the probability that a data value is between the mean and 2.3 standard deviations above the mean is 0.4893. In Excel, =NORMSDIST(2.3) gives 0.989276, which is the probability that a data value is less than (the mean + 2.3 standard deviations). That is, it also includes the 0.5 that lies below the mean.

That is:
Tables give   P(0 < Z < 2.3) = 0.4893
Excel gives   P(Z < 2.3) = 0.9893

(Figure: the corresponding shaded areas under the standard Normal curve up to z = 2.3.)

Note that because the Normal distribution is symmetrical, this one probability value also tells us:
P(-2.3 < Z < 0)    = 0.4893
P(Z > 2.3)         = 0.5 - 0.4893 = 0.0107
P(Z < 2.3)         = 0.5 + 0.4893 = 0.9893
P(Z < -2.3)        = 0.0107
P(Z > -2.3)        = 0.9893
P(-2.3 < Z < 2.3)  = 2 x 0.4893 = 0.9786

The number 1.96 on Normal tables gives probability 0.4750, meaning the probability that a value is between the mean and 1.96 standard deviations above the mean is 0.475. The probability that it is between 1.96 standard deviations below the mean and the mean is also 0.4750. Therefore the probability that a data value is within 1.96 standard deviations of the mean is 0.4750 + 0.4750 = 0.95. From Excel, =NORMSDIST(1.96) gives 0.975002, meaning 2.5% of values will be above a z value of 1.96.

Hence we expect that 95% of the time values will lie within 1.96 standard deviations of the mean, so the 95% confidence interval for a data value is from (mean - 1.96 x std dev) to (mean + 1.96 x std dev). Thus if we know that the number of insects on rose bushes follows a Normal distribution with mean 6.14 and standard deviation 1.676, we expect that 95% of the time the number of insects on any randomly selected rose bush would be between
6.14 - 1.96 x 1.676 and 6.14 + 1.96 x 1.676, that is, 2.855 to 9.425.

We use the sample mean and the sample standard deviation to predict the value of the population mean and the population standard deviation. If the sample is of about size 30 or more, this is usually a reasonable prediction.

t-distribution

When the sample size is less than 30, the values provided in the Normal tables (or by the NORMSDIST function in Excel) are too small. Because we are unsure of the population values, we have to take a wider interval to be 95% confident. The smaller the sample size, the larger the number of standard deviations we have to go out to include the true value 95% of the time. To find how many standard deviations we need to take, we look up t-tables instead of Normal tables. On the computer this means using the TINV function. To use these, however, the number of degrees of freedom (one less than the sample size, in general) also has to be given.

(Figure: the Normal distribution, with z = 1.96, compared with the t distribution, which allows for a smaller sample size and makes the 95% confidence band much wider.)

In our case we had a sample size of 7, so we have 6 degrees of freedom. In Excel we would use =TINV(0.05,6) to get the value 2.446914 as the number of standard deviations that we need to go out. The 95% confidence interval means that there is a 5% chance that we could get a value outside this range. In a two-tailed confidence interval, this means we want 2.5% above the upper value and 2.5% below the lower value. In t-tables we therefore look up the 97.5 percentile with 6 degrees of freedom to get the value 2.447. From this, we can calculate the 95% confidence interval for the number of insects on a rose bush to be between
6.14 - 2.447 x 1.676 and 6.14 + 2.447 x 1.676, that is, 2.0388 to 10.2412.

Note: the Confidence Level (95%) provided by Excel is 1.550193746, whereas the margin calculated above is 2.447 x 1.676 = 4.101172. This is because the value Excel gives is the confidence level for the mean number of insects in samples of size 7, not for an individual value, and 2.447 x 1.676/√7 = 1.55.
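The t multiplier and both intervals can be checked outside Excel. This is a minimal sketch assuming SciPy is available (scipy.stats.t); the numbers are those from the rose bush example above.

from scipy import stats

mean, sd, n = 6.142857143, 1.67616342, 7

# Two-tailed 95% multiplier with n - 1 = 6 degrees of freedom,
# the equivalent of Excel's =TINV(0.05,6).
t_value = stats.t.ppf(0.975, df=n - 1)                  # about 2.447

# 95% interval for an individual rose bush (uses the standard deviation).
print(mean - t_value * sd, mean + t_value * sd)         # about 2.04 to 10.24

# Excel's "Confidence Level(95.0%)" for the mean (uses the standard error).
print(t_value * sd / n**0.5)                            # about 1.55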
If we want a 95% confidence interval for the mean number of insects per bush in a sample of 7 rose bushes, rather than use the standard deviation we use the standard error. Therefore we calculate the 95% confidence interval for the sample mean number of insects per bush to be
6.14 - 2.447 x 1.676/√7 and 6.14 + 2.447 x 1.676/√7, that is, 4.59 to 7.69.

Note: it is also possible to set up confidence intervals for:
- the difference between the means of two populations, where the standard error is found as √(s1²/n + s2²/n), s1 being the standard deviation of one population, s2 the standard deviation of the other and n the sample size; the degrees of freedom are df = (n - 1) when the data is paired, or 2(n - 1) if the data is not paired (that is, there is no link within each row, such as a before-and-after treatment or twins, etc.). Refer to the next section on the difference between means.
- the proportion of an organism that will display a particular behaviour, where the standard error is √(p(1 - p)/n), p being the proportion displaying the property and n the sample size, with df = n - 1.

Difference between two means

Suppose we want to compare some treatments. For example, some process has been carried out to attract insects and the results of the observations are:

Before   5   6   7    5    7   9   4
After    7   9   12   10   8   6   6

Whether there has been an effect or not, we would expect there to be some differences between the results before and after, simply because the results were taken independently and at different times. We want to know if there is a significant difference between the results, so we look at the difference between the means. Our Descriptive Statistics output gives:

                          Before        After
Mean                      6.14285714    8.285714286
Standard Error            0.63353022    0.837066468
Median                    6             8
Mode                      5             6
Standard Deviation        1.67616342    2.214669706
Sample Variance           2.80952381    4.904761905
Kurtosis                  0.05188164    -0.42320671
Skewness                  0.58244392    0.65757466
Range                     5             6
Minimum                   4             6
Maximum                   9             12
Sum                       43            58
Count                     7             7
Confidence Level(95.0%)   1.55019375    2.048229359

But this does not tell us whether there is any significant difference between the two sets of data. There are on average more insects after treatment, but could this just have occurred by chance? We can get a confidence interval for this difference in means.

Mean difference found = 8.285714286 - 6.14285714 = 2.142857
Standard error = √(0.63353022² + 0.837066468²) = 1.049781
The t-value with 12 degrees of freedom and 95% confidence is 2.179.

Therefore 95% of the time we expect the true difference to be between
2.143 - 2.179 x 1.050 and 2.143 + 2.179 x 1.050, that is, -0.144 to 4.43.

Since 0 lies in this interval, the true difference could be 0, so there is not enough evidence to say that this difference is significant.
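The same interval can be reproduced outside Excel; here is a minimal sketch assuming SciPy is available, using the before and after counts listed above.

import math
import statistics
from scipy import stats

before = [5, 6, 7, 5, 7, 9, 4]
after = [7, 9, 12, 10, 8, 6, 6]

diff = statistics.mean(after) - statistics.mean(before)      # 2.142857...

# Standard error of the difference: square root of the sum of the
# squared standard errors of the two samples.
se_before = statistics.stdev(before) / math.sqrt(len(before))
se_after = statistics.stdev(after) / math.sqrt(len(after))
se_diff = math.sqrt(se_before**2 + se_after**2)              # 1.049781...

t_value = stats.t.ppf(0.975, df=12)                          # about 2.179

print(diff - t_value * se_diff, diff + t_value * se_diff)    # about -0.14 to 4.43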
t-test

Another way to determine whether there is a significant difference is to do a t-test. Using the function =TTEST(B2:B8,C2:C8,2,2) we can get the probability of this difference occurring by chance. Note: this function is available in Excel by choosing Insert from the drop-down menu, then Function, and choosing TTEST; this will guide you through what you need to enter. In this case it produces the value 0.063851, so there was about a 6% chance that this difference occurred simply by chance. Traditionally in statistics we consider 5% to be the critical value, so we would report that there was not a significant difference between the before and after treatments in the number of insects.

The t-test involves calculation of the mean difference between the two sets. As in setting up a confidence interval, the test involves determining the probability that the difference we obtained would be this far from 0.

In =TTEST(B2:B8,C2:C8,2,2) the four pieces of information in the brackets are:
- B2:B8: the cells that hold the first set of data.
- C2:C8: the cells that hold the second set of data.
- 2: the number of tails; 2 in this case because we are interested in whether the two treatments are not equal. If we had only been interested in whether the treatment had increased the number of insects, we would have used a one-tailed test. The 5% chance of making a wrong prediction (when we take 0.05 as the critical value) is then all at one tail end instead of 2.5% at each end. (In fact, in this case, if we had used the one-tailed test =TTEST(B2:B8,C2:C8,1,2) we would have had a result of 0.031925, and this would have been a significant result.)
- 2: the final 2 is an indication of the type of data we are using; entering 2, as here, means a two-sample test assuming equal variances. If the data were paired (that is, there is a link within each row, say both values being from Day 1 of a 7-day experiment) we would have entered 1 for this type. If the data come from populations where the variance of each population is different, we would have entered 3 for this type of test.

(Figure: a two-tailed interval with 2.5% at each end, within which we expect the true mean difference to lie 95% of the time, compared with a one-tailed test with 5% of values at one end, where 95% of the time we expect the true mean difference to be greater than the lower bound.)

More details can be obtained using the Data Analysis method in Excel: choose "t-Test: Paired Two Sample for Means" in place of Descriptive Statistics (as described earlier). This produces:

t-Test: Paired Two Sample for Means
                               Before         After
Mean                           6.142857143    8.285714286
Variance                       2.80952381     4.904761905
Observations                   7              7
Pearson Correlation            0.032069736
Hypothesized Mean Difference   0
df                             6
t Stat                         2.073490549
P(T<=t) one-tail               0.041741629
t Critical one-tail            1.943180905
P(T<=t) two-tail               0.083483258
t Critical two-tail            2.446913641

The "t Stat" value shows that the sample difference is 2.07 standard errors away from the hypothesised mean difference of 0. If a one-tailed test is used, the probability of this occurring by chance is 0.0417, because 95% of the time the difference would be less than the critical value 1.94. Therefore, if we are only interested in whether the "After" treatment is greater than the "Before" treatment, we have reason from this data to conclude that this is true.

If a two-tailed test is used (that is, we are interested in whether they are different), the critical t-values are -2.447 and +2.447. Since this sample difference corresponds to a t-value of 2.07, which is inside the 95% confidence interval, we cannot conclude that there is any significant difference.

The actual confidence interval can be calculated using:

Two-tailed confidence interval:
Difference       2.142857143   found by subtracting the means
Std error diff   1.049781318   found by taking the square root of the sum of the variances divided by 7
t value          2.178813      found by TINV(0.05,12), for 95% confidence and 12 degrees of freedom
Lower bound      -0.14442      difference - t value x std error diff
Upper bound      4.430134      difference + t value x std error diff

The 95% confidence interval is from -0.14442 to 4.430134.
Conclusion: since 0 is inside this interval, we do not have enough evidence to show that there is a significant difference.
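A minimal sketch of both versions of the test, assuming SciPy is available: ttest_ind corresponds to the two-sample =TTEST(...,2,2) used above, and ttest_rel to the paired version produced by the Data Analysis tool.

from scipy import stats

before = [5, 6, 7, 5, 7, 9, 4]
after = [7, 9, 12, 10, 8, 6, 6]

# Two-sample, equal-variance t-test: matches =TTEST(B2:B8,C2:C8,2,2).
t_two_sample, p_two_sample = stats.ttest_ind(before, after)
print(p_two_sample)                 # about 0.064, not significant at the 5% level

# Paired t-test: matches the "t-Test: Paired Two Sample for Means" output.
t_paired, p_paired = stats.ttest_rel(before, after)
print(abs(t_paired), p_paired)      # |t| about 2.07, two-tail p about 0.083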
ANOVA

This fancy-sounding title just stands for ANalysis Of Variance. It is used where two or more sets of data are being compared for differences between their means.

For the above data, this is produced in Excel as follows:
- From the Tools menu select Data Analysis, then "Anova: Single Factor".
- In the options available, enter the Input Range as B1:C8 (or wherever the data is entered).
- For Grouped By, select Columns.
- Tick the Labels in First Row box.
- For the Output Range, choose cell A18 (say).
- Click OK.

You should get the following output:

Anova: Single Factor

SUMMARY
Groups   Count   Sum   Average       Variance
Before   7       43    6.142857143   2.809524
After    7       58    8.285714286   4.904762

ANOVA
Source of Variation   SS           df   MS            F          P-value    F crit
Between Groups        16.0714286   1    16.07142857   4.166667   0.063851   4.747221
Within Groups         46.2857143   12   3.857142857
Total                 62.3571429   13

The values from this output that are useful for reporting the results of an experiment are the averages, the variances and the P-value. The mean number of insects before treatment and the mean number after treatment are given in the Average column. To give the standard deviations of these, the square root of each variance can be calculated. The P-value (the same as calculated by the t-test, because there were only 2 treatments being compared) is the probability that such a difference happened by chance.

SS means sum of squares and relates to the calculation of the variances (which are just the result of adding up squared deviations from the mean).
df gives the degrees of freedom (it is not essential to understand these for this course; there are 2 treatments, and 1 is subtracted to give the degrees of freedom for treatments).
MS is the mean squared deviation (SS divided by df).
F is the ratio of the variance between groups to the variance within groups (MS between / MS within).
F crit is the value above which any F value would be considered to mean that there was a significant difference. As can be seen, in this case F is just below F crit, so the result is said to be not significant.

It is more likely that you will use ANOVA with a greater number of treatments, e.g.:

Temperature   Batch 1   Batch 2   Batch 3   Batch 4
5°C           14        15        18        11
10°C          12        14        8         17
15°C          8         4         3         6
20°C          10        8         5         7
25°C          15        12        7         16

producing:

Anova: Single Factor

SUMMARY
Groups   Count   Sum   Average   Variance
5        4       48    12        10
10       4       35    8.75      6.25
15       4       21    5.25      4.916667
20       4       40    10        12.66667
25       4       60    15        4.666667

ANOVA
Source of Variation   SS      df   MS       F          P-value    F crit
Between Groups        211.7   4    52.925   6.873377   0.002375   3.055568
Within Groups         115.5   15   7.7
Total                 327.2   19

In this case your conclusion would be that temperature has a significant effect on whatever is being measured. The P-value is well below 1%, so the result is highly significant. This does not tell you what the actual effects are, though, so you would need to look at a graph or compare pairs of treatments to determine any pattern or relationship.
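A minimal sketch of the same single-factor ANOVA, assuming SciPy is available; it uses the before and after insect counts, and the F and P-values should agree with the first worksheet output above.

from scipy import stats

before = [5, 6, 7, 5, 7, 9, 4]
after = [7, 9, 12, 10, 8, 6, 6]

# One-way (single factor) ANOVA comparing the group means.
f_stat, p_value = stats.f_oneway(before, after)
print(f_stat, p_value)   # about F = 4.17, P-value = 0.064

# With more treatments, pass each group as an extra argument, e.g.
# stats.f_oneway(group_5C, group_10C, group_15C, group_20C, group_25C).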
Chi-squared test

The most common application of chi-squared is to compare the observed counts of particular cases with the expected counts. If these counts are less than 5, the test does not work well.

Example: The results of 88 slaters turning have been recorded. 36 slaters were in the control group and were not forced to turn previously, and 52 were forced to turn left before getting a choice of direction to turn.

                    Number turning left   Number turning right
Forced left turn    38                    14
No turn (control)   16                    20

This has totals of 54 turning left, 34 turning right, and 88 slaters in total.

If these events had occurred completely randomly, we would have expected:

                    Number turning left   Number turning right
Forced left turn    54/88 x 52 = 31.9     34/88 x 52 = 20.1
No turn (control)   54/88 x 36 = 22.1     34/88 x 36 = 13.9

χ² = (38 - 31.9)²/31.9 + (14 - 20.1)²/20.1 + (16 - 22.1)²/22.1 + (20 - 13.9)²/13.9 = 7.378

The χ² critical value with (2 - 1)(2 - 1) = 1 degree of freedom is 3.84 (found in Excel using =CHIINV(0.05,1), which gives 3.84145534). The number of degrees of freedom is found by (number of rows of data - 1) x (number of columns of data - 1). Therefore, if we had had 3 choices (forced left turn, no turn, forced right turn), the number of degrees of freedom would have been (3 - 1)(2 - 1) = 2.

Since the value we have calculated (7.378) is greater than 3.84, we can conclude that whether a slater turns left or right is not independent of whether it was forced to turn left last time or not.

In Excel, the expected values have to be calculated and stored in cells laid out in the same way as the observed values. This can be done using cell references, so that no mistakes are made in retyping numbers. The function =CHITEST can then be entered by giving the range of cells where the observed data is and the range of cells where the expected data is. This returns the probability of such data happening randomly if the two factors being looked at are independent.

Example:

Observed                          Recovery rate
             Immediate   Within 2 days   Within 4 days   Not   Total
Control      3           8               15              24    50
Medicine 1   6           15              8               3     32
Medicine 2   4           7               17              18    46
Total        13          30              40              45    128

Expected                          Recovery rate
             Immediate   Within 2 days   Within 4 days   Not
Control      5.078125    11.71875        15.625          17.578125
Medicine 1   3.25        7.5             10              11.25
Medicine 2   4.671875    10.78125        14.375          16.171875

Each expected value is calculated by dividing the column total by the grand total and multiplying by the row total. For example, to calculate the expected value in the cell for the control group with immediate recovery, enter = in the cell, click into the cell with the Immediate total, type /, click into the cell with the grand total, type *, and click into the cell with the Control total, giving 13/128 x 50 = 5.078125.

Chi-squared result: 0.000870912 (found by entering =CHITEST(B3:E5,B11:E13)).

Note: the slater turning example could be analysed showing the calculations of (Obs - Exp)² as:

Actual      Forced     Control
Left        38         16         54
Right       14         20         34
            52         36         88

Expected    Forced     Control
Left        31.90909   22.09091
Right       20.09091   13.90909

(Obs - Exp)²   Exp        (Obs - Exp)²/Exp
37.09917       31.90909   1.162652
37.09917       20.09091   1.846565
37.09917       22.09091   1.679386
37.09917       13.90909   2.667261
                          Sum: 7.355865

0.006684   probability found by =CHITEST(B2:C3,B7:C8)
3.841455   the critical value of chi-squared, found using =CHIINV(0.05,1)
0.006684   probability found by =CHIDIST(H12,1), where H12 contains the sum
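A minimal sketch of the slater example, assuming SciPy is available; chi2_contingency works out the expected counts and the statistic in one step, and correction=False reproduces the hand calculation above.

from scipy import stats

# Observed counts: rows are forced left turn and no turn (control),
# columns are turned left and turned right.
observed = [[38, 14],
            [16, 20]]

chi2, p, df, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, p, df)    # about 7.356, 0.00668, 1 degree of freedom
print(expected)       # about [[31.9, 20.1], [22.1, 13.9]]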
A warning with statistical interpretation: lurking variables

Statistics can only be as good as the person who designs the experiment and considers all possible influences. If the experimenter does not consider the effect of other variables, wrong conclusions can be drawn.

Example 1: If the length of boys' trousers is plotted against their reading level, it may appear that the longer the trousers, the better they can read. In reality the increase in height has largely come about with an increase in age, and so has the reading level, so a child's age not being considered can cause misconceptions. Age is the lurking variable. Just because there is a relationship between two things does not mean that one is the cause of the other.

(Figure: "Reading ability and trouser length", a scatter plot of reading level against trouser length (cm) showing an apparent upward trend.)

A conclusion that you can improve your reading ability by lengthening your trousers would be a sad consequence of this!

Example 2: The following graph would lead to the conclusion that there is no relationship between moisture and wheat production.

(Figure: "Wheat production and moisture", a scatter plot of wheat yield (tonnes/ha) against moisture (mm/day) showing no apparent relationship.)

However, when the lurking variable temperature is considered, the results can be interpreted differently.

(Figure: "Moisture effect on wheat yield", the same yield data plotted against moisture but separated into sunny, mild and cold conditions.)

Example 3: Consider a new treatment being applied to a group of randomly selected patients. Results for the group are:

All patients         Improved   Not improved   Percentage improved
New treatment        20         20             50%
Standard treatment   24         16             60%
TOTAL                44         36             55%

It appears that the new treatment is not as effective as the standard treatment (60% improved using the standard treatment and only 50% improved using the new treatment). However, when this group is broken down by gender, the following is found:

Male patients        Improved   Not improved   Percentage improved
New treatment        12         18             40%
Standard treatment   3          7              30%
TOTAL                15         25             38%

Female patients      Improved   Not improved   Percentage improved
New treatment        8          2              80%
Standard treatment   21         9              70%
TOTAL                29         11             73%

It is clear that the new treatment has had a higher success rate for both males and females. The problem here is that females have a much higher rate of improvement than males, and most women were given the standard treatment. Gender is a lurking variable, and the gender imbalance has caused a problem that the overall statistical analysis could not pick up. This is called Simpson's Paradox.
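The reversal in Example 3 can be checked with a few lines of arithmetic. This is a minimal plain-Python sketch using the counts from the tables above; the variable names are purely illustrative.

# (improved, not improved) counts taken from the tables above
new_male, new_female = (12, 18), (8, 2)
std_male, std_female = (3, 7), (21, 9)

def percent_improved(groups):
    """Percentage improved across one or more (improved, not improved) groups."""
    improved = sum(g[0] for g in groups)
    total = sum(g[0] + g[1] for g in groups)
    return 100 * improved / total

print(percent_improved([new_male]), percent_improved([std_male]))       # 40.0 vs 30.0: new treatment better
print(percent_improved([new_female]), percent_improved([std_female]))   # 80.0 vs 70.0: new treatment better
print(percent_improved([new_male, new_female]),
      percent_improved([std_male, std_female]))                         # 50.0 vs 60.0: reversed when combined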