Vertebrate Zoology Statistics Assignment

Due Date: ______________________

The goals of this assignment are to:
1. Introduce or reinforce basic statistical and data analysis skills.
2. Apply critical thinking to the refined data to answer questions based on the evidence collected.

One of the goals of a scientist is to be able to answer questions with the greatest possible reliance on observable facts, and the least reliance on intuition. While intuition has great importance in finding the right questions to ask, and in finding ways of investigation, once the data is gathered the scientist should rely on the facts at hand. Patterns in the data may be revealed through good graphical analysis, and the patterns should then be tested with statistics to see if they are “real”, or simply the result of the scientist looking at the data and “seeing” a preconceived result. This is an example of bias; one shields oneself from bias by using commonly agreed upon statistical tests as impartial arbiters of what is “real”.

Sometimes the results are unambiguous. Every time you drop a penny it falls to the ground. No one needs statistical analysis to prove the existence of gravity. On the other hand, sometimes the penny lies heads up, sometimes heads down. Determining if this is a random event or something influenced by other factors may require the application of statistics; statistics are also useful to draw conclusions about a larger population by sampling a smaller portion of it. Results in biology are seldom so clear-cut as to eliminate the need for statistics.

There are several basic tests and graphical analyses that should be in every biologist’s “toolkit”. Among the graphing techniques are:
1. The scatterplot, which is used to look for correlation between two variables, or to track a variable over time.
2. The trendline, which is the superposition of a line drawn from a mathematical model over a scatterplot.
3. The histogram, which is used to look for patterns in abundance.

Any pattern that is revealed by the graphical analysis should be examined by statistical tests to see if the pattern is “real”. In most cases, this means determining if the pattern is different enough from what might be expected in a random world. For instance, flipping 51 heads out of 100 tosses would not be unexpected; flipping 80 heads out of 100 tosses, or flipping 20 heads in a row, would be unexpected and would suggest that something else is at work.

The statistical tests that will be of the most use to you in testing apparent patterns are:
1. The t-test, which is used to tell if two averages (the composite of many measurements) differ in a statistically significant way.
2. The correlation coefficient, which is used to test the statistical significance of a trendline.
3. The Chi-square test, which is used to determine if experimental results differ enough from expected results to suggest a “real” difference.
4. The ANOVA test, which is a kind of “super” t-test to tell if any of a group of mean values differs from the rest. If the results are positive, you then have to go back with multiple t-tests and see which mean or means are different.

In this exercise, you will calculate basic descriptive statistics. You will also generate a statistical summary, conduct t-tests and an ANOVA test.

D:\769817416.doc Page 1 of 9 Last printed 4/30/2017 1:13:00 AM

Biological Background (do some research to fill in the blanks):

Eastern Box Turtles, Terrapene _______________________, are a primarily terrestrial species in the mostly aquatic turtle family _______________________. In the northern portion of their range, these turtles must hibernate through the winter. Typically, they do this by burrowing into a bank or the forest floor, trying to get themselves below the frostline, if possible.
Hibernation is an essential part of the life cycle; during hibernation hormonal levels are “reset” and the breeding season follows after emergence from hibernation. Box turtles in captivity must be hibernated to maintain their health.

The box turtles at Marietta College were hibernated in a rooftop hibernaculum during the winter of 1998-1999. The hibernaculum was instrumented with a Vernier Software MLI (multiple lab interface) package of Direct-Connect Thermometer probes (DCTs) connected to a Gateway 486 computer (which in turn was linked to the college network). Data from two periods of the winter hibernation are found in the file Hiber.xls. Graphed, the data you will be examining looks like this:

[Figure: Time vs. Temperature. Three traces labeled A, B, and C; x-axis: Time (hours), 0 to 192; y-axis: Temperature (°C), -5 to 20.]

Among other things, you will be examining this data and determining which probe was in the hibernaculum, which probe was exposed to the outside air, and where the third probe was located. You will also be looking at the data and determining if any differences in the average temperatures recorded by the 3 probes are statistically significant.

Step 1: Researching background information.

Go to the library and find the following information:
1. The scientific name of the eastern box turtle, and the family it is placed in.
2. The range of the eastern box turtle (turn in a photocopy of the map).
3. Other ecological information, such as clutch size, longevity, body size, rate of growth, diet, predators, etc.

Step 2 – Descriptive Statistics

In the computer lab, you should go to Excel and open up the file Hiber.xls, which is located in the K:\Classes\Vertebrate directory (in the Bartlett lab). Save the file to the C: drive of the computer you are working on. Go to the sheet labeled Run 3.
The first rows of data should look something like this:

Time (Seconds)   Time (Hours)   Temperature 1   Temperature 2   Temperature 3
0                0              18.204          11.827          7.39
83.333           0.023148056    17.407          11.297          7.758
166.667          0.046296389    17.399          11.296          7.771
250              0.069444444    17.415          11.295          8.664
333.333          0.0925925      17.422          11.298          9.429
416.667          0.115740833    17.425          11.302          9.948
500              0.138888889    17.425          11.314          11.235
583.333          0.162036944    17.403          11.269          11.024
666.667          0.185185278    17.39           11.256          12.271
750              0.208333333    17.367          11.271          13.227
833.333          0.231481389    17.373          11.254          12.719

1. Select Tools:Data Analysis from the menu.
2. Select Descriptive Statistics from the list that is presented, and click OK.
3. The Descriptive Statistics Wizard will come up. Fill it out in a similar way to the one presented here:
4. In the Input Range enter the cells where your data can be found. You can click on the small box to the right to go to the spreadsheet and highlight your data. Only highlight the 3 temperature columns; do not highlight the time columns! Include the column headings. Note: there are 7,260 rows of data! You might find it easiest to click on the cell at the top left of the area you want to select, scroll to the bottom using the scrollbar at the right of the screen, and click on the bottom right cell while holding down the shift key.
5. Be sure to click the Labels in First Row box.
6. For the Output Range, select an area of your sheet with nothing to the right or below.
7. Click the summary statistics box.
8. Click OK.
9. If you get a message about overwriting data, click cancel and try again with a different output range.

Your results should look something like this (note: for demonstration purposes I highlighted the hours column and one of the temperature columns; you should not do descriptive statistics on the time columns).

There is a lot of data here; this isn’t a statistics course and we won’t go over it all.
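For readers who want to check the Excel output by another route, most of the same summary can be sketched in Python. This is only an illustration under stated assumptions, not part of the assignment: it uses the standard `statistics` module on the eleven Temperature 1 readings listed above rather than the full column, the helper name `describe` is made up for this sketch, and kurtosis and skewness are left out.

```python
import math
import statistics

# First eleven Temperature 1 readings from the Run 3 sheet (the rows listed above).
temperature_1 = [18.204, 17.407, 17.399, 17.415, 17.422, 17.425,
                 17.425, 17.403, 17.39, 17.367, 17.373]


def describe(values):
    """Return part of what Excel's Descriptive Statistics tool reports.

    `describe` is a hypothetical helper for this sketch; kurtosis and
    skewness are omitted.
    """
    count = len(values)
    std_dev = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {
        "Mean": statistics.mean(values),
        "Standard Error": std_dev / math.sqrt(count),
        "Median": statistics.median(values),
        "Mode": statistics.mode(values),
        "Standard Deviation": std_dev,
        "Sample Variance": statistics.variance(values),
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": count,
    }


if __name__ == "__main__":
    for name, value in describe(temperature_1).items():
        print(f"{name:20s}{value}")
```

Because only eleven rows are used, the numbers will not match the full-column results from Excel; the point is to show how each entry in the summary is obtained.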
                     Time (Hours)    Temperature 1
Mean                 84.00462963     16.39032387
Standard Error       0.569369161     0.013454032
Median               84.00462972     16.353
Mode                 #N/A            15.907
Standard Deviation   48.51011886     1.146280364
Sample Variance      2353.231632     1.313958673
Kurtosis             -1.2            -0.376123311
Skewness             -1.09689E-11    0.003505394
Range                168.0092592     4.969
Minimum              0               13.81
Maximum              168.0092592     18.779
Sum                  609789.6065     118977.361
Count                7259            7259

The mean is the average of all the temperatures. Median and mode are also measures of where the “center” of the data points lies. Standard error, sample variance, and standard deviation are all measurements of how close the data points are to each other. Kurtosis and skewness help determine if the population is distributed normally; minimum, maximum and range tell you the high and low points and how far apart they are; the sum is calculated by adding up all the data points, and the count is the number of data points. Divide the sum by the count and you get the mean, which is where we started.

Step 3 – ANOVA

[Figure: Time vs. Temperature. Three traces labeled A, B, and C; x-axis: Time (hours), 0 to 192; y-axis: Temperature (°C), -5 to 20.]

O.K., you’ve got descriptive statistics for all three of the probes. At this point, you should be able to put each of the mean values you just calculated together with the figure to the right and match up the means with one of the 3 lettered lines. With the means, you can answer the question, “which of these probes recorded the highest average temperature: A, B, or C?”

Of course, that question was pretty easy to answer even without doing the statistics. A more difficult question is: “are any (or all) of these means significantly different from each other?” Think of it this way: minor fluctuations, electrical glitches, software errors, etc.
could all lead to apparently random differences in temperature. Also, note that while probe C was usually below the temperature of probe B, it wasn’t always lower, and the fluctuations introduce uncertainty about where the mean really is. Looking at the data, we would guess that there is a statistical difference between the means, but we really should test to be sure.

[Figure: Daily Air Temperatures at 4 Different Points in an Office. Four traces labeled A, B, C, and D; x-axis: Time, alternating 12:00 PM and 12:00 AM; y-axis: Temperature (°C), 15 to 40.]

Other cases might not be as clear cut. What would you say about the data in the graph to the left, for instance? Fortunately, you won’t have to answer that question, at least not yet.

Let’s get back to the questions and data at hand. The first test we will run is the ANOVA test. The ANOVA test allows us to quickly test multiple samples to see if any of them are significantly different. If so, then we must run multiple t-tests to determine which means are different, because a t-test can only be run on two sets of data at a time.

To do the ANOVA:
1. Select Tools:Data Analysis from the menu.
2. Choose ANOVA: Single Factor.
3. Fill out the form as shown to the right.
4. Click OK.

The ANOVA table will be generated; a sample is located below. In the summary portion, the ANOVA table repeats some of the information of the descriptive statistics, such as the count, the mean, and the variance for each of the columns. The true ANOVA table comes next. The SS column refers to the sum of squares, which is basically the sum of the squared differences between (or within) the groups. The df refers to the degrees of freedom; with 3 groups there are 2 degrees of freedom between groups, and within a group the degrees of freedom are equal to the number of measurements minus 1.

Anova: Single Factor

SUMMARY
Groups           Count   Sum           Average       Variance
Time (Seconds)   7259    2195242583    302416.6667   30497881944
Time (Hours)     7259    609789.6065   84.00462963   2353.231632
Temperature 1    7259    118977.361    16.39032387   1.313958673

ANOVA
Source of Variation   SS            df      MS            F             P-value   F crit
Between Groups        4.42438E+14   2       2.21219E+14   21760.77494   0         2.996145554
Within Groups         2.21354E+14   21774   10165961433
Total                 6.63792E+14   21776

Don’t worry about the MS. Focus on the F value. If the F-value is larger than the F crit, then there is at least one pair of means with a significant difference. The P-value gives the chance of seeing a difference this large when the means are in fact the same; in other words, it is the chance of making a Type I mistake, where you assume the means are different when in fact they are the same (and random chance in sampling or measurement makes them appear different). In this example, the F-value is much greater than the F crit, so we reject the hypothesis that all 3 means are the same. Note that I ran the test on the two time values and one of the temperatures, so the low P-value shouldn’t be a surprise! At least one of the means is significantly different from one of the others. We will have to turn to t-tests to ferret (Mustela nigripes) out which.

Step 4 – t-test

The t-test allows us to narrow down which means are different, but in contrast to the ANOVA, the t-test is limited to testing 2 sets of data at a time. The t-test helps you answer the question “Are the means of these two data sets the same or not?” Or, to be more precise, the t-test allows you to reject the hypothesis that the two data sets have the same mean with a certain chance of making a mistake.

The possibility of making a mistake comes about because of the variation within natural populations. If you wanted to compare the heights of people in two different cities, you might watch 100 people pass through a doorway with the heights marked on it.
If, by chance, in one city you did your measurements while an elementary school went on a field trip, and in the other city you caught the athletes at the city basketball tournament, you would conclude (incorrectly) that the two cities had different average heights. To protect against making this type of mistake you set a benchmark, the alpha (α) value, at a strict level. If you set it at 5%, that means there is only a 5% chance that you might erroneously conclude that the means are different when in fact you just had bad luck in sampling.

It would be trivial to compare the time and temperature values. Of course they are different. Just for fun, I’ll do it here so you can see how the t-test works:

t-Test: Two-Sample Assuming Unequal Variances

                               Time (Hours)   Temperature 1
Mean                           84.00462963    16.39032387
Variance                       2353.231632    1.313958673
Observations                   7259           7259
Hypothesized Mean Difference   0
df                             7266
t Stat                         118.7198774
P(T<=t) one-tail               0
t Critical one-tail            1.645062184
P(T<=t) two-tail               0
t Critical two-tail            1.9602885

The t-test works by mathematically comparing the variances within the two samples with the difference in their means. The number that results from this is compared to a table of values computed for each possible alpha value. Of course, the computer doesn’t have a table to go to; the program generates the value on the fly. In Excel, you get a printout like the one above. The important numbers to look at are the t Stat, the P values, and the t Critical values. The t Stat is the number generated by the computer based on your data. The bigger it is, the greater the significance of the difference between the means. The P values tell you the chance of erroneously saying the means are different. The smaller the number the better; you want it at least to be smaller than your alpha value. The t critical numbers are from the table generated by the computer.
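The comparison of the difference in means with the variances can be made concrete with a short Python sketch. It is a hedged illustration only: it assumes the unequal-variance printout corresponds to Welch's t-test with the Welch-Satterthwaite approximation for the degrees of freedom, it works from the summary values in the printout above rather than the raw columns, and the helper name `welch_t` is made up for this sketch.

```python
import math


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's unequal-variance t statistic and approximate degrees of freedom.

    `welch_t` is a hypothetical helper that works from summary statistics
    (mean, sample variance, count) instead of the raw data columns.
    """
    se1 = var1 / n1  # squared standard error of the first mean
    se2 = var2 / n2  # squared standard error of the second mean
    t_stat = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t_stat, df


# Summary values copied from the printout above (Time (Hours) vs. Temperature 1).
t_stat, df = welch_t(84.00462963, 2353.231632, 7259,
                     16.39032387, 1.313958673, 7259)
print(f"t Stat = {t_stat:.4f}, df = {df:.1f}")
```

Run as written, the sketch should give a t Stat of about 118.72 and roughly 7266 degrees of freedom, in line with the printout. The larger the difference in means relative to the combined standard error, the larger the t Stat.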
If your t Stat is greater than the t critical value then you can assume that the means are different with a chance of being wrong due to unlucky sampling of less than the alpha value you selected. The P values give you the exact chance of making that type of mistake; in the example above it is reported as 0 (not much of a chance). In this case, we reject the hypothesis that the means are the same, and we’re pretty confident that the difference is real, not due to chance.

What about the 1 vs. 2 tails? To put it in a nutshell, use the 1-tail test when you can predict the direction of the difference between the means. If you have been feeding one group of mealworms twice as much as another group, you would expect the group being fed more to be heavier, and you would use a 1-tail test. On the other hand, if you were just comparing 2 populations of mealworms and knew nothing about their living conditions, you would have no way of knowing which population was eating better and therefore would be heavier. You would use the 2-tailed test. What would you do in this case?

In this study the means that you will be comparing will come from the 3 temperatures. That means that you will have to do several t-tests: one comparing Temperatures 1&2, then one comparing Temperatures 2&3, and finally a 3rd t-test comparing Temperatures 1&3.

To do a t-test:
1. Select Tools:Data Analysis from the menu.
2. Choose t-test: Two Sample Assuming Equal Variances (if the variances are equal, otherwise choose unequal variances).
3. Fill out the wizard as shown at the right. Your two columns of data (with labels) should be selected in the first 2 boxes.
4. The mean difference should be 0.
5. Check the labels box.
6. Set the Alpha at 0.05.
7. Set the output range to an open area on the worksheet.

Step 5 – Putting it all together.

All of this data and analysis are useless if you don’t do something with it.
The data and analysis are used to help you reach conclusions and to support your arguments as to why your conclusions are right. The data itself is useless unless it leads you to a conclusion.

Your assignment is to complete the next page (cut and paste into your own document), and to write a short paper to answer these questions:
1. Does the hibernaculum maintain a different temperature than the outside air?
2. Does the hibernaculum protect the turtles from freezing?

The text of your paper should only be a page or two; but since you will be pasting in tables from Excel, the number of pages might be longer. There should also be a paragraph (background) about box turtles (from your library research); this paragraph should be appropriately referenced. Each of the answers to the two questions should be backed with data and analysis as shown by material pasted in from Excel.

In summary you will be turning in:
- Answers to questions on the next page.
- A short paper with background on box turtles and analyzing the results. Include a bibliography.
- A photocopy of the distribution map (reference where it came from).

Note: The Excel file is too large to fit on a floppy. If you want to copy it and take it elsewhere, start with the file Hibersmall.xls, which has the graphs deleted, and copy only the Run3 worksheet to a new file. This will create a smaller file that should fit on a floppy.

Complete assignment 2 (page 9) only after you have received Assignment 1 back.

Other hints: The Excel file is very large. To minimize problems, keep as few programs open as possible. For instance, only open Word after you have done all of the work in Excel, and after you have pasted the material from Excel to Word, close Excel before continuing to format in Word.

Assignment 1

Name: ______________________________________________

[Figure: Time vs. Temperature]

1. One probe was located outside, and one was in the hibernaculum.
Where was the 3rd probe?
Your answer here

2. What was the location of each of the probes?
Probe A: Your answer here
Probe B: Your answer here
Probe C: Your answer here

3. While it appears that the time started at midnight, in actuality it did not. At what time of day did the recording start? Explain your reasoning.
Your answer here

4. How often was data recorded from the probes?
Your answer here

For the next 3 questions, paste in your answers from Excel. Make sure it is clear what results are being presented, i.e. that the labels are clear. You will also need to use or paste some of this information into your paper.

5. Paste in your descriptive statistics here:
Your answer here

6. Paste in your ANOVA table here:
Your answer here

7. Paste in your t-test results here:
Your answer here

Assignment 2

Name: ____________________________________________

Due Date: ______________________

Note: Do not begin this assignment until the first assignment has been returned.

To complete this assignment, you will use the data in the worksheet Hibernaculum-TidBit in the file Hiber.xls. There are 10,446 rows of data.

[Figure: Temperature Inside and Outside Hibernaculum - 1999. x-axis: Date, 2/15 to 3/15; y-axis: Temperature (°C), -10 to 20.]

For this assignment, you will compare data from TidBit data loggers (these probes are waterproof and were left inside and outside the hibernaculum unconnected to the computer). Their data cover a period of several weeks. The question you are trying to answer is this: Was the average temperature inside the hibernaculum greater or less than the temperature outside?

1. Paste the descriptive statistics for each of the two columns here.
Your answer here

2. Paste the t-test comparison here:
Your answer here

3. Write a paragraph or two answering the questions and otherwise interpreting the results.
Be sure to mention and discuss any differences in the variability of the temperatures at the two sites.
Your answer here