Vertebrate Zoology
Statistics Assignment
Due Date: ______________________
The goals of this assignment are to:
1. Introduce or reinforce basic statistical and data analysis skills.
2. Apply critical thinking to the refined data to answer questions based on the evidence collected.
One of the goals of a scientist is to be able to answer questions with the greatest possible reliance on
observable facts, and the least reliance on intuition. While intuition has great importance in finding the
right questions to ask, and in finding ways of investigation, once the data is gathered the scientist should
rely on the facts at hand. Patterns in the data may be revealed through good graphical analysis, and the
patterns should then be tested with statistics to see if they are “real” – or simply the result of the scientist
looking at the data and “seeing” a preconceived result. This is an example of bias; one shields oneself
from bias by using commonly agreed upon statistical tests as impartial arbiters of what is “real”.
Sometimes the results are unambiguous. Every time you drop a penny it falls to the ground. No one
needs statistical analysis to prove the existence of gravity. On the other hand, sometimes the penny lies
heads up, sometimes heads down. Determining if this is a random event or something influenced by
other factors may require the application of statistics; statistics are also useful to draw conclusions about
a larger population by sampling a smaller portion of it.
Results in biology are seldom so clear-cut as to eliminate the need for statistics. There are several basic
tests and graphical analyses that should be in every biologist’s “toolkit”. Among the graphing techniques
are:
1. The scatterplot, which is used to look for correlation between two variables, or to track a variable
over time.
2. The trendline, which is the superposition of a line drawn from a mathematical model over a
scatterplot.
3. The histogram, which is used to look for patterns in abundance.
Any pattern that is revealed by the graphical analysis should be examined by statistical tests to see if the
pattern is “real”. In most cases, this means determining if the pattern is different enough from what might
be expected in a random world. For instance, flipping 51 heads out of 100 tosses would not be
unexpected; flipping 80 heads out of 100 tosses, or flipping 20 heads in a row might be unexpected and
suggest that something else is at work. The statistical tests that will be of the most use to you in testing
apparent patterns are:
1. The t-test, which is used to tell if two averages (the composite of many measurements) differ in a
statistically significant way.
2. The correlation coefficient, which is used to test the statistical significance of a trendline.
3. The Chi-square test, which is used to determine if experimental results differ enough from
expected results to suggest a “real” difference.
4. The ANOVA test, which is kind of a “super” t-test to tell if any of a group of mean values differs
from the rest. If the results are positive, you then have to go back with multiple t-tests and see
which mean or means are different.
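To put rough numbers on the coin-flipping intuition above, here is a minimal Python sketch. Python is not part of this assignment (which uses Excel), and the function names binomial_tail and chi_square_stat are made up for illustration. It computes the exact chance of getting at least 51 or at least 80 heads in 100 tosses of a fair coin, the chance of 20 heads in 20 consecutive tosses, and the Chi-square statistic comparing the observed heads and tails to the expected 50/50 split. The value 3.841 is the standard 5% critical value for 1 degree of freedom.

```python
from math import comb


def binomial_tail(n, k, p=0.5):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def chi_square_stat(observed, expected):
    """Chi-square statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


p_51 = binomial_tail(100, 51)   # at least 51 heads in 100 tosses
p_80 = binomial_tail(100, 80)   # at least 80 heads in 100 tosses
p_run = 0.5 ** 20               # 20 heads in 20 consecutive tosses

chi_51 = chi_square_stat([51, 49], [50, 50])
chi_80 = chi_square_stat([80, 20], [50, 50])

CHI_CRIT = 3.841  # 5% critical value for 1 degree of freedom

print(f"P(at least 51 heads) = {p_51:.4f}")
print(f"P(at least 80 heads) = {p_80:.3e}")
print(f"P(20 heads in a row) = {p_run:.3e}")
print(f"Chi-square for 51/49 = {chi_51:.2f}, exceeds critical value: {chi_51 > CHI_CRIT}")
print(f"Chi-square for 80/20 = {chi_80:.2f}, exceeds critical value: {chi_80 > CHI_CRIT}")
```

The 51/49 result is well within what chance alone produces, while the 80/20 result is far beyond the critical value, which is the sense in which a pattern is called “real”.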
In this exercise, you will calculate basic descriptive statistics. You will also generate a statistical
summary, conduct t-tests and an ANOVA test.
D:\769817416.doc
Page 1 of 9
Last printed 4/30/2017 1:13:00 AM
Biological Background (do some research to fill in the blanks):
Eastern Box Turtles, Terrapene _______________________ are a primarily terrestrial species in
the mostly aquatic turtle family _______________________. In the northern portion of their range, these
turtles must hibernate through the winter. Typically, they do this by burrowing into a bank or the forest
floor, trying to get themselves below the frostline, if possible. Hibernation is an essential part of the life
cycle; during hibernation hormonal levels are “reset” and the breeding season follows after emergence
from hibernation.
Box turtles in captivity must be hibernated to maintain their health. The box turtles at Marietta
College were hibernated in a rooftop hibernaculum during the winter of 1998-1999. The hibernaculum
was instrumented with a Vernier Software MLI (multiple lab interface) package of Direct-Connect
Thermometer probes (DCT’s) connected to a Gateway 486 computer (which in turn was linked to the
college network). Data from two periods of the winter hibernation are found in the file Hiber.xls.
Graphed, the data you will be examining looks like this:
[Figure: Time vs. Temperature. Temperature (°C) plotted against Time (hours) for three probes,
labeled A, B, and C.]
Among other things, you will be examining this data and determining which probe was in the
hibernaculum, which probe was exposed to the outside air, and where the third probe was located.
You will also be looking at the data and determining if any differences in the average temperatures
recorded by the 3 probes are statistically significant.
Step 1: Researching background information.
Go to the library and find the following information:
1. The scientific name of the eastern box turtle, and the family it is placed in.
2. The range of the eastern box turtle – turn in a photocopy of the map.
3. Other ecological information, such as clutch size, longevity, body size, rate of growth, diet, predators,
etc.
Step 2 – Descriptive Statistics
In the computer lab, you should go to Excel and open up the file Hiber.xls, which is located in the
K:\Classes\Vertebrate directory (in the Bartlett lab). Save the file to the C: drive of the computer you are
working on. Go to the sheet labeled Run 3. The first rows of data should look something like this:
Time (Seconds)   Time (Hours)   Temperature 1   Temperature 2   Temperature 3
0                0              18.204          11.827          7.39
83.333           0.023148056    17.407          11.297          7.758
166.667          0.046296389    17.399          11.296          7.771
250              0.069444444    17.415          11.295          8.664
333.333          0.0925925      17.422          11.298          9.429
416.667          0.115740833    17.425          11.302          9.948
500              0.138888889    17.425          11.314          11.235
583.333          0.162036944    17.403          11.269          11.024
666.667          0.185185278    17.39           11.256          12.271
750              0.208333333    17.367          11.271          13.227
833.333          0.231481389    17.373          11.254          12.719
1. Select Tools:Data Analysis from the menu.
2. Select Descriptive Statistics from the list that is presented, and click OK.
3. The Descriptive Statistics Wizard will come up. Fill it out in a similar way to the one presented
here:
4. In the Input Range enter the cells where your data can be found. You can click on the small box to
the right to go to the spreadsheet and highlight your data. Only highlight the 3 temperature columns;
do not highlight the time columns! Include the column headings. Note – there are 7,260 rows of data!
You might find it easiest to click on the cell at the top left of the area you want to select, scroll to the
bottom using the scrollbar at the right of the screen, and click on the bottom right cell while holding
down the shift key.
5. Be sure to click the Labels in First Row box.
6. For the Output Range, select an area of your sheet with nothing to the right or below.
7. Click the summary statistics box.
8. Click OK. If you get a message about overwriting data, click cancel and try again with a different
output range.
9. Your results should look something like this: (note – for demonstration purposes I highlighted the
hours column and one of the temperature columns, you should not do descriptive statistics on the
time columns).
                      Time (Hours)     Temperature 1
Mean                  84.00462963      16.39032387
Standard Error        0.569369161      0.013454032
Median                84.00462972      16.353
Mode                  #N/A             15.907
Standard Deviation    48.51011886      1.146280364
Sample Variance       2353.231632      1.313958673
Kurtosis              -1.2             -0.376123311
Skewness              -1.09689E-11     0.003505394
Range                 168.0092592      4.969
Minimum               0                13.81
Maximum               168.0092592      18.779
Sum                   609789.6065      118977.361
Count                 7259             7259

There is a lot of data here; this isn’t a statistics course and we won’t go over it all. The mean is the
average of all the temperatures. Median and mode are also measures of where the “center” of the data
points lies. Standard error, sample variance, and standard deviation are all measurements of how close
the data points are to each other. Kurtosis and skewness help determine if the population is distributed
normally; minimum, maximum and range tell you the high and low points and how far apart they are; the
sum is calculated by adding up all the data points, and the count is the number of data points. Divide the
sum by the count and you get the mean, which is where we started.
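For readers who want to see what Excel is doing behind the wizard, here is a minimal sketch in Python (not part of the assignment) that computes most of the same summary values. It uses only the first 11 Temperature 1 readings from the Run 3 excerpt shown earlier, so the numbers will not match the full 7,259-row output. The helper name describe is made up for illustration, and the standard error is taken as the standard deviation divided by the square root of the count, which is consistent with the sample output.

```python
import statistics
from math import sqrt

# First 11 "Temperature 1" readings from the Run 3 excerpt (not the full data set).
temperature_1 = [18.204, 17.407, 17.399, 17.415, 17.422, 17.425,
                 17.425, 17.403, 17.39, 17.367, 17.373]


def describe(values):
    """Return a dictionary of basic descriptive statistics for a list of numbers."""
    count = len(values)
    total = sum(values)
    stdev = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return {
        "Mean": total / count,
        "Standard Error": stdev / sqrt(count),
        "Median": statistics.median(values),
        "Standard Deviation": stdev,
        "Sample Variance": statistics.variance(values),
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": total,
        "Count": count,
    }


summary = describe(temperature_1)
for name, value in summary.items():
    print(f"{name:<20}{value:.6g}")
```

Note that the mean is literally the sum divided by the count, and the sample variance is the square of the standard deviation.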
Step 3 – ANOVA
O.K. – You’ve got descriptive statistics for all three of the probes. At this point, you should be able to
put each of the mean values you just calculated together with the figure below and match up the means
with one of the 3 lettered lines. With the means, you can answer the question, “which of these probes
recorded the highest average temperature – A, B, or C?”
[Figure: Time vs. Temperature. Temperature (°C) plotted against Time (hours) for three probes,
labeled A, B, and C.]
Of course, that question was pretty easy to answer even without doing the statistics. A more difficult
question is: “are any (or all) of these means significantly different from each other?” Think of it this way –
minor fluctuations, electrical glitches, software errors, etc. could all lead to apparently random
differences in temperature. Also, note that while probe C was usually below the temperature of probe B, it
wasn’t always lower, and the fluctuations introduce uncertainty about where the mean really is. Looking
at the data, we would guess that there is a statistical difference between the means, but we really
should test to be sure.
Other cases might not be as clear cut. What would you say about the data in the graph below, for
instance? Fortunately, you won’t have to answer that question, at least not yet.
[Figure: Daily Air Temperatures at 4 Different Points in an Office. Temperature (°C) plotted against
Time for four series, labeled A, B, C, and D.]
Let’s get back to the questions and data at hand. The first test we will run is the ANOVA test. The
ANOVA test allows us to quickly test multiple samples to see if any of them are significantly
different. If so, then we must run multiple t-tests to determine which means are different – a t-test can
only be run on two sets of data at a time.
To do the ANOVA:
1. Select Tools:Data Analysis from the menu.
2. Choose ANOVA: Single Factor.
3. Fill out the form as shown to the right.
4. Click OK
The ANOVA table will be generated; a sample is located below. In the summary portion, the ANOVA table
repeats some of the information of the descriptive statistics, such as the count, the mean, and the
variance for each of the columns. The true ANOVA table comes next. The SS column refers to the sum of
squares, and is basically the squared difference between (or within) the groups. The df refers to the
degrees of freedom; with 3 groups there are 2 degrees of freedom, and within a group the degrees of
freedom are equal to the number of measurements minus 1 (summed over the 3 groups of 7,259
measurements in the sample table, that gives 3 × 7,258 = 21,774). Don’t worry about the MS. Focus on
the F value. If the F-value is larger than the F crit, then there is at least one pair of means with a
significant difference. The P-value gives the chance of making a Type I mistake, where you assume the
means are different when in fact they are the same (and random chance in sampling or measurement
makes them appear different).

Anova: Single Factor

SUMMARY
Groups            Count   Sum           Average       Variance
Time (Seconds)    7259    2195242583    302416.6667   30497881944
Time (Hours)      7259    609789.6065   84.00462963   2353.231632
Temperature 1     7259    118977.361    16.39032387   1.313958673

ANOVA
Source of Variation   SS            df      MS            F             P-value   F crit
Between Groups        4.42438E+14   2       2.21219E+14   21760.77494   0         2.996145554
Within Groups         2.21354E+14   21774   10165961433
Total                 6.63792E+14   21776

In this example, the F-value is much greater than the F crit, so we reject the hypothesis that all 3 means
are the same. Note that I ran the test on the two time values and one of the temperatures, so the low
P-value shouldn’t be a surprise! At least one of the means is significantly different from one of the
others. We will have to turn to t-tests to ferret (Mustela nigripes) out which.
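The arithmetic behind the SS, df, MS, and F columns can be sketched in a few lines of Python (again, not part of the assignment). The function name one_way_anova is made up for illustration, and the data are only the first 11 rows of the three temperature columns from the Run 3 excerpt, so the result is a demonstration rather than the answer for the full data set. The sketch does not compute F crit or the P-value, which require the F distribution; the last line simply checks that F in the sample table is MS between divided by MS within.

```python
def one_way_anova(groups):
    """Return (ss_between, df_between, ss_within, df_within, f_value)."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]

    # Between groups: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: how far each measurement sits from its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ss_between, df_between, ss_within, df_within, ms_between / ms_within


# First 11 rows of each temperature column from the Run 3 excerpt.
temp_1 = [18.204, 17.407, 17.399, 17.415, 17.422, 17.425,
          17.425, 17.403, 17.39, 17.367, 17.373]
temp_2 = [11.827, 11.297, 11.296, 11.295, 11.298, 11.302,
          11.314, 11.269, 11.256, 11.271, 11.254]
temp_3 = [7.39, 7.758, 7.771, 8.664, 9.429, 9.948,
          11.235, 11.024, 12.271, 13.227, 12.719]

ss_b, df_b, ss_w, df_w, f_value = one_way_anova([temp_1, temp_2, temp_3])
print(f"Between Groups: SS = {ss_b:.4f}, df = {df_b}")
print(f"Within Groups:  SS = {ss_w:.4f}, df = {df_w}")
print(f"F = {f_value:.4f}")

# Check against the sample table: F should equal MS between / MS within.
f_from_table = 2.21219e14 / 10165961433
print(f"F recomputed from the sample table's MS values = {f_from_table:.2f}")
```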
Step 4 – t-test.
The t-test allows us to narrow down which means are different, but in contrast to the ANOVA, the t-test is
limited to testing 2 sets of data at a time. The t-test helps you answer the question “Are the means of
these two data sets the same or not?” Or, to be more precise, the t-test allows you to reject the
hypothesis that the two data sets have the same mean with a certain chance of making a mistake. The
possibility of making a mistake comes about because of the variation within natural populations. If you
wanted to compare the heights of people in two different cities, you might watch 100 people pass through
a doorway with the heights marked on it. If, by chance, in one city you did your measurements while an
elementary school went on a field trip, and in the other city you caught the athletes at the city basketball
tournament, you would conclude (incorrectly) that the two cities had different average heights. To protect
against making this type of mistake you set a strict benchmark – the alpha (α) value. If you set it
at 5%, that means there is only a 5% chance that you might erroneously conclude that the means are
different when in fact you just had bad luck in sampling.
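One way to convince yourself of what a 5% alpha means is to simulate it. The sketch below (Python, not part of the assignment) repeatedly draws two samples of 100 “heights” from the same imaginary population, so any difference between the sample means is pure sampling luck, and counts how often a t statistic still exceeds the cutoff. Everything here is an assumption made for illustration: the population mean of 170 and standard deviation of 10, the number of trials, the random seed, and the use of 1.96 as an approximate two-tail cutoff for large samples.

```python
import random
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)


def t_statistic(a, b):
    """Difference in means divided by its estimated standard error."""
    return (mean(a) - mean(b)) / sqrt(sample_variance(a) / len(a)
                                      + sample_variance(b) / len(b))


random.seed(1)        # fixed seed so the run is repeatable
TRIALS = 2000         # number of simulated two-city comparisons
SAMPLE_SIZE = 100     # people measured per city
T_CRITICAL = 1.96     # approximate two-tail cutoff for alpha = 0.05, large samples

false_alarms = 0
for _ in range(TRIALS):
    # Both "cities" have the same true mean, so a significant result is a Type I mistake.
    city_a = [random.gauss(170, 10) for _ in range(SAMPLE_SIZE)]
    city_b = [random.gauss(170, 10) for _ in range(SAMPLE_SIZE)]
    if abs(t_statistic(city_a, city_b)) > T_CRITICAL:
        false_alarms += 1

false_alarm_rate = false_alarms / TRIALS
print(f"Fraction of comparisons wrongly called different: {false_alarm_rate:.3f}")
```

The fraction should come out close to 0.05: even when nothing is going on, roughly 1 comparison in 20 looks “significant” at this alpha.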
It would be trivial to compare the time and temperature values. Of course they are different. Just for fun,
I’ll do it here so you can see how the t-test works:
t-Test: Two-Sample Assuming Unequal Variances

                                Time (Hours)    Temperature 1
Mean                            84.00462963     16.39032387
Variance                        2353.231632     1.313958673
Observations                    7259            7259
Hypothesized Mean Difference    0
df                              7266
t Stat                          118.7198774
P(T<=t) one-tail                0
t Critical one-tail             1.645062184
P(T<=t) two-tail                0
t Critical two-tail             1.9602885
The t-test works by mathematically comparing the variances within the two samples with the difference in
their means. The number that results from this is compared to a table of values computed for each
possible alpha value. Of course, the computer doesn’t have a table to go to, the program generates the
value on the fly. In Excel, you get a printout like the one above. The important numbers to look at are the t
Stat, the P values, and the t Critical values. The t-stat is the number generated by the computer based
on your data. The bigger it is, the greater the significance of the difference between the means. The P
values tell you the chance of erroneously saying the means are different. The smaller the number the
better; you want it at least to be smaller than your alpha value. The t critical numbers are from the table
generated by the computer. If your t Stat is greater than the t critical value then you can assume that the
means are different with a chance of being wrong due to unlucky sampling of less than the alpha value
you selected. The P values give you the exact chance of making that type of mistake; in the example
above it is 0 (not much of a chance). In this case, we reject the hypothesis that the means are the same,
and we’re pretty confident that the difference is real, not due to chance.
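The t Stat and df in the sample printout can be reproduced from just the means, variances, and counts. Here is a minimal Python sketch (not part of the assignment) using the usual formulas for the unequal-variances (Welch) t-test; that Excel uses exactly these formulas is an assumption, although the numbers agree with the printout. The function name welch_from_summary is made up for illustration.

```python
from math import sqrt


def welch_from_summary(mean1, var1, n1, mean2, var2, n2):
    """Return (t statistic, degrees of freedom) for an unequal-variances t-test."""
    se1 = var1 / n1   # squared standard error of the first mean
    se2 = var2 / n2   # squared standard error of the second mean
    t_stat = (mean1 - mean2) / sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t_stat, df


# Summary values copied from the sample printout (Time (Hours) vs. Temperature 1).
t_stat, df = welch_from_summary(84.00462963, 2353.231632, 7259,
                                16.39032387, 1.313958673, 7259)
print(f"t Stat = {t_stat:.4f}")
print(f"df     = {df:.1f}")
```

The t statistic is the difference in the means measured in units of its own uncertainty, which is why a large difference combined with small variances gives such a large value here.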
What about the 1 vs. 2 tails? To put it in a nutshell, use the 1 tail test when you can predict the direction
of the difference between the means. If you have been feeding one group of mealworms twice as much
as another group, you would expect the better-fed group to be heavier, and you would use a 1-tail test.
On the other hand, if you were just comparing 2 populations of mealworms and knew nothing about their
living conditions, you would have no way of knowing which population was eating better and therefore
would be heavier. You would use the 2-tailed test. What would you do in this case?
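To see how the one-tail and two-tail numbers relate, here is a short Python sketch (not part of the assignment). It relies on an assumption worth stating loudly: with thousands of degrees of freedom the t distribution is nearly identical to the standard normal distribution, so the normal distribution from Python's statistics module is used as a stand-in. With small samples this shortcut would not be appropriate. The function name approximate_p_values is made up for illustration.

```python
from statistics import NormalDist

ALPHA = 0.05
standard_normal = NormalDist()  # mean 0, standard deviation 1

# One-tail test: all of alpha sits in one tail. Two-tail test: alpha is split in half.
one_tail_critical = standard_normal.inv_cdf(1 - ALPHA)
two_tail_critical = standard_normal.inv_cdf(1 - ALPHA / 2)


def approximate_p_values(t_stat):
    """Return (one-tail P, two-tail P) using the large-sample normal approximation."""
    upper_tail = 1 - standard_normal.cdf(abs(t_stat))
    return upper_tail, 2 * upper_tail


print(f"Approximate critical value, one-tail: {one_tail_critical:.4f}")
print(f"Approximate critical value, two-tail: {two_tail_critical:.4f}")

for t_stat in (1.8, 118.7198774):
    p_one, p_two = approximate_p_values(t_stat)
    print(f"t = {t_stat}: one-tail P = {p_one:.4f}, two-tail P = {p_two:.4f}")
```

The two critical values land very close to the t Critical values in the sample printout. Notice that a t statistic of 1.8 would pass the one-tail cutoff but not the two-tail one, which is why the choice of test should be made before looking at the results.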
In this study the means that you will be comparing will come from the 3 temperatures. That means that
you will have to do several t-tests, one comparing Temperatures 1&2, then comparing Temperatures 2&3,
and finally, a 3rd t-test comparing temperatures 1&3.
To do a t-test:
1. Select Tools:Data Analysis from the menu.
2. Choose t-test: Two Sample Assuming Equal Variances (if the variances are equal,
otherwise choose unequal variances).
3. Fill out the wizard as shown at the right. Your two columns of data (with labels)
should be selected in the first 2 boxes.
4. The mean difference should be 0.
5. Check the labels box.
6. Set the Alpha at 0.05.
7. Set the output range to an open area on the worksheet.
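The bookkeeping for the three pairwise comparisons can be sketched in Python (not part of the assignment). The sketch lists every pair of columns, prints each column's variance so you can judge whether the equal-variances option is reasonable, and computes an unequal-variances t statistic for each pair. It uses only the first 11 rows from the Run 3 excerpt, so the values are a demonstration and not the results you will paste in from Excel. The name welch_t is made up for illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

# First 11 rows of each temperature column from the Run 3 excerpt.
columns = {
    "Temperature 1": [18.204, 17.407, 17.399, 17.415, 17.422, 17.425,
                      17.425, 17.403, 17.39, 17.367, 17.373],
    "Temperature 2": [11.827, 11.297, 11.296, 11.295, 11.298, 11.302,
                      11.314, 11.269, 11.256, 11.271, 11.254],
    "Temperature 3": [7.39, 7.758, 7.771, 8.664, 9.429, 9.948,
                      11.235, 11.024, 12.271, 13.227, 12.719],
}


def welch_t(a, b):
    """t statistic for two samples without assuming equal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


for name, values in columns.items():
    print(f"{name}: variance = {variance(values):.4f}")

pairs = list(combinations(columns, 2))  # every pair of column names, 3 in total
results = {}
for first, second in pairs:
    results[(first, second)] = welch_t(columns[first], columns[second])
    print(f"{first} vs. {second}: t = {results[(first, second)]:.3f}")
```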
Step 5 – Putting it all together.
All of this data and analysis are useless if you don’t do something with it. The data and analysis are used
to help you reach conclusions and to support your arguments as to why your conclusions are right. The
data itself is useless unless it leads you to a conclusion.
Your assignment is to complete the next page (cut and paste into your own document), and to write a
short paper to answer these questions:
1. Does the hibernaculum maintain a different temperature than the outside air?
2. Does the hibernaculum protect the turtles from freezing?
The text of your paper should only be a page or two; but since you will be pasting in tables from Excel,
the number of pages might be longer. There should also be a paragraph (background) about box turtles
(from your library research); this paragraph should be appropriately referenced. Each of the answers to
the two questions should be backed with data and analysis as shown by material pasted in from Excel.
In summary you will be turning in:
Answers to questions on the next page.
A short paper with background on box turtles and analyzing the results. Include a bibliography.
A photocopy of the distribution map – reference where it came from.
Note: The Excel file is too large to fit on a floppy. If you want to copy it and take it elsewhere, start with
the file Hibersmall.xls, which has the graphs deleted, and copy only the Run3 worksheet to a new file.
This will create a smaller file that should fit on a floppy.
Complete assignment 2 (page 9) only after you have received Assignment 1 back.
Other hints: The Excel file is very large. To minimize problems, keep as few programs open as possible.
For instance, only open Word after you have done all of the work in Excel, and after you have pasted the
material from Excel to Word, close Excel before continuing to format in Word.
Assignment 1
Name: ______________________________________________
[Figure: Time vs. Temperature. Temperature (°C) plotted against Time (hours) for three probes,
labeled A, B, and C.]
1. One probe was located outside, and one was in the hibernaculum. Where was the 3rd probe?
Your answer here
2. What was the location of each of the probes?
Probe A:
Your answer here
Probe B:
Your answer here
Probe C:
Your answer here
3. While it appears that the time started at midnight, in actuality it did not. At what time of day did the
recording start? Explain your reasoning.
Your answer here
4. How often was data recorded from the probes?
Your answer here
For the next 3 questions, paste in your answers from Excel. Make sure it is clear what results are being
presented, i.e. that the labels are clear. You will also need to use or paste some of this information into
your paper.
5. Paste in your descriptive statistics here:
Your answer here
6. Paste in your ANOVA table here:
Your answer here
7. Paste in your t-test results here:
Your answer here
Assignment 2
Name: ___________ _________________________________ Due Date: ______________________
Note: Do not begin this assignment until the first assignment has been returned.
To complete this assignment, you will use the data in the worksheet Hibernaculum-TidBit in the file
Hiber.xls. There are 10,446 rows of data.
For this assignment, you will compare data from TidBit data loggers (these probes are waterproof and
were left inside and outside the hibernaculum unconnected to the computer). Their data covers a
several week period. The question you are trying to answer is this: Was the average temperature
inside the hibernaculum greater or less than the temperature outside?
[Figure: Temperature Inside and Outside Hibernaculum - 1999. Temperature (°C) plotted against Date,
2/15 through 3/15.]
1. Paste the descriptive statistics for each of the two columns here.
Your answer here
2. Paste the t-test comparison here:
Your answer here
3. Write a paragraph or two answering the questions and otherwise interpreting the results. Be sure to
mention and discuss any differences in the variability of the temperatures at the two sites.
Your answer here