Download simpstats

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Student's t-test wikipedia , lookup

Transcript
Simple Statistical Analyses
There are very few simple Yes or No questions in biology. Most biological questions are
answered by gathering quantitative data (rainfall rates, Potassium influx rates, area of an
animal’s home range, etc.) and making INFERENCES from this information. When you begin
working with quantitative data you come in contact with two sources of potential errors:
1) BIOLOGICAL VARIABILITY (no two individual animals are exactly alike) and
2) RANDOM EVENTS (or chance occurrences).
How do you determine if our quantitative measurements are accurately describing the “real”
biological situation you are studying? The basic theme of this laboratory is the use of several
simple statistical tests to first, extract worthwhile information about a POPULATION when you
only have data from a small portion (SAMPLE) of the entire population; and secondly, to
determine whether you can say with some degree of confidence that two samples come from
different (or similar) populations. This question is very frequently encountered in many
branches of science.
A POPULATION is considered here to be some set of items, which are all similar and which
reside within a definable boundary. It’s up to the investigator to decide what “population” she is
going to work with. Examples of defined populations include, all residents of Kāne‘ohe between
the ages of 30 and 60; all pineapple plants in a 5 acre plots the heights of all trees on a particular
site; the monthly rainfall values at UH over the last 10 years, etc. Note that individuals or
measurements on individuals may compose a population. Since it is usually impractical, or even
impossible, to measure all of the items under study in a population, (i.e. a CENSUS) you
generally try to describe the characteristics of a population by taking measurements of a portion
of the population. The portion is our SAMPLE. The methods used to obtain unbiased samples
will be extensively dealt with below, but for the moment you need not worry about how you get
your sample, but as you are sampling in this exercise think of the possible sources of bias. Once
the sample is obtained, you will estimate the value of PARAMETERS, which describe the
population and assign some CONFIDENCE LEVEL. to these descriptions. Then you will
compare our population estimates to determine if they are different.
II.
Making the Measurements
In the lab you will be provided with several bags containing Koa Haole (Leucaena
leucocephala (Lam. de Wit) seedpods. Two of the bags of seedpods were collected from one
tree while the third bag of seedpods came from a different tree. The goal of today’s exercise is to
1) determine which two bags came from the same tree and which bag came from a
different tree.
To do this you will make measurements and observations of the seedpods and analyze
the data you collect..
1
BUT there are far too many seedpods in each bag for you to measure them all (i.e. do a
census) so you will have to work with only some seedpods from each bag – which ones??
This will be the second goal of today’s exercise
2) How can you obtain an unbiased sample of seedpods to measure from the population
of seedpods in each bag?
Each bag will be coded with a letter and you will collect and analyze samples from all
three bags. Your first chore will be to determine how to get an unbiased sample from the bag.
This is not an easy task. Your group should discuss this among yourselves and try to
1) think about possible sources of bias in getting the sample
2) devise ways to avoid or reduce each of these sources of bias.
Your next chore will be to determine what the sample size should be. Obviously the total
number of seedpods in the bag is too large for you to be able to measure them all (also remember
the contents of each bag are themselves a sample [of the tree]). On the other hand too small a
sample may not give you a good estimate of the population parameters. Can you devise an
objective mechanism to tell you what an appropriate sample size should be? What information
would you need to know before you can answer that question?
Before you begin, check with the TA or teacher to go over your sampling protocol.
There are many traits you could determine for the seedpods including chemical and
physical factors. For the purposes of this exercise you will limit the traits measured to two, one
continuous (seedpod length) and one discrete, number of seeds per pod. Measure the length of
each pod using metric values (CM) and also record the number of seeds in that pod. Record all
pertinent data in your lab notebook.
III.
Sample Population Statistics
It is always important to get a “feel” for your data before you spend much time in more
complex analyses. Simply looking at the data (as in a list of measurements) isn’t likely to be
much help so graphic visualizations are used. A very basic graphic visualization type is the
FREQUENCY HISTOGRAM. To make a frequency histogram you need to group the
measurement data into discrete intervals. This is more straightforward for discrete variables
(such as number of seeds per pod) but may require some thinking for continuous variables (such
as pod length). The choice of interval is up to you but there is usually some optimum number of
intervals that maximizes the visual information available. This, of course, may be different in
different circumstances. But remember if you want to make comparisons between data sets you
may want to use the same X axis for all of them. Tabulate your data values in the interval
categories to get that the numbers of measurements in each category. These are then plotted as
histogram with the interval along the horizontal (X) axis and the frequency along the vertical
(Y) axis.
2
For many biological samples, the resulting frequency distribution is often “bell shaped”.
This bell shaped curve is informally referred to as a NORMAL DISTRIBUTION (The real
definition of a normal distribution is given by the relationship between two parameters [which
we will discuss below]). Why should most measurements on biological material be distributed
in such a manner, with most of the measurements clustering about an “average” value, but with a
few extreme values on either side? The answer lies partly in the genetic variability, which is
inherent in all biological material. No two individuals are genetically exactly alike. While
closely related individuals have many of the same genes in common, slight genetic differences
do exist. Thus, if the fur color of rabbits has a genetic basis, then a good camouflage color like
brown will be the most common color in the population. However, a few rabbits will carry genes
for both lighter and darker coat color, especially if fur color is under the control of several genes.
But you also see normal distributions in situations where genetics – or even biology – play no
part so there is more (lots more) to this tendency for values to be distributed this way.
The nice thing about the normal distribution is that it is symmetrical, so that any
particular normal distribution can be completely specified (or described) if you know just two
parameters, the central value the MEAN (or average); and the distance from the mean to the
point of inflection (the STANDARD DEVIATION). You are vary familiar with the concept of
a mean or an average. The standard deviation is basically a measure of how broadly the values
are scattered around the mean.
Statistical analyses dealing with normal (also called parametric) distributions rely heavily
on these two parameters mean and standard deviation (or its square the variance). The majority
of this exercise will be devoted to the methods of calculating these statistics and seeing what can
be done with them. But you must always remember that parametric statistics should only be
used when the distribution of data is known to be normal. (See Ch.6 in Sokal & Rolf for a more
extensive discussion of the normal distribution – Ch. 4 & 5 discuss other useful descriptive
statistics and distributions.
IV.
V.
Calculation of sample Mean and Standard Deviation:
The distribution of sample values is normal, the mean and standard deviation may be
calculated from the sample measurements by following the procedures given below.
1. The mean
X̄ =  Xi /n
where
X̄
Xi
n

is the symbol for the mean
is the ith individual measurement
is the total number of measurements (the sample size)
is a symbol, which means “the sum of all measurements
from i = 1 to I = n”
3
2. The standard deviation
S=
{ (Xi - X̄)2 /
(n-1)}
or a similar formula obtained through algebra, which is more convenient when using
calculating machines,
S=
{ Xi2 –[(Y1)2/n] /
(n-1)
where s is the symbol for the standard deviation and the rest of the notation is the same
as given for the mean.
Many pocket calculators are programmed to calculate the mean and standard deviation
directly from your data but you should study these formulae to get an understanding of the
meaning of these statistics. Other statistics (e.g. mode, median, range, etc.) also have their uses
but you will use them less in this lab.
VI.
Comparing Two Populations
In almost all cases where you compare data obtained from two samples they will be
different, even if they are from the same population. Imagine you flip a coin 10 times and keep a
record of the results (e.g. H T T H T etc.) its unlikely (and more and more unlikely as you flip
the coin more and more times) that you will get the same sequence of heads and tails. BUT what
about the mean number of heads? If it is an honest coin you expect that on average you will get
about 50% heads in a sample of appropriate size (and exactly 50% heads in a sample of infinite
size). But in two sets of 10 coin tosses are you going to always get 5 heads and 5 tails even if it is
an honest coin with no bias in the tossing? There are statistical tests to help you determine
whether two samples that are somewhat different may in fact have come from the same
population. An assumption of the tests you will use in this exercise is that the underlying
distributions are normal. These parametric tests are based on asking the question
“What are the chances that two unbiased samples taken from the same population will
differ by the amount that my two samples differ?”
Note there are actually two populations under consideration here the real world
population that you took your samples from and an ideal population with certain known features.
The tables you will use (or can find in any statistics book) are based (at least theoretically) on
drawing many samples of a given size (n) from this ideal population and determining how much
difference is found. That is (in theory), the table-maker takes thousands of samples of size 2,
4
thousands of size three, thousands of size four, and so on and so on, and gets the distribution of
the parameters (say mean and standard deviation) of each set of 1000 samples. Since the tablemaker knows that the samples were drawn from the same population the differences she finds are
those expected when two (or more) samples are drawn from what is really the same source
population. As you would expect most of the pairwise differences are small (e.g. for most pairs
of means the differences are not very big) but out of 1000 samples you would be sure to find a
couple of means that were pretty different by chance alone. The tables you will use are
constructed such that for two samples (so far you are talking about pairwise tests, but all this
stuff can be generalized to more than two samples) of a specific size (n), drawn from the same
population, you can look up the chances (probability) that the differences between your two
samples from the real world are bigger than you would expect if they had been drawn from the
same population. So to enter the table you need to have some statistic (a single number) that
compares your two populations (in this exercise you will use one ratio and one difference) the
sample size [or actually the degree of freedom (in our case df = n-1)] and the level of probability
that will make you happy. What does this happiness depend on? You’ll think back to the ideal
population and the thousands of samples (or pairs of samples with their comparative statistic). As
you saw most sample parameters will be similar (so the ratio will be close to 1 if you are looking
at ratios, or the difference will be close to zero if you are looking at differences), but in a few
cases the numbers will be pretty big. In 1000 pairs of samples there will be a few with a really
big difference just by chance. With a 1000 sample pairs you can count how many are greater
than some value (what you will come to call the significance level – though significant to who or
what is never really clear). You will find that in 1000 samples only100 have a difference (or
ratio) greater than some value. So then you can say: “In only 10% (100 out of 1000) of cases of
pairs of samples drawn from the same population will the differences (or ratios) be bigger than
this”. Of course you can do this for 5% 1% .01% or whatever. {Don’t worry about the poor table
maker all this is done by algorithms on computers today}
So the end result of all this is that you can decide on some chance of being wrong (that is
saying the difference (or ratio) you found is due to chance when it actually isn’t) that you are
willing to accept then set that value as your level of rejection of the null hypothesis (that there is
really no difference). To get this measure of the difference between the estimates of the
population measurements for two samples in this exercise, you will perform two statistical tests.
This will allow you to objectively state (with a certain degree of confidence, say 95% sure) that
you think that your two samples are from one population or two.
The first test indicates whether the difference in variability among measurements in
your two samples (as measured by the standard deviation – or actually the variance) are similar.
If they are not (i.e., there is a different amount of variability in the two samples – the shapes of
the two distributions differ) then you are reasonably sure that you have samples from two
populations (Why?) and need not carry out the second part of the test.
The sample standard deviations are compared arbitrarily by assigning the larger
standard deviation to s1 and the smaller to s2. The F value (which is always equal to or
greater than 1.0) is calculated as [Note here that you are actually using the variance which
is defined as the standard deviation squared]
F = s21 / s22
5
The next thing you need to know is the degree of freedom for each sample. These
are calculated as
df1 = n1 – 1
df2 = n2 – 1
You use these values (which are really correction factors which compensate for the fact
that you are using samples rather than the whole population) in Table 2 to obtain the
listed F value. If the F value (the ratio of the two variances) which you calculated, is
greater than the value listed in Table 2, you are 95% confidant that the two samples are
from different populations (if you use the 95% table of course). If this is found to be the
case with your comparison you need not make any further tests and can state you are 95%
confident your samples came from two different populations. Think about this means –
If your samples have a normal distribution and their standard deviations are so different
that you get a large F value (for your sample sizes). You may state with a one in twenty
chance of being wrong that the two samples come from different populations. If the 95%
confidence level is not high enough for you, there is F table available for 99% and even
99.9% confidence!
A calculated F value, which is less, than that taken from the F table indicates there
is no evidence that the samples are from different populations based on the shapes of the
distributions. Note – a small F value does NOT mean the samples ARE from the same
population, only that the shape of the frequency distributions of the samples do not
permit you to feel confident that they are different. Look carefully at the F table noting
how the values change with sample size and with the selected confidence level.
When the F test does not suggest the two samples are from different populations
you will then test the difference between the means. To test whether two sample means
are estimators of the same population, you perform a t-test. The calculation is
t =
where
and
sp
where
s1 2
Y1 – Y2
sp * ( 1/n1 + 1/n2)1/2
Y1
is the mean of sample 1 (ditto for sample 2)
n1
is the size of sample 1 (ditto for sample 2)
=
({(df1 * s12) + (df2 * s22)} / (df1 + df2) )1/2
is the square of the standard deviation and is
referred to as the variance of the values.
Note that all the t value is is the difference between your two sample means, corrected
for variability in the samples and the sample size. Look in the t-table under the degrees of
freedom given by df1 + df2. If the t value which you calculated is greater (i.e. there is a big
difference between the two means) than the value listed in Table 1, you can be 95% confident
that the means of the two samples are significantly different. If your calculated t value is smaller
than the value from the t-table, you have no evidence that the samples are significantly different.
If your calculated t values are smaller than the value from the t-table, you have no evidence that
the samples come from different populations. You could still be wrong (in fact you will be
wrong 1 time out of every 20 tries if you use the 95% confidence level). This is called a type II
6
error you accept the null hypothesis (of no difference) when in fact they samples come from
different populations. A type I error is the opposite – you incorrectly reject the null hypothesis
when in fact the two samples did come from the same population. How will selection of a
different confidence level effect the chances of a type I or a type II error? THINK about this.
7
VI.
The chi-square test
If you are working with counts (discrete measures) rather than continuous measurements
(e.g., length, weight, etc.), you can use the Chi-square test to test for deviations from
some EXPECTED result. For example, modern genetic theory tells us that the sex ratio of
most animal populations should be 50% males and 50% females. In a sample of finite
size (such as the students in this class) a departure from this 50:50 ratio is not surprising.
If the sample is big enough and the departure from 50:50 is great enough then you might
suspect something is skewing the sex ratio in the population you sampled. A significant
departure from this 50:50 ratio (your expected result) when you sample a population
indicates that you need to look into the reasons behind such a large deviation from the
expected result. Possible explanations for differing sex ratio include high mortality upon
one sex, sex reversal, or an anomalous genetic system.
VII.
Assignments
There will be three bags of seedpods, labeled A, B, C. Two bags will contain seedpods
from the same population (a single tree in this case), and the other will have pods from a
different population (tree).
1. Divide up into groups of two – three persons each.
2. Take one sample from each of the three boxes.
There are two questions you must ask before you begin (These two questions will be
central to the rest of your semester’s work in this course!)
a) How big a sample will you need to take to be able to have some assurance that your data
will be sufficient? You have seen how the sample size (actually the degree of freedom)
determines the outcome of your statistical tests so how big a sample will you need to be able
to test your data?
b) How do you obtain your sample? That is, assuming you will not use all the seedpods in the
bag, how will you choose which seed pods you will measure?
Do not begin sampling, or measuring until you have
worked out your sample size and your sampling
technique!!
3. Measure the lengths of each pod and also record the number of seeds per pod, for each of the
three samples.
4. Plot the frequency histogram for each sample.
5. Compare all three samples using the F-test.
6. Where appropriate do the t-test.
Which of the three samples is from a different population? How do you know? Are there
any biological clues, which also aid in your decision?
7. You may be given a data set for which you can try out the Chi square test. If so, calculate 2
values for that data set
8
8. From the current ecological literature obtain data suitable for 2 analysis where the author
hasn’t done so. Perform the analysis and discuss your results.
9.
Hirth (Ecological Monographs 33:83-112, 1963) studied two species of lizards inhabiting a
beach in Costa Rica. Each month for four months he captured and sexed all the juveniles and
adults of each species. The data for one species is listed below.
The procedure for using the Chi-square test for the June adult sample is as follows:
Chi-square
(x2) =
(observed – expected)2
expected
The expected values for both males and females is 38 + 58 = 48
2
2
2
Therefore
x
=
(38 – 48)
+
(58 – 48)2
=
4.17
48
48
Compare this value with that given in Table 3 for 1 degree of freedom. The value 4.17 is larger
than 3.814, so you can be 95% confident that the June samples deviates significantly from the
expected 50:50 sex ratio. (The degrees of freedom for the X2 test are calculated as the number of
groups minus 1. You had two groups, male and female, so our df is 2 – 1 = 1).
Calculate the X2 value for each of the months and also for the total, for both juveniles and adult
data separately. Which are significantly different? Does the statistical test tell you why they are
different?
Another example of the use of the Chi-square test is shown in Krebs on page 375-379.
9