Download Biostatistics - A Taste of Chemistry

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Statistics wikipedia , lookup

Transcript
AS statistics page 1
Statistics for AS Biology
Descriptive Statistics
mean
number of times each value
occurs
Repeated measurements in biology are
rarely identical, due to random errors
and natural variation. If enough
measurements are repeated they can be
plotted on a histogram, like the one on
the right. This usually shows a normal
distribution, with most of the repeats
close to some central value. Many
biological phenomena follow this
pattern: eg. peoples' heights, number of
peas in a pod, the breathing rate of
insects, etc.
95% CI
95% CI
normal
distribution
curve
values
The central value of the normal distribution
curve is the mean (also known as the
arithmetic mean or average). But how
reliable is this mean? If the data are all close
together, then the mean is probably good,
but if they are scattered widely, then the
calculated mean may not be very reliable.
The width of the normal distribution curve is
given by the standard deviation (SD), and the larger the SD, the less reliable the data. For comparing
different sets of data, a better measure is the 95% confidence interval (CI). This is derived from the SD, and
is the range above and below the mean within which 95% of the repeated measurements lie (marked on the
histogram above). You can be pretty confident that the real mean lies somewhere in this range. Whenever
you calculate a mean you should also calculate a confidence limit to indicate the quality of your data.
small confidence limit,
low variability,
data close together,
mean is reliable
large confidence limit,
high variability,
data scattered,
mean is unreliable
In Excel the mean is calculated using the formula =AVERAGE (range) , the SD is calculated using
=STDEV (range) , and the 95% CI is calculated using =CONFIDENCE (0.05, STDEV(range), COUNT(range)) .
This spreadsheet shows two sets of data
with the same mean. In group A the
confidence interval is small compared to
the mean, so the data are reliable and you
can be confident that the real mean is
close to your calculated mean. But in
group B the confidence interval is large
compared to the mean, so the data are
unreliable, as the real mean could be quite
far away from your calculated mean. Note
that Excel will always return the results of
a calculation to about 8 decimal places of
precision. This is meaningless, and cells
with calculated results should always be formatted to a more sensible precision (Format menu > Cells >
Number tab > Number).
HGS Biology
NCM/11/01
AS statistics page 2
Plotting Data
Once you have collected data you will want to plot a graph or chart to show trends or relationships clearly.
With a little effort, Excel produces very nice charts. First enter the data you want to plot into two columns (or
rows) and select them.
Drawing the Graph. Click on the chart wizard
. This has four steps:
1. Graph Type. For a bar graph choose Column and for a scatter graph (also known as a line graph) choose
XY(Scatter) then press Next. Do not choose Line.
2. Source Data. If the sample graph looks OK, just hit Next. If it looks wrong you can correct it by clicking
on the Series tab, then the red arrow in the X Values box, then highlight the cells containing the X data on
the spreadsheet. Repeat for the Y Values box.
3. Chart Options. You can do these now or change them later, but you should at least enter suitable titles for
the graph and the axes and probably turn off the gridlines and legend.
4. Graph Location. Just hit Finish. This puts the chart beside the data so you can see both.
Changing the Graph. Once you have drawn the graph, you can now change any aspect of it by doubleclicking (or sometimes right-clicking) on the part you want to change. For example you can:
 move and re-shape the graph
 change the background colour (white is usually best!)
 change the shape and size of the markers (dots)
 change the axes scales and tick marks
 add a trend line or error bars (see below)
Lines. To draw a straight "line of best fit" right click on a point, select Add Trendline, and choose linear. In
the option tab you can force it to go through the origin if you think it should, and you can even have it print
the line equation if you are interested in the slope or intercept of the trend line. If instead you want to "join
the dots" (and you don't often) double-click on a point and set line to automatic.
Error bars. These are used to show the confidence intervals on the graph. You must already have entered the
95% confidence limits on the spreadsheet beside the X and Y data columns. Then double-click on the points
on the graph to get the Format Data Series dialog box and choose the Y Error Bars tab. Click on the red
arrow in the Custom + box, and highlight the range of cells containing your confidence limits. Repeat for the
Custom - box.
HGS Biology
NCM/11/01
AS statistics page 3
Problems
1.
Here are the results of an investigation into the rate of photosynthesis in the pond weed Elodea. The
number of bubbles given off in one minute was counted under different light intensities, and each
measurement was repeated 5 times. Use Excel to calculate the means and 95% confidence limits of these
results, then plot a graph of the mean results with error bars and a line of best fit.
light
intensity
(Lux)
0
500
1000
2000
3500
5000
HGS Biology
repeat 1
repeat 2
repeat 3
repeat 4
repeat 5
5
12
7
42
45
65
2
4
20
25
40
54
0
5
18
31
36
72
2
8
14
14
50
58
1
7
24
38
28
36
NCM/11/01
A2 statistics page 1
Statistics for A2 Biology
There is a bewildering variety of statistical tests available, and it is important to choose the right one. This
flow chart will help you to decide which statistical test to use, and the tests are described in detail on the
next 5 pages.
normal
data
Testing for a
correlation
non-normal Spearman correlation coefficient
data
=CORREL (range 1, range 2)
on ranks of data
0=no correlation/ 1=perfect correlation
Plot
scatter
graph
Finding how one
factor affects another
Testing for
a relation
between 2 sets
Calculate
mean and
95% CI from
replicates
Measurements
start
here
What
kind
of
test?
Testing for
a difference
between sets
Linear regression
Add Trendline to graph and
Display Equation.
Gives slope and intercept of line
same
individuals
Paired t-test
=TTEST(range1, range2, 2, 1)
If P<5% then significant difference
If P>5% then no significant difference
different
individuals
Unpaired t-test
=TTEST(range1, range2, 2, 2)
If P<5% then significant difference
If P>5% then no significant difference
2 sets
Plot
bar
graph
What
kind
of
data?
>2 sets
Frequencies (counts)
Comparing observed
counts to a theory
What
kind
of
test?
Testing for a difference
between counts
Testing for an association
between groups of counts
HGS Biology
Pearson correlation coefficient
=CORREL (range 1, range 2)
0 = no correlation
1 = perfect correlation
ANOVA
Tools menu > Data analysis > Anova
If P<5% then significant difference
If P>5% then no significant difference
2 test
=CHITEST(obs range, exp range)
If P<5% then disagree with theory
If P>5% then agree with theory
2 test
=CHITEST(obs range, exp range)
If P<5% then significant difference
If P>5% then no significant difference
2 test for association
=CHITEST(obs range, exp range)
If P<5% then significant association
If P>5% then no significant association
NCM/11/01
A2 statistics page 2
Statistics to Test for a Correlation
Correlation statistics are used to investigate an association between two factors such as age and height;
weight and blood pressure; or smoking and lung cancer. After collecting as many pairs of measurements
as possible of the two factors, plot a scatter graph of one against the other. If both factors increase
together then there is a positive correlation, or if one factor decreases when the other increases then there
is a negative correlation. If the scatter graph has apparently random points then there is no correlation.
variable 1
No Correlation
variable 2
Negative Correlation
variable 2
variable 2
Positive Correlation
variable 1
variable 1
There are two statistical tests to quantify a correlation: the Pearson correlation coefficient (r), and
Spearman's rank-order correlation coefficient (rs). These both vary from +1 (perfect correlation)
through 0 (no correlation) to –1 (perfect negative correlation). If your data are continuous and normallydistributed use Pearson, otherwise use Spearman. In both cases the larger the absolute value (positive or
negative), the stronger, or more significant, the correlation. Values grater than 0.8 are very significant,
values between 0.5 and 0.8 are probably significant, and values less than 0.5 are probably insignificant.
In Excel the Pearson coefficient r is calculated using the formula: =CORREL (X range, Y range) . To
calculate the Spearman coefficient rs, first make two new columns showing the ranks (or order) of the two
sets of data, and then calculate the Pearson correlation on the rank data. The highest value is given a rank
of 1, the next highest a rank of 2 and so on. Equal values are given the same rank, but the next rank
should allow for this (e.g. if there are two values ranked 3, then the next value is ranked 5).
In this example the size of breeding
pairs of penguins was measured to see if
there was correlation between the sizes
of the two sexes. The scatter graph and
both correlation coefficients clearly
indicate a strong positive correlation. In
other words large females do pair with
large males. Of course this doesn't say
why, but it shows there is a correlation
to investigate further.
HGS Biology
NCM/11/01
A2 statistics page 3
Linear Regression to Investigate a Causal Relationship.
If you know that one variable causes the changes in the other variable, then there is a causal relationship.
In this case you can use linear regression to investigate the relation in more detail. Regression fits a
straight line to the data, and gives the values of the slope and intercept of that line (m and c in the
equation y = mx + c).
The simplest way to do this in Excel is to plot
a scatter graph of the data and use the trendline
feature of the graph. Right-click on a data
point on the graph, select Add Trendline, and
choose Linear. Click on the Options tab, and
select Display equation on chart. You can also
choose to set the intercept to be zero (or some
other value). The full equation with the slope
and intercept values are now shown on the
chart.
In this example the absorption of a yeast cell suspension is plotted against its cell concentration from a
cell counter. The trendline intercept was fixed at zero (because 0 cells have 0 absorbance), and the
equation on the graph shows the slope of the regression line.
The regression line can be used to make quantitative predictions. For example, using the graph above, we
could predict that a cell concentration of 9 x 107 cells per cm3 would have an absorbance of 1.37 (9 x
0.152).
HGS Biology
NCM/11/01
A2 statistics page 4
T-Test to Compare Two Sets of Data
Another common form of data analysis is to compare two sets of measurements to see if they are the same
or different. For example are plants treated with fertiliser taller than those without? If the means of the
two sets are very different, then it is easy to decide, but often the means are quite close and it is difficult
to judge whether the two sets are the same or are significantly different. To compare two sets of data use
the t-test, which tells you the probability (P) that there is no difference between the two sets. This is
called the null hypothesis.
P varies from 0 (impossible) to 1 (certain). The higher the probability, the more likely it is that the two
sets are the same, and that any differences are just due to random chance. The lower the probability, the
more likely it is that that the two sets are significantly different, and that any differences are real. Where
do you draw the line between these two conclusions? In biology the critical probability is usually taken as
0.05 (or 5%). This may seem very low, but it reflects the facts that biology experiments are expected to
produce quite varied results. So if P > 5% then the two sets are the same (i.e. accept the null hypothesis),
and if P < 5% then the two sets are different (i.e. reject the null hypothesis). For the t test to work, the
number of repeats should be at least 5.
In Excel the t-test is performed using the formula: =TTEST (range1, range2, tails, type) . For the examples
you'll use in biology, tails is always 2 (for a "two-tailed" test), and type can be either 1 for a paired test
(where the two sets of data are from the same individuals), or 2 for an unpaired test (where the sets are
from different individuals). The cell with
the t test P should be formatted as a
percentage (Format menu > cell > number
tab > percentage). This automatically
multiplies the value by 100 and adds the %
sign. This can make P values easier to read
and understand. It’s also a good idea to plot
the means as a bar chart with error bars to
show the difference graphically.
In the first example the yield of potatoes in
10 plots treated with one fertiliser was
compared to that in 10 plots treated with
another fertiliser. Fertiliser B delivers a
larger mean yield, but the unpaired t-test P
shows that there is a 8% probability that this
difference is just due to chance. Since this is
>5% we accept the null hypothesis that
there is no significant difference between
the two fertilisers.
In the second example the pulse rate of 8
individuals was measured before and after
eating a large meal. The mean pulse rate is
certainly higher after eating, and the paired
t-test P shows that there is only a tiny
0.005% probability that this difference is
due to chance, so the pulse rate is
significantly higher after a meal.
HGS Biology
NCM/11/01
A2 statistics page 5
ANOVA to Compare >2 sets of Data
The t test is limited to comparing two sets of data, so to compare many groups at once you need analysis
of variance (ANOVA). From the Excel Tools menu select Data Analysis then ANOVA Single Factor.
This brings up the ANOVA dialogue box, shown here.
 Enter the Input Range by clicking in the
box then selecting the range of cells
containing the data, including the
headings.
 Check that the columns/rows choice is
correct (this example is in three columns),
and click in Labels in First Row if you
have included these. The column headings
will appear in the results table.
 Leave Alpha at 0.05 (for the usual 5%
significance level).
 Click in the Output Range box and click
on a free cell on the worksheet, which will
become the top left cell of the 8 x 15-cell
results table.
 Finally press OK.
The output is a large data table, and
you may need to adjust the column
widths to read it all. At this point
you should plot a bar graph using the
averages column for the bars and the
variance column for the error bars.
The most important cell in the table
is the P-value, which as usual is the
probability that the null hypothesis
(that there is no difference between
any of the data sets) is true. This is
the same as a t-test probability, and
in fact if you try ANOVA with just
two data sets, it returns the same P
as a t test. If P > 5% then there is no
significant difference between any of
the data sets (i.e. the null hypothesis
is true), but if P < 5% then at least
one of the groups is significantly
different from the others.
In the example on this page, which concerns the grain yield from three different varieties of wheat, P is
0.14%, so is less than 5%, so there is a significant difference somewhere. The problem now is to identify
where the difference lies. This is done by examining the variance column in the summary table. In this
example, varieties 2 and 3 are very similar, but variety 1 is obviously the different one. So the conclusion
would be that variety 1 has a significantly lower yield than varieties 2 and 3.
Download:
HGS Biology
http://www.analystsoft.com/en/products/statplusmac/
NCM/11/01
A2 statistics page 6
Chi-squared Test for Frequency Data
Sometimes the data from an experiment are not measurements but counts (or frequencies) of things, such
as counts of different phenotypes in a genetics cross, or counts of species in different habitats. With
frequency data you can’t usually calculate averages or do a t test, but instead you do a chi-squared (2)
test. This compares observed counts with some expected counts and tells you the probability (P) that
there is no difference between them. In Excel the 2 test is performed using the formula:
=CHITEST (observed range, expected range) . There are three different uses of the test depending on how
the expected data are calculated.
Sometimes the expected data can be calculated from a quantitative theory, in which case you are
testing whether your observed data agree with the theory. If P < 5% then the data do not agree with
the theory, and if P > 5% then the data do agree with the theory. A good example is a genetic cross, where
Mendel’s laws can be used to predict frequencies of
different phenotypes. In this example Excel
formulae are used to calculate the expected values
using a 3:1 ratio of the total number of
observations. The 2 P is 53%, which is much
greater than 5%, so the results do indeed support
Mendel’s law. Incidentally a very high P (>80%) is
suspicious, as it means that the results are just too
good to be true.
1.
2.
Other times the expected data are calculated by assuming that the counts in all the categories should
be the same, in which case you are testing whether there is a difference between the counts. If P <
5% then the counts are significantly different from
each other, and if P > 5% then there is no significant
difference between the counts. In the example above
the sex of children born in a hospital over a period
of time is compared. The expected values are
calculated by assuming there should be equal
numbers of boys and girls, and the 2 P of 6.4% is
greater than 5%, so there is no significant difference
between the sexes.
If the count data are for categories in two groups, then the expected data can be calculated by
assuming that the two groups are independent. If P < 5% then there is a significant association
between the two groups, and if P > 5% then the two groups are independent. Each group can have counts
in two or more categories, and the observed frequency data are set out in a table, called a contingency
table. A copy of this table is then made for the expected data, which are calculated for each cell from the
corresponding totals of the observed data, using the formula E = column total x row total / grand total . In this
example the flow rate of a stream (the two categories fast / slow) is compared to the type of stream bed
(the four categories weed-choked / some weeds / shingle / silt) at 50 different sites to see if there is an
association between them. The
2 P of 1.1% is less than 5%, so
there is an association between
flow rate and stream bed.
3.
HGS Biology
NCM/11/01
A2 statistics page 7
Problems
1. In a test of two drugs 8 patients were given one drug and 8 patients another drug. The number of hours of relief from
symptoms was measured with the following results:
Drug A
3.2
1.6
5.7
2.8
5.5
1.2
6.1
2.9
Drug B
3.8
1.0
8.4
3.6
5.0
3.5
7.3
4.8
Find out which drug is better by calculating the mean and 95% confidence limit for each drug, then use an appropriate
statistical test to find if it is significantly better than the other drug.
2. In one of Mendel's dihybrid crosses, the following types and numbers of pea plants were recorded in the F2 generation:
Yellow round seeds
Yellow wrinkled seeds
Green round seeds
Green wrinkled seeds
289
122
96
39
According to theory these should be in the ratio of 9:3:3:1. Do these observed results agree with the expected ratio?
3. The areas of moss growing on the north and south sides of a group of trees were compared.
North side
20
43
53
86
70
of tree
South side
63
11
21
54
9
of tree
Is there a significant difference between the north and south sides?
54
74
4. Five mammal traps were placed in randomly-selected positions in a deciduous wood. The numbers of field mice captured in
each trap in one day were recorded. The results were:
Trap
A
B
C
D
E
no. of mice
22
26
21
8
23
Trap D caught far fewer mice than the others. Did this happen by chance or is the result significant?
5. In an investigation into pollution in a stream, the concentration of nitrates was measured at six different sites, and a
diversity index was calculated for the species present.
Site
1
2
3
4
5
6
413.3
439.7
726
850
567.3
766.7
Conductivity (S)
Diversity index
7.51
5.17
4.49
3.82
5.88
3.74
Is there a correlation between conductivity and diversity, and how strong is it? (The diversity index is calculated from biotic
data, so is not normally distributed.)
6. The blood groups of 400 individuals, from 4 different ethnic groups were recorded with the following results:
Ethnic group
Blood Group O
Blood Group A
Blood Group B
Blood Group AB
1
46
40
7
3
2
48
39
12
2
3
53
33
12
4
4
55
30
13
3
Is there as association between blood group and ethnic group?
7. The effect of enzyme concentration on rate of a reaction was investigated with the following results.
Enzyme concentration (mM)
0
0.1
0.2
0.5
0.8
1.0
Rate (arbitrary units)
0
0.8
1.1
3.2
6.6
7.2
Plot a graph of these results, fit a straight line to the data, and find the slope of this line. Use the slope to predict the rate at
an enzyme concentration of 0.7mM.
HGS Biology
NCM/11/01
A2 statistics page 8
Alternative flow chart, showing the non-parametric tests (which are not available in Excel):
Parametric Test
Testing for a
correlation
Plot
scatter
graph
Finding how
one factor
affects another
Testing for
a relation
between 2 sets
Calculate
mean and
95% CI from
replicates
Pearson correlation coefficient
=CORREL (range 1, range 2)
0 = no correlation
1 = perfect correlation
same
individuals
Spearman correlation coefficient
=CORREL (range 1, range 2)
on ranks of data
0=no correlation
1=perfect correlation
Linear regression
Add Trendline to graph and
Display Equation.
Gives slope and intercept of line
Parametric Test
What
kind
of
test?
Non-Parametric Test
Paired t-test
=TTEST(range1, range2, 2, 1)
If P<5% then sig. difference
If P>5% then no sig. difference
Non-Parametric Test
Wilcoxon Matched Pairs test
Not available in Excel
2 sets
Measurements
Testing for
a difference
between sets
Parametric Test
different
individuals
Plot
bar
graph
Unpaired t-test
=TTEST(range1, range2, 2, 2)
If P<5% then sig. difference
If P>5% then no sig. difference
Parametric Test
start
here
What
kind
of
data?
>2 sets
Frequencies (counts)
Comparing observed
counts to a theory
What
kind
of
test?
Testing for a difference
between counts
Testing for an association
between groups of counts
HGS Biology
ANOVA
Tools menu >Data analysis >Anova
If P<5% then sig. difference
If P>5% then no sig. difference
Non-Parametric Test
Mann-Whitney U-test
Not available in Excel
Non-Parametric Test
Kruskal-Wallis test
Not available in Excel
2 test
=CHITEST(obs range, exp range)
If P<5% then disagree with theory
If P>5% then agree with theory
2 test
=CHITEST(obs range, exp range)
If P<5% then sig. difference
If P>5% then no sig. difference
2 test for association
=CHITEST(obs range, exp range)
If P<5% then sig. association
If P>5% then no sig. association
NCM/11/01