Data Analysis using Excel:
A great deal of information can be obtained by using statistical features available in a
program such as Excel.
The “Data Analysis” facility of Excel is available in the drop-down menu under
Tools (usually at the bottom).
(If it is not listed there, add it by choosing “Add-Ins” from the Tools menu, probably near the bottom of
the list. Tick “Analysis ToolPak” at the top and this will make “Data Analysis” available in the drop-down menu.)
Suppose we copy data as follows into Excel (this is made-up data but might represent
the number of insects found on 7 rose bushes in particular conditions).
       Number of Insects
1      5
2      6
3      7
4      5
5      7
6      9
7      4
To get a range of summary statistics, from Tools:
Select Data Analysis
Click on Descriptive Statistics
If you had not first highlighted the data, highlight it as the Input Range
(Note: Highlight only the column of actual values)
If you do include the heading with the data, tick the Labels box
Output Range should be the cell where you want the results displayed, e.g. A10
Tick Summary Statistics
You should get something like the following. (You may need to make column A wider.
This can be done by double-clicking on the line between A and B at the top of
the spreadsheet.)
Number of Insects
Mean                  6.142857143
Standard Error        0.633530224
Median                6
Mode                  5
Standard Deviation    1.67616342
Sample Variance       2.80952381
Kurtosis              0.051881643
Skewness              0.582443916
Range                 5
Minimum               4
Maximum               9
Sum                   43
Count                 7
The mean, median and mode are, as described, measures of centre.
Range, Minimum and maximum give information on the spread of the
values.
Sum gives the sum of the 7 values and count gives the number of data values.
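As a cross-check on the Excel output, the same figures can be reproduced outside Excel. The following is a minimal sketch in Python using only the standard library; the function name `describe` is made up for illustration, and kurtosis and skewness are left out because they are not needed for this course.

```python
import math
import statistics

# Insect counts for the 7 rose bushes (the made-up data from the example).
insects = [5, 6, 7, 5, 7, 9, 4]


def describe(values):
    """Return the main Descriptive Statistics figures as a dictionary."""
    n = len(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return {
        "Mean": statistics.mean(values),
        "Standard Error": sd / math.sqrt(n),
        "Median": statistics.median(values),
        # multimode lists the most common values in order of first appearance;
        # taking the first one is an assumption that happens to match this data.
        "Mode": statistics.multimode(values)[0],
        "Standard Deviation": sd,
        "Sample Variance": statistics.variance(values),
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": n,
    }


summary = describe(insects)
for name, value in summary.items():
    print(f"{name:20s}{value}")
```

Checking Sum and Count against what you expect is a quick way to confirm that all 7 values were read.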
Note: Check a few simple values carefully. It is easy to get wrong
information from Excel, which is one of the reasons you need to think
carefully about what you put into a statistics package. For example, it is easy to
include the label by mistake and get the following output:
5
Mean                  6.333333333
Standard Error        0.714920353
Median                6.5
Mode                  7
Standard Deviation    1.751190072
Sample Variance       3.066666667
Kurtosis              0.014177694
Skewness              0.248278366
Range                 5
Minimum               4
Maximum               9
Sum                   38
Count                 6
The first data item has been counted as a label, and the count and
heading give an indication that this is a problem. However if you just
trust the computer, you will be performing analysis with wrong values.
Standard error gives a measure of how much the sample mean may vary
from sample to sample. See the notes on standard error below.
Standard deviation and Sample variance both give a measure of the spread of
the data values. Sample variance is just the square of the sample standard
deviation. See notes below.
Kurtosis is a measure of how peaked the distribution is, and is not needed for
this course.
Skewness is a measure of whether there is a long tail of data on one side or other
of the centre, and is not needed for this course.
Standard error
Standard error is the standard deviation of the sample means.
Where we have had 7 values and produced a sample mean of 6.14, if we took a different
sample of 7 we would not expect to get the same sample mean of 6.14 again. It would be
slightly different.
If we kept taking samples of 7 from the same population and drew a histogram of all the
means, it would be approximately a Normal distribution. The means would be centred
at the same place the population mean is centred, but they would not be as spread out as
the original population. Because we are averaging 7 values, the effect of any extreme
value tends to be cancelled out by other values and we are likely to get a much narrower
range of values. For example, finding a rose bush with 9 insects would not be unexpected,
but finding 7 bushes with a mean number of insects of 9 (that is, finding 63 insects in total
on the 7 rose bushes) would be unexpected.
The bigger the sample size we take, the smaller the spread of the sample means is likely to be. In fact the
standard deviation of all the sample means is $\frac{s}{\sqrt{n}}$, where s is the population standard
deviation (well, our estimate of it, since we really are trying to predict this) and n is the
number of values in each sample (the sample size).
Therefore the standard error found in the Descriptive Statistics could have been calculated
by dividing the standard deviation by the square root of the count:
$\frac{1.676}{\sqrt{7}} = 0.633$
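The idea of repeatedly taking samples of 7 can be tried out by simulation. The sketch below is only an illustration: it assumes a hypothetical population that is Normal with mean 6.14 and standard deviation 1.676 (the estimates from the sample), and the function name and seed are arbitrary choices.

```python
import math
import random
import statistics

# Assumed (hypothetical) population: Normal, using the sample estimates.
POP_MEAN = 6.14
POP_SD = 1.676
SAMPLE_SIZE = 7


def simulate_sample_means(num_samples, seed=1):
    """Draw many samples of SAMPLE_SIZE values and return the mean of each."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
        means.append(sum(sample) / SAMPLE_SIZE)
    return means


means = simulate_sample_means(20000)
spread_of_means = statistics.stdev(means)     # observed spread of the sample means
predicted = POP_SD / math.sqrt(SAMPLE_SIZE)   # standard error, s / sqrt(n)

print(f"Centre of the sample means: {statistics.mean(means):.3f}")
print(f"Spread of the sample means: {spread_of_means:.3f}")
print(f"Predicted by s / sqrt(n):   {predicted:.3f}")
```

The spread of the simulated means should come out close to 0.633, much narrower than the 1.676 of the individual values.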
In real life we do not take a lot of samples of size 7. We take one sample and from this
we calculate the sample mean and the sample standard deviation. Using this information
with the sample size, we make predictions about the whole population.
Suppose our sample came from a market garden with a large number of rose bushes from
which we randomly selected 7 to count insects. We found the sample mean was 6.14 and
the sample standard deviation was 1.676.
We use this to predict that the mean number of insects per rose bush for the whole
population of rose bushes is 6.14.
We predict that the population standard deviation is 1.676.
If any sample of 7 rose bushes is taken, our best prediction for the mean of this sample is
6.14.
If a large number of samples of size 7 were taken, we think the means would fit a Normal
distribution with mean 6.14 and standard deviation of 0.633.
[Figure: distribution of insects per rose bush, and distribution of the mean number of insects per rose bush for samples of 7 rose bushes]
Confidence Intervals
Often we want to predict what the mean value of a particular population is.
The confidence interval gives a range of values within which we expect the true mean
number of insects present in the set conditions to lie.
In Excel, this can be found using the same process as described for Descriptive Statistics,
but as well as ticking Summary Statistics, tick Confidence Level for Mean.
This adds one more line at the bottom of the previous summary.
Number of Insects
Mean                       6.142857143
Standard Error             0.633530224
:                          :
Confidence Level(95.0%)    1.550193746
Using this, we can be 95% confident that the mean number of insects is between
6.14 - 1.55 and 6.14 + 1.55.
That is, the mean number of insects on rose bushes is between 4.59 and 7.69.
It is possible with other analysis to find the confidence interval for the data values
themselves or for proportions or the difference between the means of two populations.
Understanding how this calculation is made, when the sample size is less than 30,
requires statistics at University level. However a brief explanation is given here.
When we have a large enough sample size (about 30 or more), we use the Normal
distribution to create the predicted range. The Normal tables provided in statistics books
give the probability of a value being between 0 and a number. For example, the number
2.3 gives a table value of 0.4893. This means that the probability that a data value is
between the mean and 2.3 standard deviations above the mean is 0.4893.
In Excel, =NORMSDIST(2.3) gives 0.989276, which is the probability that a data
value is less than (the mean + 2.3 standard deviations). That is, it includes the 0.5 that are
below the mean too.
That is:
Tables give    P(0 < Z < 2.3) = 0.4893
Excel gives    P(Z < 2.3) = 0.9893
[Figure: Normal curves marking z = 2.3 for each of the two cases]
Note that because the Normal distribution is symmetrical, this one probability value
also tells us:
P(-2.3 < Z < 0)    = 0.4893
P(Z > 2.3)         = 0.5 - 0.4893 = 0.0107
P(Z < 2.3)         = 0.5 + 0.4893 = 0.9893
P(Z < -2.3)        = 0.0107
P(Z > -2.3)        = 0.9893
P(-2.3 < Z < 2.3)  = 2 x 0.4893 = 0.9786
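These table look-ups can also be checked with a few lines of Python. This is a small sketch using `NormalDist` from the standard library; the helper name `table_value` is invented here to mimic how printed Normal tables are laid out.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, standard deviation 1


def table_value(z):
    """Area between 0 and z, which is what printed Normal tables give."""
    return Z.cdf(z) - 0.5


z = 2.3
below = Z.cdf(z)  # P(Z < 2.3), the value =NORMSDIST(2.3) returns in Excel

print("P(0 < Z < 2.3)    =", round(table_value(z), 4))
print("P(Z < 2.3)        =", round(below, 4))
print("P(Z > 2.3)        =", round(1 - below, 4))
print("P(-2.3 < Z < 2.3) =", round(2 * table_value(z), 4))
```

The cumulative value is the one Excel reports, and subtracting 0.5 converts it to the table value.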
The number 1.96 on Normal tables gives probability 0.4750, meaning the probability that
a value is between the mean and 1.96 standard deviations above the mean is 0.4750. The probability
that it is between 1.96 standard deviations below the mean and the mean is also 0.4750.
Therefore the probability that a data value is within 1.96 standard deviations of the mean
is 0.4750 + 0.4750 = 0.95.
From Excel =NORMSDIST(1.96) gives 0.975002
meaning 2.5% of values will be above a z value of 1.96
Hence we expect that 95% of the time values will lie within 1.96 standard deviations of
the mean.
So the 95% confidence interval for a data value is from
mean - 1.96 x std dev to mean + 1.96 x std dev.
Thus if we know that the number of insects on rose bushes is a Normal distribution with
mean 6.14 and standard deviation 1.676, we expect that 95% of the time the number of
insects on any randomly selected rose bush would be between:
6.14 - 1.96 x 1.676 to 6.14 + 1.96 x 1.676
= 2.855 to 9.425
We use the sample mean and the sample standard deviation to predict the value of the
population mean and the population standard deviation. If the sample is of about size 30
or more, this is usually a reasonable prediction.
t-distribution
When the sample size is less than 30, the values provided in the Normal tables (or using
the NORMSDIST function on Excel) are too small. Because we are unsure of the
population values we have to take a wider interval to be 95% confident. The smaller the
sample size, the larger the number of standard deviations we have to be away to include
the true value 95% of the time. To find how many standard deviations we need to take,
we look up t-tables (instead of Normal tables). On the computer this means using the TINV function.
To use these, however, the number of degrees of freedom (1 less than the sample size, in
general) also has to be given.
[Figure: Normal distribution with z = 1.96 marked, compared with a t distribution, which allows for a smaller sample size and so makes the 95% confidence band much wider]
In our case, we had a sample size of 7 so we have 6 degrees of freedom. (Refer also to
page 21). In Excel we would use =TINV(0.05,6) to get the value 2.446914 as the number
of standard deviations that we need to be away.
The 95% confidence interval means that there is a 5% chance that we could get a value
outside this range. In a two-tailed confidence interval, this means we want 2.5% above
the higher value and 2.5% below. In t-tables, we therefore look up the 97.5 percentile
with 6 degrees of freedom to get the value 2.447.
From this, we can calculate the 95% confidence interval for the number of insects on a
rose bush to be between
6.14 - 2.447 x 1.676 and 6.14 + 2.447 x 1.676
= 2.0388 and 10.2412
Note: the Confidence Level(95.0%) value provided by Excel is 1.550193746, whereas the
half-width of the interval above is 2.447 x 1.676 = 4.101172. This is because the value Excel gives is
for the mean number of insects in samples of size 7, not the actual
number on a single bush, and
$2.447 \times \frac{1.676}{\sqrt{7}} = 1.55$
If we want a 95% confidence interval for the mean of the number of insects per bush in a
sample of 7 rose bushes, rather than use the standard deviation, we use the standard error.
Therefore we calculate the 95% confidence interval for the sample mean for insects per
bush to be:
$6.14 - 2.447 \times \frac{1.676}{\sqrt{7}}$ and $6.14 + 2.447 \times \frac{1.676}{\sqrt{7}}$
= 4.59 and 7.69
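The interval for the mean can be reproduced step by step. The sketch below assumes the t-value 2.447 is simply typed in from t-tables or from =TINV(0.05,6), since the Python standard library has no t-distribution function; the function name is made up for illustration.

```python
import math
import statistics

insects = [5, 6, 7, 5, 7, 9, 4]
T_VALUE = 2.447  # 6 degrees of freedom, 95% confidence, as quoted in the text


def confidence_interval_for_mean(values, t_value):
    """Return (lower, upper, margin) for the mean, using the standard error."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    margin = t_value * standard_error  # Excel reports this as Confidence Level(95.0%)
    return mean - margin, mean + margin, margin


lower, upper, margin = confidence_interval_for_mean(insects, T_VALUE)
print(f"margin = {margin:.3f}")
print(f"95% confidence interval for the mean: {lower:.2f} to {upper:.2f}")
```

Replacing the standard error with the standard deviation in the margin would give the wider interval for the count on a single bush instead.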
Note: it is also possible to set up confidence intervals for
• differences between the means of two populations, where standard error is found as
  $\sqrt{\frac{s_1^2}{n} + \frac{s_2^2}{n}}$, where s1 is the standard deviation of one population, s2 is the
  standard deviation of the other and n is the sample size,
  degrees of freedom = df = (n-1) when data is paired or 2(n-1) if data is not paired
  (that is, there is no link in each row, such as a before and after treatment or twins, etc.).
  Refer to the next section on difference between means.
• the proportion of an organism that will display a particular behaviour, where standard
  error is $\sqrt{\frac{p(1-p)}{n}}$, where p is the proportion displaying the property and n is
  the sample size, df = n - 1.
Difference between two means
Suppose we want to compare some treatments.
For example some process has been carried out to attract insects and the results of
observations are:
Before    After
5         7
6         9
7         12
5         10
7         8
9         6
4         6
No matter whether there has been an effect or not, we would expect there to be some
differences between the results before and after simply because the results were taken
independently and at different times.
We want to know if there is a significant difference between the results and look at the
difference between means.
Our Descriptive Statistics package gives:
                           Before         After
Mean                       6.14285714     8.285714286
Standard Error             0.63353022     0.837066468
Median                     6              8
Mode                       5              6
Standard Deviation         1.67616342     2.214669706
Sample Variance            2.80952381     4.904761905
Kurtosis                   0.05188164     -0.42320671
Skewness                   0.58244392     0.65757466
Range                      5              6
Minimum                    4              6
Maximum                    9              12
Sum                        43             58
Count                      7              7
Confidence Level(95.0%)    1.55019375     2.048229359
But this does not tell us if there is any significant difference between the two sets of data.
There are on average more insects after treatment, but could this just have occurred by
chance?
We can get a confidence interval for this difference in means.
Mean difference found = 8.285714286 - 6.14285714 = 2.142857
Standard error = $\sqrt{0.63353022^2 + 0.837066468^2}$ = 1.049781
t-value with 12 degrees of freedom and 95% confidence is 2.179
Therefore 95% of the time we expect the true difference to be between
2.143 - 2.179 x 1.050 and 2.143 + 2.179 x 1.050
= -0.144 and 4.43
Since 0 lies in this interval the true difference could be 0, so there is not enough evidence
to say that this difference is significant.
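The calculation of this interval can be laid out as follows. This is a sketch only: the t-value 2.179 is taken from the text rather than computed, and the helper name `standard_error` is an illustrative choice.

```python
import math
import statistics

before = [5, 6, 7, 5, 7, 9, 4]
after = [7, 9, 12, 10, 8, 6, 6]
T_VALUE = 2.179  # 12 degrees of freedom, 95% confidence, as quoted in the text


def standard_error(values):
    """Standard deviation divided by the square root of the count."""
    return statistics.stdev(values) / math.sqrt(len(values))


difference = statistics.mean(after) - statistics.mean(before)
# Standard error of the difference: square root of the sum of squared standard errors.
se_diff = math.sqrt(standard_error(before) ** 2 + standard_error(after) ** 2)
lower = difference - T_VALUE * se_diff
upper = difference + T_VALUE * se_diff

print(f"difference = {difference:.6f}, standard error = {se_diff:.6f}")
print(f"95% interval: {lower:.3f} to {upper:.3f}")
print("0 inside interval:", lower < 0 < upper)
```

Because the rounded t-value 2.179 is used, the bounds may differ from Excel's in the third decimal place.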
t-test
Another way to determine if there is a significant difference is to do a t test.
Using the function =TTEST(B2:B8,C2:C8,2,2) we can get the probability of this
difference occurring by chance.
Note: This function is available in Excel by Insert from drop down menu, then Function
and choose TTEST. This will guide you through what you need to enter.
In this case it produces the value 0.063851, so there is about a 6% chance of a difference
at least this large occurring simply by chance. Traditionally in Statistics we consider 5% to be the critical
value, so we would report that there was not a significant difference between the before
and after treatment in the number of insects.
The t test is based on the difference between the means of the two sets. Whereas the
confidence interval gives a range for the true difference, the test gives the probability that the difference we
obtained would be this far from 0.
In =TTEST(B2:B8,C2:C8,2,2) the 4 sets of information in brackets are:
B2: B8 showing the cells that hold the first set of data
C2:C8 showing the cells that hold the second set of data.
2 The number of tails, 2 in this case. That is, we are interested
in whether the two treatments are not equal.
If we had only been interested in whether the treatment had increased the
number of insects we would have used a one-tailed test. The 5%
chance of making a wrong prediction (when we think of 0.05 as the critical
value) is all at one tail end now instead of having 2.5% at each end.
(In fact, in this case if we had used the one-tailed test
=TTEST(B2:B8,C2:C8,1,2) we would have had a result of 0.031925, and
this would have been a significant result.)
[Figure: two-tailed test with 2.5% at each end; 95% of the time we expect the true mean difference is in this interval]
[Figure: one-tailed test with 5% of values at one end; 95% of the time we expect the true mean difference to be greater than the lower bound]
2 The final 2 is an indication of the type of data we are using (here, two independent
samples assumed to have equal variance). If the data was paired (that is, there was a link
between the values in each row, say both being recorded on Day 1 of a 7 day experiment)
we would have entered 1 for this type. If we have data that comes from populations where
the variance of each population is different, we would have entered 3 for this type of test.
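To see what sits behind a call such as =TTEST(B2:B8,C2:C8,2,2), the sketch below works through an equal-variance two-sample t test by hand. It is an illustration, not Excel's actual implementation: the function names are invented, and the tail probability is estimated by numerically integrating the t distribution with Simpson's rule because the Python standard library has no t-distribution function.

```python
import math
import statistics

before = [5, 6, 7, 5, 7, 9, 4]
after = [7, 9, 12, 10, 8, 6, 6]


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def area_between_zero_and(t, df, steps=2000):
    """Simpson's rule estimate of P(0 < T < t); steps must be even."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return total * h / 3


def pooled_t_test(a, b, tails=2):
    """Equal-variance two-sample t test. Returns (t statistic, df, p-value)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(b) - statistics.mean(a)) / se
    one_tail = 0.5 - area_between_zero_and(abs(t_stat), df)
    return t_stat, df, tails * one_tail


t_stat, df, p_two = pooled_t_test(before, after, tails=2)
_, _, p_one = pooled_t_test(before, after, tails=1)
print(f"t = {t_stat:.4f} with {df} degrees of freedom")
print(f"two-tailed p = {p_two:.6f}, one-tailed p = {p_one:.6f}")
```

The two-tailed probability is simply double the one-tailed one, which is why the same data can be significant one way and not the other.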
More details can be obtained using the Data Analysis method on Excel:
t-Test: Paired Two Sample for Means in place of Descriptive Statistics on page 23
This produces
t-Test: Paired Two Sample for Means
                                Before         After
Mean                            6.142857143    8.285714286
Variance                        2.80952381     4.904761905
Observations                    7              7
Pearson Correlation             0.032069736
Hypothesized Mean Difference    0
df                              6
t Stat                          2.073490549
P(T<=t) one-tail                0.041741629
t Critical one-tail             1.943180905
P(T<=t) two-tail                0.083483258
t Critical two-tail             2.446913641
The “t Stat” value shows that the sample mean difference is 2.07 standard errors away from
the hypothesised mean difference of 0.
If a one-tailed test is used, the probability of this occurring by chance is 0.0417. This is below 0.05
because the t Stat is above the critical value 1.94, which the difference would stay below 95% of the time.
Therefore if we are only interested in whether the “After” treatment is greater than the
“Before” treatment, we have reason from this data to conclude that this is true.
If a two-tailed test is used (that is, we are interested in whether they are different) the critical t-values are -2.447 and +2.447. Since this sample difference corresponds to a t-value of
2.07, which lies between these critical values, we cannot conclude that there is any
significant difference.
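The paired "t Stat" comes from the row-by-row differences. Here is a minimal sketch of that calculation, with the two critical values copied from the Excel output above rather than computed.

```python
import math
import statistics

before = [5, 6, 7, 5, 7, 9, 4]
after = [7, 9, 12, 10, 8, 6, 6]

# Critical values for 6 degrees of freedom, copied from the Excel output.
T_CRITICAL_ONE_TAIL = 1.943
T_CRITICAL_TWO_TAIL = 2.447

# Paired test: work with the difference in each row (After - Before).
differences = [a - b for b, a in zip(before, after)]
n = len(differences)
mean_difference = statistics.mean(differences)
standard_error = statistics.stdev(differences) / math.sqrt(n)
t_stat = mean_difference / standard_error

print("differences:", differences)
print(f"t Stat = {t_stat:.4f} with {n - 1} degrees of freedom")
print("significant, one-tailed:", t_stat > T_CRITICAL_ONE_TAIL)
print("significant, two-tailed:", abs(t_stat) > T_CRITICAL_TWO_TAIL)
```

Only 7 differences are analysed here, so there are 6 degrees of freedom rather than the 12 used when the two columns are treated as unpaired.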
The actual confidence interval can be calculated using:
Two-tailed confidence interval:
Difference        2.142857143    found by subtracting means
Std error diff    1.049781318    found by taking the square root of the sum of the variances divided by 7
t value           2.178813       found by TINV(0.05,12) as 95% confidence and 12 degrees of freedom
Lower bound       -0.14442       difference - t value * std error diff
Upper bound       4.430134       difference + t value * std error diff
95% confidence interval is from -0.14442 to 4.430134
Conclusion: Since 0 is inside this interval we do not have enough
evidence to show that there is a significant difference.
ANOVA
This fancy sounding title just stands for ANalysis Of Variance. This is used where two or
more sets of data are being compared for differences of means.
For the above data, using Excel: from the Tools menu select Data Analysis, then
“Anova: Single Factor”. On the options available, enter:
Input range: B1:C8 (or wherever the data is entered)
Grouped by: select Columns
Labels in first row: tick this box
Output range: choose a cell, say A18
Click OK
You should get the following (without any bolding):
Anova: Single Factor
SUMMARY
Groups    Count    Sum    Average        Variance
Before    7        43     6.142857143    2.809524
After     7        58     8.285714286    4.904762

ANOVA
Source of Variation    SS            df    MS             F           P-value     F crit
Between Groups         16.0714286    1     16.07142857    4.166667    0.063851    4.747221
Within Groups          46.2857143    12    3.857142857
Total                  62.3571429    13
Data that is useful for reporting the results of an experiment from this is highlighted.
The mean number of insects before treatment and the mean number after treatment are
given under the Average column. To give the standard deviation of these, the square root
of the variance can be calculated.
The p-value (the same as calculated by the t test because there were only 2 treatments
being compared) is the probability that such a difference happened by chance.
SS means sum of squares: the result of adding up squared deviations from the mean. It is
the starting point for calculating the variances.
df gives the degrees of freedom. (It is not essential to understand these for this course.
There are 2 treatments and 1 is subtracted to give the degrees of freedom for
treatments.)
MS is the mean squared deviation (SS divided by df)
F is the ratio of the variance between groups to the variance within groups:
$F = \frac{MS_{between}}{MS_{within}}$
F crit is the value above which any F value would be considered to mean that there
was a significant difference. As can be seen, in this case F (4.17) is just below F crit (4.75), so the
result is said to be not significant.
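The SS, df, MS and F columns can be rebuilt directly from the data. The following sketch does this for the Before and After groups; the function name `one_way_anova` is an illustrative choice, and F crit is copied from the Excel output rather than computed.

```python
import statistics

before = [5, 6, 7, 5, 7, 9, 4]
after = [7, 9, 12, 10, 8, 6, 6]
F_CRIT = 4.747221  # copied from the Excel output (1 and 12 degrees of freedom)


def one_way_anova(groups):
    """Return the SS, df, MS and F figures for a single-factor ANOVA."""
    all_values = [v for group in groups for v in group]
    grand_mean = statistics.mean(all_values)
    # Between groups: how far each group mean is from the overall mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: how far each value is from its own group mean.
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "SS between": ss_between, "SS within": ss_within,
        "df between": df_between, "df within": df_within,
        "MS between": ms_between, "MS within": ms_within,
        "F": ms_between / ms_within,
    }


result = one_way_anova([before, after])
for name, value in result.items():
    print(f"{name:12s}{value}")
print("significant:", result["F"] > F_CRIT)
```

The same function accepts any number of groups, which is the more usual situation for ANOVA.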
It is more likely that you will use ANOVA with a greater number of treatments.
e.g.
Temperature:    5°C    10°C    15°C    20°C    25°C
Batch 1         14     12      8       10      15
Batch 2         15     14      4       8       12
Batch 3         18     8       3       5       7
Batch 4         11     17      6       7       16
producing:
Anova: Single Factor
SUMMARY
Groups    Count    Sum    Average    Variance
5         4        48     12         10
10        4        35     8.75       6.25
15        4        21     5.25       4.916667
20        4        40     10         12.66667
25        4        60     15         4.666667

ANOVA
Source of Variation    SS       df    MS        F           P-value     F crit
Between Groups         211.7    4     52.925    6.873377    0.002375    3.055568
Within Groups          115.5    15    7.7
Total                  327.2    19
In this case your conclusion would be that temperature has a significant effect on
whatever is being measured.
The p-value is well below 1% so is very highly significant.
This does not tell you what the actual effects are, though, so you would need to look at a
graph or compare pairs of treatments to determine any pattern or relationship.
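The F value in this output can be checked from the SUMMARY rows alone, using each group's count, average and variance. The sketch below takes those figures exactly as printed in the Excel output; the variable names are illustrative.

```python
# (count, average, variance) for each temperature, copied from the SUMMARY table.
summary = {
    5: (4, 12, 10),
    10: (4, 8.75, 6.25),
    15: (4, 5.25, 4.916667),
    20: (4, 10, 12.66667),
    25: (4, 15, 4.666667),
}

total_count = sum(count for count, _, _ in summary.values())
grand_mean = sum(count * avg for count, avg, _ in summary.values()) / total_count

# Between groups: spread of the group averages around the overall mean.
ss_between = sum(count * (avg - grand_mean) ** 2 for count, avg, _ in summary.values())
# Within groups: each variance is SS / (count - 1), so multiply back up.
ss_within = sum((count - 1) * var for count, _, var in summary.values())

df_between = len(summary) - 1
df_within = total_count - len(summary)
f_value = (ss_between / df_between) / (ss_within / df_within)

print(f"SS between = {ss_between:.1f}, SS within = {ss_within:.1f}")
print(f"df = {df_between} and {df_within}, F = {f_value:.4f}")
```

Comparing the group averages in the same table is also a first step towards seeing which temperatures differ.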
Chi squared test
The most common application for chi-squared is to compare observed counts of
particular cases with the expected counts. If these counts are less than 5, this test does
not work well.
Example:
The results of 88 slaters turning have been recorded. 36 slaters were in the control group
and were not forced to turn previously and 52 were forced to turn left before getting a
choice of direction to turn.
                     Number turning left    Number turning right
Forced left turn     38                     14
No turn (control)    16                     20
This has totals:
54 turning left and 34 turning right
88 slaters in total
If these events had occurred completely randomly, we would have expected:
                     Number turning left      Number turning right
Forced left turn     54/88 x 52 = 31.9        34/88 x 52 = 20.1
No turn (control)    54/88 x 36 = 22.1        34/88 x 36 = 13.9

$\chi^2 = \frac{(38-31.9)^2}{31.9} + \frac{(14-20.1)^2}{20.1} + \frac{(16-22.1)^2}{22.1} + \frac{(20-13.9)^2}{13.9} = 7.378$
The $\chi^2$ critical value with (2-1)(2-1) = 1 degree of freedom is 3.84
(found on Excel using =CHIINV(0.05,1) which gives 3.84145534)
The number of degrees of freedom is found by
(number of rows of data - 1) x (number of columns of data - 1)
Therefore if we had 3 choices of Forced left turn, No turn, Forced right turn, the
number of degrees of freedom would have been (3 - 1) x (2 - 1) = 2
Since the value we have calculated (7.378) is greater than 3.84, we can conclude that
whether a slater turns left or right is not independent of whether it was forced to turn left
last time or not.
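The whole slater calculation can be sketched in a few lines. Two things in this sketch are assumptions rather than part of the worked example: the expected counts are kept unrounded, so the statistic comes out as about 7.356 rather than 7.378, and the probability is obtained with a shortcut that is valid only for 1 degree of freedom.

```python
import math

# Observed counts: rows are Forced left turn and No turn (control);
# columns are Number turning left and Number turning right.
observed = [[38, 14],
            [16, 20]]
CRITICAL_VALUE = 3.84  # from =CHIINV(0.05,1), as quoted in the text


def expected_counts(table):
    """Expected count for each cell = row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(table):
    """Sum of (observed - expected) squared / expected over every cell."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))


chi2 = chi_squared(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
# With exactly 1 degree of freedom, P(chi-squared > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print("expected:", [[round(e, 1) for e in row] for row in expected_counts(observed)])
print(f"chi-squared = {chi2:.6f}, df = {degrees_of_freedom}, p = {p_value:.6f}")
print("not independent:", chi2 > CRITICAL_VALUE)
```

Either way the statistic is well above 3.84, so the conclusion is unchanged by the rounding.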
In Excel, the calculations have to be made of expected value and stored in cells in the
same way that the observed values are recorded. This can be done using cell references,
so that no mistakes are made in rewriting numbers.
A function =CHITEST can then be entered by giving the range of cells where the
observed data is, and the range of cells where the expected data is. This will return the
probability of such data happening randomly if the two factors being looked at are
independent.
Example:
Observed values:
                         Recovery Rate
              Immediate    Within 2 days    Within 4 days    Not          Total
Control       3            8                15               24           50
Medicine 1    6            15               8                3            32
Medicine 2    4            7                17               18           46
Total         13           30               40               45           128

Expected values:
                         Recovery Rate
              Immediate    Within 2 days    Within 4 days    Not
Control       5.078125     11.71875         15.625           17.578125
Medicine 1    3.25         7.5              10               11.25
Medicine 2    4.671875     10.78125         14.375           16.171875
Each expected value is calculated by dividing the column total by the
overall total (128) and multiplying by the row total.
For example, to calculate the expected value in the cell for control
group immediate recovery, enter in the cell: = (click into cell with Total
immediate) / (click into cell with Total Total) * (click into cell with
Total Control)
giving = 13/128 * 50 = 5.078125
Chi square result: 0.000870912
(found by entering =CHITEST(B3:E5, B11:E13))
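The expected table and the final probability can be reproduced as follows. This is a sketch under one stated assumption: the tail probability uses a closed-form sum that only works when the number of degrees of freedom is even, which happens to be the case here (6). The function names are illustrative.

```python
import math

# Observed recovery counts: Immediate, Within 2 days, Within 4 days, Not.
observed = [
    [3, 8, 15, 24],   # Control
    [6, 15, 8, 3],    # Medicine 1
    [4, 7, 17, 18],   # Medicine 2
]


def expected_counts(table):
    """Expected count = column total / grand total x row total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[c / grand_total * r for c in col_totals] for r in row_totals]


def upper_tail_even_df(x, df):
    """P(chi-squared > x); this closed form is valid only for even df."""
    half = x / 2
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


expected = expected_counts(observed)
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = upper_tail_even_df(chi2, df)

for row in expected:
    print(row)
print(f"chi-squared = {chi2:.4f}, df = {df}, p = {p_value:.9f}")
```

The very small probability suggests that recovery rate is not independent of the treatment group.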
Note: The slater turning could be analysed showing calculations of (Obs - Exp) as:
Actual      Forced      Control     Total
Left        38          16          54
Right       14          20          34
Total       52          36          88

Expected    Forced      Control
Left        31.90909    22.09091
Right       20.09091    13.90909

(Obs - Exp)sqd    Exp         (Obs - Exp)sqd / Exp
37.09917          31.90909    1.162652
37.09917          20.09091    1.846565
37.09917          22.09091    1.679386
37.09917          13.90909    2.667261
Sum:                          7.355865

0.006684    probability found by =CHITEST(B2:C3,B7:C8)
3.841455    the critical value of Chi squared, found using CHIINV(0.05,1)
0.006684    probability found by =CHIDIST(H12,1) where H12 contains the sum.
A warning with statistical interpretation: Lurking variables
Statistics can only be as good as the person who designs the experiment and considers all
possible influences. If the experimenter does not consider the effect of other variables,
wrong conclusions can be drawn.
Example 1: If the length of a boy’s trousers is plotted against their reading level, it may
appear that the longer the trousers, the better they can read. In reality the height increase
has largely been brought about with age increase and so has the reading level, so a child’s
age not being considered can cause misconceptions. Age is the lurking variable. Just
because there is a relationship between two things does not mean that one is the cause of
the other.
[Figure: Reading ability and trouser length. Reading level (0 to 10) plotted against trouser length (0 to 120 cm)]
A conclusion that you can improve your reading ability by lengthening your trousers
would be a sad consequence of this!!
Example 2: The following graph would lead to the conclusion that there is no relationship
between moisture and wheat production.
[Figure: Wheat production and moisture. Yield of wheat (4.5 to 5.9 tonnes/ha) plotted against moisture (2 to 3.2 mm/day)]
However, when the lurking variable temperature is considered, the results can be
interpreted differently.
[Figure: Moisture effect on wheat yield. Wheat yield (4.5 to 5.9 tonnes/ha) plotted against moisture (2 to 3.2 mm/day), with separate series for Sunny, Mild and Cold]
Example 3: Consider a new treatment being applied to a group of randomly
selected patients. Results for the group are:
All patients
                       Improved:    Not improved:    Percentage improved:
New treatment:         20           20               50%
Standard treatment:    24           16               60%
TOTAL                  44           36               55%
It appears that the new treatment is not as effective as the standard treatment (60%
improved using the standard treatment and only 50% improved using the new treatment).
However when this group is broken down by gender, the following is found:
Male patients
                       Improved:    Not improved:    Percentage improved:
New treatment:         12           18               40%
Standard treatment:    3            7                30%
TOTAL                  15           25               38%
Female patients
                       Improved:    Not improved:    Percentage improved:
New treatment:         8            2                80%
Standard treatment:    21           9                70%
TOTAL                  29           11               73%
It is clear that the new treatment has had a higher success rate for both males and
females.
The problem here is that females have a much higher rate of improvement than males,
and most females were given the standard treatment. Gender is a lurking variable and
the gender imbalance has caused a problem that the statistical analysis could not pick
up.
This is called Simpson’s Paradox.
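The reversal can be verified by recomputing the percentages from the counts. The sketch below uses the figures from the three tables; the dictionary layout and function names are just one convenient way to hold them.

```python
# (improved, not improved) counts from the male and female tables.
results = {
    "Male": {"New": (12, 18), "Standard": (3, 7)},
    "Female": {"New": (8, 2), "Standard": (21, 9)},
}


def percent_improved(improved, not_improved):
    """Percentage of patients who improved."""
    return 100 * improved / (improved + not_improved)


def combined(treatment):
    """Add the male and female counts together for one treatment."""
    improved = sum(results[gender][treatment][0] for gender in results)
    not_improved = sum(results[gender][treatment][1] for gender in results)
    return improved, not_improved


for gender, treatments in results.items():
    for treatment, counts in treatments.items():
        print(f"{gender:8s}{treatment:10s}{percent_improved(*counts):.0f}%")

for treatment in ("New", "Standard"):
    print(f"{'All':8s}{treatment:10s}{percent_improved(*combined(treatment)):.0f}%")
```

The new treatment is ahead within each gender but behind once the groups are added together, because the two treatments were given to very different mixes of males and females.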