Unit 3: Analysis of scientific data and information
Topic guide 3.3: Processing data using statistics

Link
Learners unfamiliar with basic statistics may wish to study F/502/5547: Unit 8: Using Statistics for Science from Edexcel BTEC Level 3 in Applied Science.

Processing data using statistics
Collected data possesses mathematical quantities that allow us to make qualitative judgements. Quantities calculated from samples are known as statistics. Some are determined directly from the data, whereas others are found by comparing the data to known mathematical patterns.
On successful completion of this topic you will:
• be able to process data using statistics (LO3).

To achieve a Pass in this unit you need to show that you can:
• perform descriptive statistics on a sample of continuous data (3.1)
• demonstrate the nature of normal distributions using a sample of continuous scientific data (3.2)
• carry out hypothesis testing using standard statistical tests and draw conclusions (3.3).
1 Descriptive statistics
Table 3.3.1: Results of a survey of plant stem diameters (mm).

1.6  1.0  1.4  1.2  2.0  1.4  1.6
1.4  1.6  1.8  1.6  1.6  1.6  1.2
1.4  2.0  1.8  1.0  2.0  2.0  1.8
1.6  1.6  1.4  1.2  1.0  1.2  1.4
Key terms
Mean: The arithmetic mean is equal
to the sum of all values divided by the
total number of values in the set. Also
known as the average.
Median: The value that lies precisely
50% of the way through the ranked
data set.
Mode: The most common value(s)
in the set (also known as the modal
value).
Standard deviation: A measure
of how dispersed the set values are
about the mean of the set.
Coefficient of variation: The ratio
of the standard deviation to the
mean.
Table 3.3.2: Notation used to
distinguish between whole populations
and samples of populations when
considering statistical variation.
Raw data can be analysed descriptively using mathematical steps to produce
figures that provide an insight into the nature of the data set. Measures of central
tendency are indicators as to how values in a data set cluster about a particular
value. The three most common are:
• mean – also known as the average, the arithmetic mean is equal to the sum of all values divided by the total number of values in the set
• median – the value that lies precisely 50% of the way through the ranked data set
• mode – the most common value(s) in the set (also known as the modal value).
Example
Determine the mean, median and mode of the plant stem diameters, in mm, measured in a survey (results shown in Table 3.3.1).

Mean = (1.6 + 1.0 + 1.4 + …) / 28 = 1.514… ≈ 1.5
Median = 1.6
Mode = 1.6
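These three measures can be checked quickly in software. Below is a minimal Python sketch (an illustration, not part of the original worked example) using only the standard-library statistics module:

```python
from statistics import mean, median, mode

# Plant stem diameters in mm (Table 3.3.1)
diameters = [1.6, 1.0, 1.4, 1.2, 2.0, 1.4, 1.6,
             1.4, 1.6, 1.8, 1.6, 1.6, 1.6, 1.2,
             1.4, 2.0, 1.8, 1.0, 2.0, 2.0, 1.8,
             1.6, 1.6, 1.4, 1.2, 1.0, 1.2, 1.4]

print(round(mean(diameters), 1))   # 1.5
print(median(diameters))           # 1.6
print(mode(diameters))             # 1.6
```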
It should be noted that a single measure of central tendency provides a limited
amount of information by itself. For example, the mean in the example above
tells us nothing about the maximum or minimum diameters recorded, the
spread in the results or whether more diameters larger or smaller than the mean
were noted.
To examine the nature of the values in a data set further, we must consider what
are known as measures of dispersion. These are figures that indicate how the
values are distributed in the set. Two common measures are:
• standard deviation
• coefficient of variation.
The two are directly related – standard deviation is a measure of how dispersed
the set values are about the mean of the set and the coefficient of variation is the
ratio of the standard deviation to the mean. A note concerning notation needs
to be observed, however, before examining the formulae for these measures.
Data may be derived from a population (all possible values) or from a sample of
the population (the latter being the most common in scientific experiments). To
provide the reader of statistical analyses with a means to differentiate between the
two, the following notation is used (see Table 3.3.2):
Quantity                   Sample   Population
value                      x        X
mean                       x̄        μ
standard deviation         sₙ       σ
coefficient of variation   cv       cv
There are several formulae for standard deviation but we shall initially consider just the following, based on a sample:

sₙ = √( Σf(x − x̄)² / Σf )

where f is the frequency of each value x, and x̄ is the mean of the set of values. Should there be a frequency of just one for each value x, then the formula becomes:

sₙ = √( Σ(x − x̄)² / n )

where n is the sample size.
A limitation of this formula is that, if this standard deviation of the sample is used as an estimate of the population's standard deviation, it will produce a biased (in this case, too low) estimate for sample sizes smaller than 50 or so. A version of the formula for such samples is given by:

s = √( Σ(x − x̄)² / (n − 1) )

where the corrected divisor (n − 1) helps to reduce the effect of the bias. Given that for large values of n (e.g. 500), n − 1 ≈ n, you can simply use this formula for all samples.
The coefficient of variation is given by the formula:

cv = sₙ / x̄
Example
Using the data from the previous example, determine the standard deviation and coefficient of variation of the values shown below (Table 3.3.3).

Table 3.3.3: Results of a survey of plant stem diameters grouped according to the frequency of each value, allowing the standard deviation and coefficient of variation to be calculated.

x     f   x − x̄       (x − x̄)²   f(x − x̄)²
1.0   3   −0.514286   0.264490   0.793469
1.2   4   −0.314286   0.098776   0.395102
1.4   6   −0.114286   0.013061   0.078367
1.6   8   0.085714    0.007347   0.058776
1.8   3   0.285714    0.081633   0.244898
2.0   4   0.485714    0.235918   0.943673

Σf = 28
Σf(x − x̄)² = 2.514286
sₙ = √(2.514286 / 28) = 0.299660
cv = 0.299660 / 1.514286 = 0.197889
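As a check, here is a short Python sketch (illustrative only) that expands the frequency table back into raw values and reproduces these figures with the standard-library statistics module; pstdev divides by n, matching the sₙ formula above:

```python
from statistics import mean, pstdev

# Table 3.3.3 expanded back into individual values (each value repeated f times)
freq = {1.0: 3, 1.2: 4, 1.4: 6, 1.6: 8, 1.8: 3, 2.0: 4}
values = [x for x, f in freq.items() for _ in range(f)]

s_n = pstdev(values)        # divides by n, matching the s_n formula
cv = s_n / mean(values)
print(round(s_n, 6), round(cv, 6))   # 0.29966 0.197889
```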
When working with grouped data, we can only calculate estimates of the
measures of central tendency and dispersion. This is because the exact values of
all the data points are unknown; only the number of entries in each group has
been measured.
The mean is calculated using the formula:

x̄ = Σfx / Σf
where x is the mid-point of each group. The mode is simply the group with the
highest frequency. However, the calculation required to estimate the median is
a little more involved than finding the mean or mode. One method is to use a
relative cumulative frequency curve (also known as an ogive) as described earlier
in Figure 3.1.2; this is a plot of the upper class boundaries for each group against
the relative cumulative frequencies. If we are grouping data to make the analysis
easier to process, we must ensure that the group boundaries do not overlap
with one another.
Example
Determine the mean, median and mode for the grouped data shown in Table 3.3.4.

Table 3.3.4: A sample of grouped data, from which the mean, median and mode can be determined.

Mass (g)      f
30 ≤ m < 40   12
40 ≤ m < 50   27
50 ≤ m < 60   19
60 ≤ m < 70   2

The modal class is clearly 40 ≤ m < 50. Table 3.3.4 can be expanded to show mid-points, upper class boundaries, etc. (see Table 3.3.5):
Table 3.3.5: A sample of grouped data, expanded to show mid-points, upper class boundaries and cumulative relative frequencies.

Mass (g)      f    Mid-point   U.C.B.   Relative f   Cumulative relative f
30 ≤ m < 40   12   35          40       20%          20%
40 ≤ m < 50   27   45          50       45%          65%
50 ≤ m < 60   19   55          60       31.7%        96.7%
60 ≤ m < 70   2    65          70       3.3%         100%
The relative cumulative frequency graph for this data is shown in Figure 3.3.1.

Figure 3.3.1: Relative cumulative frequency graph corresponding to the data in Table 3.3.5, showing the estimated median mass. [Plot of relative cumulative frequency (%) against mass (g).]
From Figure 3.3.1, the estimated value for the median is observed to be
approximately 46.5 g.
Calculations of the standard deviation and coefficient of variation are carried out
using the mid-point values of the groups.
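The grouped-data estimates above can likewise be sketched in a few lines of Python (an illustration using the same class boundaries; the median is found by linear interpolation within the class containing the 50% point, mirroring the ogive reading):

```python
# Grouped data from Table 3.3.4: (lower bound, upper bound, frequency)
groups = [(30, 40, 12), (40, 50, 27), (50, 60, 19), (60, 70, 2)]

n = sum(f for _, _, f in groups)

# Estimated mean from class mid-points: x_bar = sum(f * midpoint) / sum(f)
mean = sum((lo + hi) / 2 * f for lo, hi, f in groups) / n

# Estimated median: interpolate within the class holding the 50% point
cum = 0
for lo, hi, f in groups:
    if cum + f >= n / 2:
        median = lo + (n / 2 - cum) / f * (hi - lo)
        break
    cum += f

print(round(mean, 2), round(median, 2))  # 46.83 46.67
```

The interpolated median of about 46.7 g agrees with the graphical estimate of approximately 46.5 g to within the reading precision of the ogive.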
It is preferable to perform such tasks using technology with all but the smallest of
data sets, since manual calculations present too many opportunities for errors and
checking for these is time-consuming. Software such as Microsoft® Excel® contains
several functions to perform some of these tasks (see Table 3.3.6):
Table 3.3.6: Common statistical
functions in Microsoft® Excel®.
Link
The material covered in this section
is prerequisite knowledge for Unit 10:
Statistics for experimental design.
Checklist
At the end of this section you should be able to calculate the following descriptive statistics:
☐ mean, mode, median
☐ standard deviation
☐ coefficient of variation.
Statistic            Function
mean                 =average()
median               =median()
mode                 =mode()
standard deviation   =stdev() (sample) and =stdevp() (population)
Although the procedures for determining the measures of central tendency and dispersion of a data set are straightforward, the calculated figures are not open to interpretation (assuming no mistakes have been made) – they are quantities directly extracted from the data. However, conclusions drawn from these figures require care. For example, whereas the standard deviation must be viewed in context with the mean, the coefficient of variation already accounts for this; but the value of the latter is sensitive to small changes in the mean when the mean is close to zero.
Example
Two samples, A and B, are recorded and their respective standard deviations
are both found to be 1.38. The mean of A is 0.015 and B is 0.011; what are their
coefficients of variation?
Key terms
Statistical inference: The process
of drawing conclusions from data
subject to random variation.
Distribution: A function or
graphical representation thereof
showing the frequency with which
values in a data set tend to show
different degrees of deviation from
the mean.
cvA = 1.38 / 0.015 = 92
cvB = 1.38 / 0.011 = 125.45
Sample B has a coefficient of variation 36% larger than that of A, even though its
mean is only 27% smaller.
A large or small measure of central tendency or dispersion is not necessarily ‘good’
or ‘bad’ – they simply provide reference points from which other statistics can be
construed (known as statistical inference).
2 Normal distributions
If you performed an investigation, such as recording the surface area of every
leaf from a tree or the heat energy released in many repeats of the same standard
thermite reaction, the measured values would typically accumulate to produce a
distribution of values as shown in Figure 3.3.2. The frequency of the measured
values is displayed on the y-axis, with the values themselves on the x-axis.
As covered in the previous section of this unit, you can descriptively note the
central tendency and dispersion.
Figure 3.3.2: A normal distribution, with the measured values on the x-axis and their frequency on the y-axis. The mean μ lies at the centre, with σ marking one standard deviation either side.
Such distributions of data are very common in scientific investigations and they
are called normal distributions. The key features of a normal distribution are:
• the mean, median and mode all have the same value
• the distribution is symmetrical about the mean
• approximately 68% of the data set lies within one standard deviation either side of the mean.
Key terms
Normal distribution: A commonly
observed distribution in data sets
in which the mean, median and
mode all have the same value, the
distribution is symmetrical about
the mean and approximately 68% of
the data set lies within one standard
deviation either side of the mean.
Quantile: Any regular division of the
cumulative frequency distribution
of a data set. If a data set is divided
into 100 quantiles, they are known as
percentiles.
Z-score: The number of standard
deviations a value is placed above
the mean.
There are other well-known distributions that commonly appear in science (e.g.
beta distribution, chi-squared distribution, Student’s t-distribution). Although the
normal distribution appears frequently, you cannot assume that any collected data
follows this pattern and there are various tests to assess how closely a set of data
matches a normal distribution (known as normality testing). Statistical software
packages, such as SPSS® or MATLAB®, can quickly analyse data to assess normality
but, if such bespoke applications are not available, then a reasonably robust
method can be performed in spreadsheet programs such as Microsoft® Excel®: a
normal quantile plot. Normality can be tested in three steps as shown below, with
an example presented in Table 3.3.7.
1 The first step is to produce a single ranked list of the data, starting with the
smallest value.
2 The next step is to determine which cumulative proportion (also known as
the quantile) each data point would have, based on its rank. To do this, use
the rank function in one column and then the count function in another
to calculate the quantile. The inverse normal function can then be used to
produce the theoretical z-scores for these quantiles.
3 Finally, the data points are plotted against the z-scores; the closer the plotted
points are to a straight line, the closer the data set is to a normal distribution.
Example
The distribution of the data set shown in Table 3.3.7 is unknown, but can be
established by ranking, assigning quantiles and calculating z-scores that
measure the number of standard deviations the value is above the mean (hence
the lowest values have negative z-scores). The calculations can be performed
automatically in Microsoft® Excel® (Table 3.3.8) and plotting values against
z-scores shows that the data are not normally distributed, because the lower
values of x fall below the line (see Figure 3.3.3).
Table 3.3.7: Data set to be tested for normal distribution by ranking, quantile assignment and z-score calculation.

x      Rank   Quantile   z-score
0.70   1      0.03       −1.8339
0.77   2      0.10       −1.2816
0.98   3      0.17       −0.9674
1.12   4      0.23       −0.7279
1.56   5      0.30       −0.5244
1.80   6      0.37       −0.3407
2.00   7      0.43       −0.1679
2.00   7      0.43       −0.1679
2.50   9      0.57       0.1679
2.90   10     0.63       0.3407
3.00   11     0.70       0.5244
3.50   12     0.77       0.7279
4.10   13     0.83       0.9674
5.00   14     0.90       1.2816
6.02   15     0.97       1.8339
Table 3.3.8: Calculation of ranking, quantile and z-scores in Microsoft® Excel®. (The ranges shown assume the 15 data values occupy cells A2:A16.)

     A     B                         C                              D
1    x     Rank                      Quantile                       Z-score
2    0.7   =RANK(A2,$A$2:$A$16,1)    =(B2-0.5)/COUNT($A$2:$A$16)    =NORMSINV(C2)
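If a spreadsheet is not available, the same three steps can be sketched in Python. This is an illustration assuming SciPy is installed (scipy.stats.norm.ppf is the inverse normal function); note it gives tied values sequential ranks rather than the tied ranks shown in Table 3.3.7:

```python
from scipy.stats import norm

# Step 1: rank the data, smallest first
data = sorted([0.70, 0.77, 0.98, 1.12, 1.56, 1.80, 2.00, 2.00,
               2.50, 2.90, 3.00, 3.50, 4.10, 5.00, 6.02])
n = len(data)

# Step 2: cumulative proportion (quantile) for each rank, then the
# theoretical z-score from the inverse normal function
# (ties get sequential ranks here, a simplification of Excel's RANK)
quantiles = [(rank - 0.5) / n for rank in range(1, n + 1)]
z_scores = [norm.ppf(q) for q in quantiles]

# Step 3: plot data against z-scores; points close to a straight line
# indicate an approximately normal distribution
for x, z in zip(data, z_scores):
    print(f"{x:5.2f}  {z:+.4f}")
```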
Figure 3.3.3: Plot of values against z-scores for the data set shown in Table 3.3.7. A normal distribution would be shown by a straight line. The data are not normally distributed, because the lower values of x fall below the line. [Plot of z-score (−3.0 to 3.0) against x (0 to 6).]

Take it further
There are more detailed approaches to determine normality. Use the Internet to search for the Jarque-Bera and Lilliefors tests.
Standardisation
As discussed above, the z-score (or z-value or standard value) for a data point is
a measure of the number of standard deviations the value is above the mean.
The process of converting data points into z-values is called standardisation
(sometimes also called normalisation) and is done using the formula:
z = (x − x̄) / s
It should be noted that the mean itself has a z-score of zero and that a value of x
that is equal to the standard deviation will have a z-score of 1. If you standardise
the normal distribution, it looks like the graph shown in Figure 3.3.4:
Figure 3.3.4: A plot of value frequency against z-score for a normal distribution. [Horizontal axis: z-value from −4 to 4.]
Percentiles

Key terms
Standardised: Data that has been converted into z-scores, showing the number of standard deviations above the mean.
Population: A complete data set.

Table 3.3.9: Relationship between normally distributed data, percentiles and z-scores.

Z-score   Percentage of data between mean and z-score (2 d.p.)   Percentile (2 d.p.)
−∞        50.00%                                                 0.00th
−4        49.99%                                                 0.01th
−3        49.86%                                                 0.14th
−2        47.72%                                                 2.28th
−1        34.13%                                                 15.87th
0         0.00%                                                  50.00th
1         34.13%                                                 84.13th
2         47.72%                                                 97.72th
3         49.86%                                                 99.86th
4         49.99%                                                 99.99th
+∞        50.00%                                                 100th

In a normal distribution, standardised or otherwise, 68.2% of the set of data will lie within one standard deviation (i.e. a z-score of 1) either side of the mean; 95.4% will lie within two standard deviations (Table 3.3.9).
This property of the distribution is extremely useful: statistical hypotheses can be tested on the basis that values with z-scores beyond ±3 are rare. A natural process that follows a normal distribution can be expected to produce values within four standard deviations of the mean 99.99% of the time the population is sampled.
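These percentages can be regenerated from the cumulative normal distribution; the following Python sketch (illustrative, assuming SciPy is available) reproduces the body of Table 3.3.9 and the 68%/95% rule:

```python
from scipy.stats import norm

# Percentage of data between the mean (z = 0) and each z-score,
# and the percentile each z-score sits at (cf. Table 3.3.9)
for z in range(-4, 5):
    between = abs(norm.cdf(z) - 0.5) * 100    # % between mean and z
    percentile = norm.cdf(z) * 100            # percentile of z
    print(f"z = {z:+d}: {between:6.2f}%  {percentile:6.2f}th")
    # (Table 3.3.9 truncates 49.997% to 49.99% at z = ±4)

# Proportions either side of the mean
print(norm.cdf(1) - norm.cdf(-1))   # ~0.6827 (68.2% within 1 s.d.)
print(norm.cdf(2) - norm.cdf(-2))   # ~0.9545 (95.4% within 2 s.d.)
```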
Link
The material covered in this section is prerequisite knowledge for Unit 10: Statistics for experimental design, where the concepts of samples of populations, standard errors and confidence limits are examined to an appropriate level.

Checklist
At the end of this section you should be able to:
☐ demonstrate the nature of normal distributions using a sample of continuous scientific data by conducting a normality test such as a normal quantile plot.
3 Statistical testing
The logical step forward from examining one set of data or the results from one experiment is to compare them against other sets of information – for example, testing whether a given catalyst genuinely increases the rate of a chemical reaction for all substances, or whether a newly created anti-inflammatory drug is more effective than currently used drugs.
Hypothesis testing
Key terms
Hypothesis: A proposed
explanation for an observation, which
can be tested using the scientific
method.
Null hypothesis: A general or
default position that there is no
relationship between two measured
phenomena.
Alternative hypothesis: A rival to a null hypothesis, typically postulating that there is a statistical relationship between two phenomena, which is assessed by applying a statistical hypothesis test.
P-value: A measurement of the
probability of observing a given value
in a data set assuming that the null
hypothesis is true.
Statistical hypothesis testing involves an assessment of scientific data or
information in such a way as to judge whether or not any observed patterns are
present purely by chance, as measured against a pre-determined probability
limit. This is most commonly done using the so-called null hypothesis (denoted
mathematically by H0) – a statement that there is no relationship between two
recorded events or that a tested item does not affect the bodies that it has been
applied to. The alternative hypothesis (H1) is effectively an opposing statement:
that there is a relationship present.
The default position in hypothesis testing is that the null hypothesis is the one that
should be accepted; this means that the values and patterns found in the sets of
observed data have arisen purely by chance. To begin the analysis, a statistic of the
data, such as the mean or the frequency of a specific category, needs to be chosen
and calculated. Then the appropriate distribution for this statistic needs to be
selected; for example, would calculating the means from numerous samples result
in a normal distribution?
Next, you calculate the probability (known as the p-value) that a number at least as large/small as the tested statistic would appear in the selected distribution. For example, the means of samples often have normal distributions; let us suppose that the distribution of a collection of 1000 samples has a mean of 10 and a standard deviation of 2. Another sample is taken and its mean is found to be 13.5; in the normal distribution, this would have a z-score of +1.75 and the resulting p-value would be about 0.04. This is quite a low probability, suggesting that such a mean is unlikely to occur by chance.
But how do we really know this? We cannot know for certain, but we can make a decision: if the p-value is smaller than an arbitrarily chosen cut-off point (called the significance level or value, usually denoted with the letter α), then, since it is unlikely that such a statistic would occur by chance alone, it is unlikely that the null hypothesis is true, and so we must reject it in favour of the alternative hypothesis on the basis of this evidence. Thus, if we had chosen a value of 5% or 0.05, we are effectively saying that any test statistic with a probability below 5% of arising by chance is deemed unlikely to have appeared by chance alone.

Key term
Significance level/value: An arbitrarily chosen cut-off point (α); observed values whose p-values lie above it are deemed likely to have occurred by chance rather than being statistically significant.
Note that in either case, the null hypothesis cannot be completely proven or
disproven – collected data provides evidence with which to make judgements, not
certainties. No statistical test can say whether either of the hypotheses themselves
is actually true or false.
Displaying hypotheses mathematically
The previous statements about the null hypothesis, p-value and significance level
can be summarised more conveniently using mathematical notation. Using the
examples stated, we would write the following:
population: μ = 10; σ = 2
sample: x̄ = 13.5
H0: μ = 10, stating that the mean of the sample should be about 10
H1: μ > 10, stating that the mean of the sample should be more than 10
z-score: z = +1.75
p-value: p = 0.04, the probability of a value in a normal distribution having a z-score equal to or greater than +1.75
significance level: α = 0.05

Since p < α, the sample suggests that it is unlikely that the observed statistic occurred just by chance, and H0 should be rejected in favour of H1.
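The same calculation can be written in a few lines of Python (a sketch assuming SciPy; norm.cdf is the standard normal cumulative distribution function):

```python
from scipy.stats import norm

mu, sigma = 10, 2      # stated distribution of sample means
x_bar = 13.5           # newly observed sample mean

z = (x_bar - mu) / sigma    # standardise: z = +1.75
p = 1 - norm.cdf(z)         # one-tailed p-value: P(Z >= 1.75), about 0.04

alpha = 0.05
print(f"z = {z:+.2f}, p = {p:.3f}, reject H0: {p < alpha}")
```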
Significance levels
Strictly speaking, the significance level of a hypothesis test is the probability that
the null hypothesis is incorrectly rejected in favour of the alternative one. The
smaller the level, the greater the evidence the data needs to present in order to
reject the null hypothesis. Typical levels are 5% and 1%, but there are no set rules
for determining what level to use and, even if H0 is rejected at one level, there will
always be a lower level where H0 cannot be rejected.
Although arbitrary, the level needs to be chosen with care. You might believe it is best always to choose a very low value (e.g. 0.01%) so as to be 'statistically certain' that the null hypothesis is correctly rejected or not rejected; but, in doing so, the p-value may always turn out to be larger than the significance level, no matter how the experiment is conducted. The subjective nature of selecting the significance value of a hypothesis test has been a matter of controversy among academics for some time, and it has been argued that more emphasis should be placed on the p-value itself, rather than the significance level.
One-tailed, two-tailed tests
Another matter of subjectivity is the choice of uniformity or direction in the null hypothesis: namely, whether the test is one-tailed or two-tailed. Originally referring to the extreme ends of the normal distribution curve, the terms are now used to identify whether you are testing for a difference in a specific direction between the alternative and null hypothesis parameters (one tail) or for any difference at all (two tails).
Example
Table 3.3.10: Comparison of one-tailed and two-tailed tests for testing specific or general differences between the null and alternative hypotheses.

One-tailed test: H0: μ = 12; H1: μ > 12. The null hypothesis (H0) is that the mean (μ) is 12; the alternative (H1) is that the mean is greater than 12.

Two-tailed test: H0: μ = 12; H1: μ ≠ 12. The null hypothesis is that the mean is 12; the alternative is that the mean is any value other than 12, greater or smaller.

Link
The scope of statistical testing in this unit is limited; it is examined in more detail in Unit 10: Statistics for experimental design.
The choice of test is important because it affects the significance level used: essentially, you use half the chosen significance level in each tail of a two-tailed test, i.e. if it is set at 10%, the data is examined at a 5% level in each direction. For example, in the introduction to hypothesis testing we performed a one-tailed test of the mean. Had we considered whether the mean could be greater or smaller than 10, the p-value of 0.04 would have been greater than half of the chosen 5% significance level, meaning that we could not reject H0 on that evidence.
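A sketch of the distinction (again assuming SciPy); doubling the one-tailed p-value is equivalent to halving the significance level in each tail:

```python
from scipy.stats import norm

z = 1.75                          # z-score from the earlier example
p_one = 1 - norm.cdf(z)           # one-tailed: P(Z >= 1.75), about 0.040
p_two = 2 * (1 - norm.cdf(z))     # two-tailed: P(|Z| >= 1.75), about 0.080

alpha = 0.05
print(p_one < alpha)   # True  -> reject H0 in the one-tailed test
print(p_two < alpha)   # False -> cannot reject H0 in the two-tailed test
```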
For this particular study, two common statistical tests will be examined, which test
for any statistical association between sets of data.
Pearson’s chi-squared test for independence
The chi-squared (chi refers to the Greek letter χ and is pronounced 'ki', rhyming with 'pie') statistical test comes in many forms, but all involve testing the distribution of the values in the data set against the mathematical chi-squared distribution. Pearson's test for independence is used to test data from two categoric variables in the same population to assess whether it is statistically likely that the two are dependent on each other. The null hypothesis in such cases is that they are not dependent, i.e. they are independent of each other.
Such data is displayed in what is called a contingency table (m rows × n columns), although it does not matter which variable is listed in the rows or columns. The Pearson test statistic (the chi-squared value) is calculated using the formula:
χ² = Σ (Oi − Ei)² / Ei

where the sum runs over all n cells of the table, Oi is the observed frequency for each cell i and Ei is the expected frequency, based on the null hypothesis being true. The expected value for each cell in the table is calculated using the formula:

Ei = (nr × nc) / nT

where nr and nc are the totals of each row and column respectively and nT is the total value of the complete table. Once calculated, the chi-squared value is then used with an appropriate data table (e.g. http://www.medcalc.org/manual/chi-square-table.php) or technology (such as an online calculator: http://danielsoper.com/statcalc3/calc.aspx?id=11); two more pieces of information are required before obtaining the final answer:
• degrees of freedom in the data set – df or ν
• significance level.
The former has a precise mathematical definition but, for this study, it should suffice that it is a measure of how many of the values used to determine the chi-squared statistic are free to vary. For tests of independence in contingency tables:

df = (m − 1) × (n − 1)

where m and n are the number of rows and columns in the respective table.
Once all of the elements for determining the final answer have been identified, it is then a case of using the table or technology to produce the probability (usually called the p-value) of randomly achieving a test statistic at least as high as the one calculated. Should this probability be less than the stated significance level, we reject the null hypothesis in favour of the alternative hypothesis, i.e. there is enough statistical evidence in the data, at that level of significance, to say that the two variables are not independent of each other.
Example
A biological field survey of two species of beetle examined the number of
sightings in three different habitats (Table 3.3.11). The null hypothesis is that the
type of habitat is not statistically significant, i.e. there is no dependence between
habitat type and the number of beetles in each species. The significance level for
this hypothesis testing was chosen to be 5% (0.05). The expected values are shown
in Table 3.3.12. Note that, although the expected frequencies often turn out to be
non-integers, categoric variables only produce integer data.
Table 3.3.11: Frequency of beetle sightings in three different habitats, considering two different beetle species.

Habitat   Beetle A   Beetle B   Totals
grassy    140        104        244
sandy     105        55         160
wet       350        275        625
Totals    595        434        1029
Table 3.3.12: Expected values for the distribution of beetles in three different habitats if the null hypothesis is true and there is no relationship between the abundance of each species and habitat type. The values have been rounded to one decimal place.

Habitat   Beetle A                             Beetle B
grassy    (244 × 595) / 1029 = 141.088 ≈ 141.1   102.9
sandy     92.5                                 67.5
wet       361.4                                263.6
The Pearson chi-squared statistic:

χ² = Σ (Oi − Ei)²/Ei = (140 − 141.1)²/141.1 + (104 − 102.9)²/102.9 + (105 − 92.5)²/92.5 + (55 − 67.5)²/67.5 + …

χ² = 4.86

df = (3 − 1) × (2 − 1) = 2
The Pearson chi-squared statistic for the observed data with 2 degrees of freedom is 4.86. The critical chi-squared value at the 5% significance level for 2 degrees of freedom is 5.99; since 4.86 is below this (equivalently, the p-value of about 0.09 is larger than the stated significance level of 0.05), the null hypothesis cannot be rejected in favour of the alternative hypothesis. The data provides no statistically significant evidence that the abundance of the two species of beetle depends on the type of habitat.
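As a check, SciPy performs the whole test in one call (a sketch; scipy.stats.chi2_contingency returns the statistic, the p-value, the degrees of freedom and the table of expected frequencies):

```python
from scipy.stats import chi2_contingency

# Observed sightings from Table 3.3.11 (rows = habitats, columns = species)
observed = [[140, 104],
            [105,  55],
            [350, 275]]

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), dof, round(p, 3))   # 4.86 2 0.088 -> p > 0.05, keep H0
```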
Pearson’s chi-squared test for independence is not appropriate for every scenario
but for ones where the data has been randomly sampled without bias and the
population is much larger than each sample, Pearson’s test is satisfactory.
Pearson’s product moment correlation coefficient
As seen earlier in this unit, two continuous pieces of data can be plotted against each
other to see if there is any linear correlation. The actual size of the correlation between
two variables is expressed by a value called the coefficient of linear correlation (usually
denoted with an r). This value can be determined by using the following formula:
r=
S(x – x)(y – y)
S(x – x)2 S(y – y)2
The outcome of this formula will indicate what level of linear correlation there is in
the data analysed:
r = −1
r = 0
r = +1
perfect negative linear correlation
no linear correlation
perfect positive linear correlation
Example
Earlier in this unit, the impact of incident light on root biomass was used as an
example to demonstrate the use of linear regression. The same data will be used
again to find the product moment correlation coefficient (Table 3.3.13). The data
can be used to determine the values required to calculate the coefficient of linear
correlation (Table 3.3.14) and the coefficient itself (Table 3.3.15).
Table 3.3.13: Root mass of plants after exposure to different amounts of light for the same duration.

Light (dlm)     10     20     30     40     50     60     70
Root mass (g)   0.22   0.40   0.61   0.85   1.20   1.45   1.70
Table 3.3.14: Root mass and light intensity data from Table 3.3.13 used to determine the values needed to calculate the coefficient of linear correlation.

x    y      (x − x̄)   (y − ȳ)
10   0.22   −30       −0.699
20   0.40   −20       −0.519
30   0.61   −10       −0.309
40   0.85   0         −0.069
50   1.20   10        0.281
60   1.45   20        0.531
70   1.70   30        0.781

x̄ = (10 + 20 + 30 + 40 + 50 + 60 + 70) / 7 = 280 / 7 = 40
ȳ = (0.22 + 0.40 + 0.61 + 0.85 + 1.20 + 1.45 + 1.70) / 7 = 6.43 / 7 = 0.9186
Table 3.3.15: Calculation of the coefficient of linear correlation for the root mass and light intensity data from Table 3.3.13.

(x − x̄)(y − ȳ)             (x − x̄)²       (y − ȳ)²
−30 × −0.699 = 20.97       (−30)² = 900   (−0.699)² = 0.489
10.38                      400            0.269
3.09                       100            0.095
0                          0              0.005
2.81                       100            0.079
10.62                      400            0.282
23.43                      900            0.610
Σ(x − x̄)(y − ȳ) = 71.30    Σ(x − x̄)² = 2800    Σ(y − ȳ)² = 1.829
Therefore, the value of r is:

r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² × Σ(y − ȳ)² ) = 71.30 / √(2800 × 1.829) = 0.996
This value is close to +1, which suggests that there is a strong positive correlation
in the data produced by the two variables.
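The whole calculation condenses to a few lines of standard-library Python (an illustrative sketch; the variable names are not from the original):

```python
from math import sqrt

light = [10, 20, 30, 40, 50, 60, 70]                # x (Table 3.3.13)
mass = [0.22, 0.40, 0.61, 0.85, 1.20, 1.45, 1.70]   # y

n = len(light)
x_bar, y_bar = sum(light) / n, sum(mass) / n

# Sums of products of deviations, as in the formula for r
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(light, mass))
sxx = sum((x - x_bar) ** 2 for x in light)
syy = sum((y - y_bar) ** 2 for y in mass)

r = sxy / sqrt(sxx * syy)
print(round(r, 3))   # 0.996
```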
Use of the product moment correlation coefficient is very common and
meaningful, provided that the sample size, means and standard deviations
are reliable; however, the coefficient is particularly sensitive to data sets
containing values that are considered to be outlying, that is, values that are
notably different from the rest. Such values may be due to errors but they may
also be due to the process being investigated, and this is something the coefficient
cannot account for.
Technology is preferable for determining the coefficient, for obvious reasons, but not all applications state the value directly; for example, the trendline feature in Microsoft® Excel® displays r² rather than r (although the =correl() function returns r itself).
Checklist
At the end of this topic guide you should be able to carry out hypothesis testing using standard statistical tests and draw conclusions by using:
☐ Pearson’s chi-squared test for independence
☐ Pearson’s product moment correlation coefficient.

Link
The material covered in this section is prerequisite knowledge for Unit 10: Statistics for experimental design, where it is explored in considerably more depth and detail.
Further reading
Use the Internet to search for information about other statistical tests such as the z-, F- and t-tests. There are many textbooks available on the subject and the following list is no more than suggested reading:

Boslaugh, S. (2012) Statistics in a Nutshell, O’Reilly Media
Currell, G. and Dowman, A. (2009) Essential Mathematics and Statistics for Science, Wiley
Miller, J. and Miller, J. (2010) Statistics and Chemometrics for Analytical Chemistry, Prentice Hall
Samuels, M. et al. (2010) Statistics for the Life Sciences, Pearson
Van Emden, H. (2008) Statistics for Terrified Biologists, Wiley-Blackwell
Acknowledgements
The publisher would like to thank the following for their kind permission to reproduce their
photographs:
Corbis: Radius Images
Every effort has been made to trace the copyright holders and we apologise in advance for any
unintentional omissions. We would be pleased to insert the appropriate acknowledgement in any
subsequent edition of this publication.