Unit 3: Analysis of scientific data and information

3.3 Processing data using statistics

Link
Learners unfamiliar with basic statistics may wish to study F/502/5547: Unit 8: Using Statistics for Science from Edexcel BTEC Level 3 in Applied Science.

Collected data possesses mathematical properties that allow us to make qualitative judgements about it. Quantities calculated from samples are known as statistics. Some are determined directly from the data, whereas others are found by comparing the data to known mathematical patterns.

On successful completion of this topic you will:
• be able to process data using statistics (LO3).

To achieve a Pass in this unit you need to show that you can:
• perform descriptive statistics on a sample of continuous data (3.1)
• demonstrate the nature of normal distributions using a sample of continuous scientific data (3.2)
• carry out hypothesis testing using standard statistical tests and draw conclusions (3.3).

1 Descriptive statistics

Key terms
Mean: The arithmetic mean is equal to the sum of all values divided by the total number of values in the set. Also known as the average.
Median: The value that lies precisely 50% of the way through the ranked data set.
Mode: The most common value(s) in the set (also known as the modal value).
Standard deviation: A measure of how dispersed the set values are about the mean of the set.
Coefficient of variation: The ratio of the standard deviation to the mean.

Raw data can be analysed descriptively using mathematical steps to produce figures that provide an insight into the nature of the data set. Measures of central tendency are indicators of how values in a data set cluster about a particular value. The three most common are:
• mean – also known as the average, the arithmetic mean is equal to the sum of all values divided by the total number of values in the set
• median – the value that lies precisely 50% of the way through the ranked data set
• mode – the most common value(s) in the set (also known as the modal value).

Example
Determine the mean, median and mode of the plant stem diameters, in mm, measured in a survey (results shown in Table 3.3.1).

Table 3.3.1: Results of a survey of plant stem diameters (mm).
1.6  1.0  1.4  1.2  2.0  1.4  1.6
1.4  1.6  1.8  1.6  1.6  1.6  1.2
1.4  2.0  1.8  1.0  2.0  2.0  1.8
1.6  1.6  1.4  1.2  1.0  1.2  1.4

Mean = (1.6 + 1.0 + 1.4 + ...)/28 = 1.514... ≈ 1.5
Median = 1.6
Mode = 1.6

It should be noted that a single measure of central tendency provides a limited amount of information by itself. For example, the mean in the example above tells us nothing about the maximum or minimum diameters recorded, the spread in the results, or whether more diameters larger or smaller than the mean were noted. A short software check of these figures is given below.
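These figures can be checked quickly in software. The following is a minimal sketch (not from the source) using Python's standard statistics module; any package with equivalent functions would do the same job.

```python
# Minimal sketch: check the worked example with Python's standard library.
from statistics import mean, median, multimode

diameters = [1.6, 1.0, 1.4, 1.2, 2.0, 1.4, 1.6,
             1.4, 1.6, 1.8, 1.6, 1.6, 1.6, 1.2,
             1.4, 2.0, 1.8, 1.0, 2.0, 2.0, 1.8,
             1.6, 1.6, 1.4, 1.2, 1.0, 1.2, 1.4]

print(round(mean(diameters), 3))   # 1.514, reported as 1.5
print(median(diameters))           # 1.6 (average of the 14th and 15th ranked values)
print(multimode(diameters))        # [1.6], the most frequent value
```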
To examine the nature of the values in a data set further, we must consider what are known as measures of dispersion. These are figures that indicate how the values are distributed in the set. Two common measures are:
• standard deviation
• coefficient of variation.

The two are directly related – the standard deviation is a measure of how dispersed the set values are about the mean of the set, and the coefficient of variation is the ratio of the standard deviation to the mean. A note concerning notation needs to be observed, however, before examining the formulae for these measures.

Data may be derived from a population (all possible values) or from a sample of the population (the latter being the most common in scientific experiments). To provide the reader of statistical analyses with a means to differentiate between the two, the following notation is used (see Table 3.3.2):

Table 3.3.2: Notation used to distinguish between whole populations and samples of populations when considering statistical variation.
Quantity                 | Sample | Population
value                    | x      | X
mean                     | x̄      | μ
standard deviation       | s_n    | σ
coefficient of variation | c_v    | c_v

There are several formulae for standard deviation, but we shall initially consider just the following, based on a sample:

s_n = \sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f}}

where f is the frequency of each value x, and x̄ is the mean of the set of values. Should there be a frequency of just one for each value x, then the formula becomes:

s_n = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}

where n is the sample size. A limitation of this formula is that, if this standard deviation of the sample is used as an estimate of the population's standard deviation, it will produce a biased (in this case, too low) estimate for sample sizes smaller than about 50. A version of the formula for such samples is given by:

s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}

where the correction factor (n − 1) helps to reduce the effect of the bias. Given that for large values of n (e.g. 500), n − 1 ≈ n and the two formulae give virtually identical results, this corrected formula can simply be used for all samples.

The coefficient of variation is given by the formula:

c_v = \frac{s_n}{\bar{x}}

Example
Using the data from the previous example (x̄ = 1.514286), determine the standard deviation and coefficient of variation of the values (Table 3.3.3). Note that it is the frequency-weighted sum, Σf(x − x̄)², that enters the formula.

Table 3.3.3: Results of a survey of plant stem diameters grouped according to the frequency of each value, allowing the standard deviation and coefficient of variation to be calculated.
x   | f | x − x̄     | (x − x̄)² | f(x − x̄)²
1.0 | 3 | −0.51429  | 0.264494 | 0.793483
1.2 | 4 | −0.31429  | 0.098776 | 0.395104
1.4 | 6 | −0.11429  | 0.013061 | 0.078367
1.6 | 8 | 0.085714  | 0.007347 | 0.058776
1.8 | 3 | 0.285714  | 0.081633 | 0.244898
2.0 | 4 | 0.485714  | 0.235918 | 0.943673

Σf = 28     Σf(x − x̄)² = 2.514301

s_n = \sqrt{\frac{2.514301}{28}} = 0.2997

c_v = \frac{0.2997}{1.5143} = 0.1979
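The same calculation can be scripted. The following is a minimal sketch (not from the source), assuming NumPy is available; the ddof argument switches between the n and n − 1 divisors discussed above.

```python
# Minimal sketch: standard deviation and coefficient of variation for the
# stem-diameter data, with and without the (n - 1) correction.
import numpy as np

# Rebuild the 28 raw values from the frequency table (Table 3.3.3)
diameters = np.repeat([1.0, 1.2, 1.4, 1.6, 1.8, 2.0], [3, 4, 6, 8, 3, 4])

mean = diameters.mean()
sd_n = diameters.std(ddof=0)    # divisor n      -> 0.2997
sd_n1 = diameters.std(ddof=1)   # divisor n - 1  -> 0.3052
cv = sd_n / mean                # coefficient of variation -> 0.1979

print(round(sd_n, 4), round(sd_n1, 4), round(cv, 4))
```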
When working with grouped data, we can only calculate estimates of the measures of central tendency and dispersion. This is because the exact values of all the data points are unknown; only the number of entries in each group has been counted. The mean is estimated using the formula:

m = \frac{\sum fx}{\sum f}

where x is the mid-point of each group. The mode is simply the group with the highest frequency. However, the calculation required to estimate the median is a little more involved than finding the mean or mode. One method is to use a relative cumulative frequency curve (also known as an ogive), as described earlier in Figure 3.1.2; this is a plot of the upper class boundaries for each group against the relative cumulative frequencies. If we are grouping data to make the analysis easier to process, we must ensure that the group boundaries do not overlap with one another.

Example
Determine the mean, median and mode for the grouped data shown in Table 3.3.4.

Table 3.3.4: A sample of grouped data, from which the mean, median and mode can be estimated.
Mass (g)    | f
30 ≤ m < 40 | 12
40 ≤ m < 50 | 27
50 ≤ m < 60 | 19
60 ≤ m < 70 | 2

The modal class is clearly 40 ≤ m < 50, and the estimated mean, using the group mid-points, is m = (12 × 35 + 27 × 45 + 19 × 55 + 2 × 65)/60 = 2810/60 ≈ 46.8 g. Table 3.3.4 can be expanded to show mid-points, upper class boundaries, etc. (see Table 3.3.5):

Table 3.3.5: A sample of grouped data, expanded to show mid-points, upper class boundaries and cumulative relative frequencies.
Mass (g)    | f  | Mid-point | U.C.B. | Relative f | Cumulative relative f
30 ≤ m < 40 | 12 | 35        | 40     | 20%        | 20%
40 ≤ m < 50 | 27 | 45        | 50     | 45%        | 65%
50 ≤ m < 60 | 19 | 55        | 60     | 31.7%      | 96.7%
60 ≤ m < 70 | 2  | 65        | 70     | 3.3%       | 100%

The cumulative frequency graph for this data is shown in Figure 3.3.1.

[Figure 3.3.1: Relative cumulative frequency graph (mass in g on the x-axis, relative cumulative frequency in % on the y-axis) corresponding to the data in Table 3.3.5, showing the estimated median mass.]

From Figure 3.3.1, the estimated value for the median is observed to be approximately 46.5 g. Calculations of the standard deviation and coefficient of variation are likewise carried out using the mid-point values of the groups.

It is preferable to perform such tasks using technology with all but the smallest of data sets, since manual calculations present too many opportunities for errors, and checking for these is time-consuming. Software such as Microsoft® Excel® contains functions for several of these tasks (see Table 3.3.6):

Table 3.3.6: Common statistical functions in Microsoft® Excel®.
Statistic          | Function
mean               | =AVERAGE()
median             | =MEDIAN()
mode               | =MODE()
standard deviation | =STDEV() and =STDEVP() (sample / population)
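Reading the median from an ogive can also be reproduced numerically, by interpolating linearly within the class that contains the 50% point. The sketch below is an illustration only: the helper function and its interface are ours, not from the source.

```python
# Minimal sketch: estimate the median of grouped data by linear interpolation
# within the class containing the 50% point, mirroring the ogive reading in
# Figure 3.3.1.
def grouped_median(lower_bounds, widths, freqs):
    half = sum(freqs) / 2
    cum = 0
    for lb, w, f in zip(lower_bounds, widths, freqs):
        if cum + f >= half:
            # fraction of this class needed to reach the 50% point
            return lb + (half - cum) / f * w
        cum += f

# Classes 30-40, 40-50, 50-60, 60-70 with frequencies from Table 3.3.4
print(grouped_median([30, 40, 50, 60], [10, 10, 10, 10], [12, 27, 19, 2]))
# -> 46.67, close to the ~46.5 g read off the graph
```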
Although the procedures used to determine the measures of central tendency and dispersion of a data set are routine, the calculated figures themselves are not open to interpretation – assuming no mistakes have been made, they are quantities extracted directly from the data. The conclusions drawn from those figures, however, require care. For example, whereas the standard deviation must be viewed in context with the mean, the coefficient of variation already accounts for this; but the value of the latter is sensitive to small changes in the mean when the mean is close to zero.

Example
Two samples, A and B, are recorded and their respective standard deviations are both found to be 1.38. The mean of A is 0.015 and the mean of B is 0.011; what are their coefficients of variation?

c_v(A) = 1.38/0.015 = 92
c_v(B) = 1.38/0.011 = 125.45

Sample B has a coefficient of variation 36% larger than that of A, even though its mean is only 27% smaller.

A large or small measure of central tendency or dispersion is not necessarily 'good' or 'bad' – such measures simply provide reference points from which other statistics can be construed (a process known as statistical inference).

Key terms
Statistical inference: The process of drawing conclusions from data subject to random variation.
Distribution: A function, or graphical representation thereof, showing the frequency with which values in a data set tend to show different degrees of deviation from the mean.

Link
The material covered in this section is prerequisite knowledge for Unit 10: Statistics for experimental design.

Checklist
At the end of this section you should be able to calculate the following descriptive statistics:
• mean, mode, median
• standard deviation
• coefficient of variation.

2 Normal distributions

If you performed an investigation, such as recording the surface area of every leaf from a tree or the heat energy released in many repeats of the same standard thermite reaction, the measured values would typically accumulate to produce a distribution of values as shown in Figure 3.3.2. The frequency of the measured values is displayed on the y-axis, with the values themselves on the x-axis.

[Figure 3.3.2: A normal distribution, with values on the x-axis and frequency on the y-axis; the mean μ lies at the centre, with the standard deviation σ marked either side of it.]

As covered in the previous section of this unit, you can descriptively note the central tendency and dispersion of such a distribution. Distributions of this shape are very common in scientific investigations and are called normal distributions. The key features of a normal distribution are:
• the mean, median and mode all have the same value
• the distribution is symmetrical about the mean
• approximately 68% of the data set lies within one standard deviation either side of the mean.

Key terms
Normal distribution: A commonly observed distribution in data sets in which the mean, median and mode all have the same value, the distribution is symmetrical about the mean and approximately 68% of the data set lies within one standard deviation either side of the mean.
Quantile: Any regular division of the cumulative frequency distribution of a data set. If a data set is divided into 100 quantiles, they are known as percentiles.
Z-score: The number of standard deviations a value lies above the mean.

There are other well-known distributions that commonly appear in science (e.g. the beta distribution, the chi-squared distribution and Student's t-distribution). Although the normal distribution appears frequently, you cannot assume that any collected data follows this pattern, and there are various tests to assess how closely a set of data matches a normal distribution (known as normality testing). Statistical software packages, such as SPSS® or MATLAB®, can quickly analyse data to assess normality but, if such bespoke applications are not available, a reasonably robust method can be carried out in spreadsheet programs such as Microsoft® Excel®: a normal quantile plot. Normality can be tested in the three steps shown below, with a scripted sketch and a worked example following.
1 Produce a single ranked list of the data, starting with the smallest value.
2 Determine which cumulative proportion (also known as the quantile) each data point would have, based on its rank. To do this, use the rank function in one column and then the count function in another to calculate the quantile. The inverse normal function can then be used to produce the theoretical z-scores for these quantiles.
3 Plot the data points against the z-scores; the closer the plotted points are to a straight line, the closer the data set is to a normal distribution.
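The same three steps can be scripted outside a spreadsheet. The following minimal sketch (not from the source) assumes SciPy is available for the inverse normal function and uses the data of Table 3.3.7 from the example below; note that it assigns tied values successive ranks, whereas Excel's RANK function, used in the example, gives ties the same rank.

```python
# Minimal sketch: the three normal-quantile-plot steps.
import numpy as np
from scipy.stats import norm

data = np.array([2.00, 0.70, 3.50, 1.12, 5.00, 0.98, 2.90,
                 1.56, 4.10, 0.77, 2.00, 1.80, 3.00, 2.50, 6.02])

x = np.sort(data)                                      # step 1: rank the data
quantiles = (np.arange(1, len(x) + 1) - 0.5) / len(x)  # step 2: (rank - 0.5)/n
z = norm.ppf(quantiles)                                # step 2: theoretical z-scores

for value, score in zip(x, z):                         # step 3: plot x against z
    print(f"{value:5.2f}  {score:8.4f}")               # (or draw it with matplotlib)
```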
Example
The distribution of the data set shown in Table 3.3.7 is unknown, but it can be established by ranking, assigning quantiles and calculating z-scores that measure the number of standard deviations each value lies above the mean (hence the lowest values have negative z-scores). The calculations can be performed automatically in Microsoft® Excel® (Table 3.3.8), and plotting the values against their z-scores shows that the data are not normally distributed, because the lower values of x fall below the line (see Figure 3.3.3).

Table 3.3.7: Data set to be tested for normal distribution by ranking, quantile assignment and z-score calculation.
x    | Rank | Quantile | z-score
0.70 | 1    | 0.03     | −1.8339
0.77 | 2    | 0.10     | −1.2816
0.98 | 3    | 0.17     | −0.9674
1.12 | 4    | 0.23     | −0.7279
1.56 | 5    | 0.30     | −0.5244
1.80 | 6    | 0.37     | −0.3407
2.00 | 7    | 0.43     | −0.1679
2.00 | 7    | 0.43     | −0.1679
2.50 | 9    | 0.57     | 0.1679
2.90 | 10   | 0.63     | 0.3407
3.00 | 11   | 0.70     | 0.5244
3.50 | 12   | 0.77     | 0.7279
4.10 | 13   | 0.83     | 0.9674
5.00 | 14   | 0.90     | 1.2816
6.02 | 15   | 0.97     | 1.8339

Table 3.3.8: Calculation of ranking, quantile and z-scores in Microsoft® Excel®.
  | A   | B                 | C                    | D
1 | x   | Rank              | Quantile             | Z-score
2 | 0.7 | =RANK(A2, A:A, 1) | =(B2-0.5)/COUNT(B:B) | =NORMSINV(C2)

[Figure 3.3.3: Plot of values (x) against z-scores for the data set shown in Table 3.3.7. A normal distribution would be shown by a straight line; the data are not normally distributed, because the lower values of x fall below the line.]

Take it further
There are more detailed approaches to determining normality. Use the Internet to search for the Jarque-Bera and Lilliefors tests.

Standardisation
As discussed above, the z-score (or z-value or standard value) of a data point is a measure of the number of standard deviations the value lies above the mean. The process of converting data points into z-values is called standardisation (sometimes also called normalisation) and is done using the formula:

z = \frac{x - \mu}{\sigma}

It should be noted that the mean itself has a z-score of zero, and that a value of x lying exactly one standard deviation above the mean has a z-score of 1. If you standardise the normal distribution, it looks like the graph shown in Figure 3.3.4.

[Figure 3.3.4: A plot of value frequency against z-score for a standardised normal distribution, with z-values running from −4 to 4.]

Key terms
Standardised: Data that have been converted into z-scores, showing the number of standard deviations above the mean.
Population: A complete data set.

In a normal distribution, standardised or otherwise, 68.2% of the data set will lie within one standard deviation (i.e. a z-score of 1) either side of the mean; 95.4% will lie within two standard deviations (Table 3.3.9).

Table 3.3.9: Relationship between normally distributed data, percentiles and z-scores.
Z-score | Percentage of data between mean and z-score (2 d.p.) | Percentile (2 d.p.)
−∞      | 50.00%  | 0.00th
−4      | 49.99%  | 0.01th
−3      | 49.86%  | 0.14th
−2      | 47.72%  | 2.28th
−1      | 34.13%  | 15.87th
0       | 0.00%   | 50.00th
1       | 34.13%  | 84.13th
2       | 47.72%  | 97.72th
3       | 49.86%  | 99.86th
4       | 49.99%  | 99.99th
+∞      | 50.00%  | 100th

This property of the distribution is extremely useful: statistical hypotheses can be tested on the basis that values with z-scores beyond ±3 are rare. A natural process that follows a normal distribution can be expected to produce values within four standard deviations of the mean 99.99% of the time it is sampled.
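The percentages in Table 3.3.9 come directly from the cumulative distribution function of the standard normal distribution. A minimal sketch (not from the source), assuming SciPy is available:

```python
# Minimal sketch: reproduce Table 3.3.9 from the standard normal CDF.
from scipy.stats import norm

for z in range(-4, 5):
    between = abs(norm.cdf(z) - 0.5)   # share of data between the mean and z
    percentile = norm.cdf(z) * 100     # share of data below z
    print(f"z = {z:+d}: {between:.2%} between mean and z; {percentile:.2f}th percentile")
# z = +1 gives 34.13% and the 84.13th percentile, matching the table
```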
Link
The material covered in this section is prerequisite knowledge for Unit 10: Statistics for experimental design, where the concepts of samples of populations, standard errors and confidence limits are examined to an appropriate level.

Checklist
At the end of this section you should be able to:
• demonstrate the nature of normal distributions using a sample of continuous scientific data by conducting a normality test such as a normal quantile plot.

3 Statistical testing

The logical step forward from examining one set of data, or the results from one experiment, is to compare them against other sets of information: for example, testing whether a given catalyst genuinely increases the rate of a chemical reaction for all substances, or whether a newly created anti-inflammatory drug is more effective than currently used drugs.

Hypothesis testing

Key terms
Hypothesis: A proposed explanation for an observation, which can be tested using the scientific method.
Null hypothesis: A general or default position that there is no relationship between two measured phenomena.
Alternative hypothesis: A rival to a null hypothesis, typically postulating that there is a statistical relationship between two phenomena, which must be supported by applying a statistical hypothesis test.
P-value: The probability of observing a value at least as extreme as the one obtained, assuming that the null hypothesis is true.

Statistical hypothesis testing involves assessing scientific data or information in such a way as to judge whether or not any observed patterns are present purely by chance, as measured against a pre-determined probability limit. This is most commonly done using the so-called null hypothesis (denoted mathematically by H0) – a statement that there is no relationship between two recorded events, or that a tested treatment has no effect on the bodies to which it has been applied. The alternative hypothesis (H1) is effectively an opposing statement: that there is a relationship present. The default position in hypothesis testing is that the null hypothesis is the one that should be accepted; this means that the values and patterns found in the sets of observed data are assumed to have arisen purely by chance.

To begin the analysis, a statistic of the data, such as the mean or the frequency of a specific category, needs to be chosen and calculated. Then the appropriate distribution for this statistic needs to be selected; for example, calculating the means of numerous samples typically results in a normal distribution. Next, you calculate the probability (known as the p-value) that a value at least as large/small as the tested statistic would appear in the selected distribution.

For example, suppose that the distribution of the means of a collection of 1000 samples is normal, with a mean of 10 and a standard deviation of 2. Another sample is taken and its mean is found to be 13.5; in this normal distribution, that corresponds to a z-score of (13.5 − 10)/2 = +1.75, and the resulting p-value is about 0.04. This is quite a low probability, suggesting that such a mean is unlikely to occur by chance.

But how do we really know this? We cannot know for certain, but we can make a decision that if the p-value is smaller than an arbitrarily chosen cut-off point
(called the significance level or value, usually denoted by the letter α), then, since it is unlikely that such a statistic would occur by chance alone, it is unlikely that the null hypothesis is true, and so we must reject it in favour of the alternative hypothesis on the basis of this evidence. Thus, if we choose a level of 5% (0.05), we are effectively saying that any result whose p-value falls below 5% is treated as unlikely to have appeared by chance.

Key term
Significance level/value: An arbitrarily chosen cut-off point (α) above which an observed p-value is deemed likely to have occurred by chance rather than being statistically significant.

Note that, in either case, the null hypothesis cannot be completely proven or disproven – collected data provides evidence with which to make judgements, not certainties. No statistical test can say whether either of the hypotheses is actually true or false.

Displaying hypotheses mathematically
The previous statements about the null hypothesis, p-value and significance level can be summarised more conveniently using mathematical notation. Using the example above, we would write the following:

population: μ = 10; σ = 2
sample: x̄ = 13.5
H0: x̄ = 10, stating that the mean of the sample should be about 10
H1: x̄ > 10, stating that the mean of the sample should be more than 10
z-score: z = +1.75
p-value: p = 0.04, the probability of a value in a normal distribution having a z-score equal to or greater than +1.75
significance level: α = 0.05

Since p < α, the sample suggests that it is unlikely that the observed statistic occurred just by chance, and H0 should be rejected in favour of H1.

Significance levels
Strictly speaking, the significance level of a hypothesis test is the probability that the null hypothesis is incorrectly rejected in favour of the alternative one. The smaller the level, the greater the evidence the data needs to present in order to reject the null hypothesis. Typical levels are 5% and 1%, but there are no set rules for determining what level to use and, even if H0 is rejected at one level, there will always be a lower level at which H0 cannot be rejected. Although arbitrary, the level needs to be chosen with care. You might believe it best always to choose a very low value (e.g. 0.01%) so as to be 'statistically certain' that the null hypothesis is correctly rejected or retained but, in doing so, the p-value could always turn out to be larger than the significance level, no matter how the experiment is conducted. The subjective nature of selecting the significance value of a hypothesis test has been a matter of controversy among academics for some time, and it has been argued that more emphasis should be placed on the p-value itself, rather than on the significance level.

One-tailed and two-tailed tests
Another matter of subjectivity is the choice of direction in the hypotheses: namely, whether the test is one-tailed or two-tailed. Originally referring to the extreme ends of the normal distribution curve, these terms now identify whether you are testing for a difference in a specific direction between the alternative and null hypothesis parameters (one tail) or for any difference at all (two tails).

Table 3.3.10: Comparison of one-tailed and two-tailed tests for testing specific or general differences between the null and alternative hypotheses.
One-tailed test: H0: μ = 12; H1: μ > 12. The null hypothesis (H0) is that the mean (μ) is 12. The alternative (H1) is that the mean is greater than 12.
Two-tailed test: H0: μ = 12; H1: μ ≠ 12. The null hypothesis is that the mean is 12. The alternative is that the mean is any value other than 12, greater or smaller.

Link
The scope of statistical testing in this unit is limited; it is examined in more detail in Unit 10: Statistics for experimental design.

The choice of test is important because it affects how the significance level is used: essentially, the chosen significance level is split between the two tails in a two-tailed test, i.e. if it is set at 10%, the data is examined against 5% in each tail. A short check of both p-values for the earlier example is given below.
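The following minimal sketch (not from the source, and assuming SciPy) reproduces the arithmetic both ways:

```python
# Minimal sketch: one- and two-tailed p-values for the earlier example
# (population mean 10, standard deviation 2, sample mean 13.5).
from scipy.stats import norm

z = (13.5 - 10) / 2                         # z = +1.75
p_one_tailed = 1 - norm.cdf(z)              # P(Z >= 1.75), about 0.04
p_two_tailed = 2 * (1 - norm.cdf(abs(z)))   # both tails, about 0.08

alpha = 0.05
print(p_one_tailed < alpha)   # True: H0 rejected in the one-tailed test
print(p_two_tailed < alpha)   # False: H0 cannot be rejected in the two-tailed test
```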
For example, in the introduction to hypothesis testing we performed a one-tailed test of the mean. Had we instead considered whether the mean could be greater or smaller than 10, the p-value of 0.04 would have been greater than half of the chosen 5% significance level, meaning that we could not reject H0 on that evidence.

For this particular study, two common statistical tests will be examined, both of which test for statistical association between sets of data.

Pearson's chi-squared test for independence
The chi-squared test (chi refers to the Greek letter χ and is pronounced 'ki', rhyming with 'pie') comes in many forms, but all involve comparing the distribution of the values in the data set against the mathematical chi-squared distribution. Pearson's test for independence is used to test data from two categoric variables in the same population, to assess whether it is statistically likely that the two are dependent on each other. The null hypothesis in such cases is that they are not dependent, i.e. that they are independent of each other. Such data is displayed in what is called a contingency table (m rows × n columns), although it does not matter which variable is listed in the rows or columns. The Pearson test statistic (the chi-squared value) is calculated using the formula:

\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}

where O_i is the observed frequency for each value i in the data set (n values in total) and E_i is the expected frequency, based on the null hypothesis being true. The expected value for each cell in the table is calculated using the formula:

E_i = \frac{n_r \times n_c}{n_T}

where n_r and n_c are the totals of each row and column respectively, and n_T is the total of the complete table. Once calculated, the chi-squared value is then used with an appropriate data table (e.g. http://www.medcalc.org/manual/chi-square-table.php) or technology (such as an online calculator, http://danielsoper.com/statcalc3/calc.aspx?id=11); two more pieces of information are required before obtaining the final answer:
• the degrees of freedom in the data set (df or ν)
• the significance level.

The former has a precise mathematical definition but, for this study, it should suffice to say that it is a measure of how many of the values used to determine the chi-squared statistic are free to vary. For tests of independence in contingency tables:

df = (m - 1) \times (n - 1)

where m and n are the numbers of rows and columns in the respective table. Once all of the elements for determining the final answer have been identified, it is then a case of using the table or technology to produce the probability (the p-value) of randomly achieving a test statistic at least as high as the one calculated. Should this probability be less than the stated significance level, we reject the null hypothesis in favour of the alternative hypothesis, i.e. the data provides statistically significant evidence, at that level, that the two variables are not independent of each other.
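In practice the whole procedure is a single call in most statistics software. The following is a minimal sketch (not from the source), assuming SciPy; it uses the survey data that the worked example below analyses by hand.

```python
# Minimal sketch: Pearson's chi-squared test for independence on the beetle
# contingency table (Table 3.3.11).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[140, 104],    # grassy
                     [105,  55],    # sandy
                     [350, 275]])   # wet

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2))   # 4.86
print(dof)              # 2
print(round(p, 3))      # 0.088 -> larger than 0.05, so H0 cannot be rejected
print(expected)         # unrounded expected frequencies (cf. Table 3.3.12)
```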
Example
A biological field survey of two species of beetle examined the number of sightings in three different habitats (Table 3.3.11). The null hypothesis is that the type of habitat is not statistically significant, i.e. there is no dependence between habitat type and the number of beetles of each species. The significance level for this hypothesis test was chosen to be 5% (0.05).

Table 3.3.11: Frequency of beetle sightings in three different habitats, considering two different beetle species.
Habitat | Beetle A | Beetle B | Totals
grassy  | 140      | 104      | 244
sandy   | 105      | 55       | 160
wet     | 350      | 275      | 625
Totals  | 595      | 434      | 1029

The expected values are shown in Table 3.3.12. Note that, although expected frequencies often turn out to be non-integers, categoric variables only produce integer data, so the values have been rounded for display.

Table 3.3.12: Expected values for the distribution of beetles in three different habitats if the null hypothesis is true and there is no relationship between the abundance of each species and habitat type. The values have been rounded to the nearest whole number; the grassy/Beetle A cell, for example, is (244 × 595)/1029 = 141.088 ≈ 141.
Habitat | Beetle A | Beetle B
grassy  | 141      | 103
sandy   | 93       | 67
wet     | 361      | 264

Summing all six terms of the Pearson chi-squared statistic, using the unrounded expected frequencies, gives:

\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} = \frac{(140 - 141.1)^2}{141.1} + \frac{(104 - 102.9)^2}{102.9} + \frac{(105 - 92.5)^2}{92.5} + \frac{(55 - 67.5)^2}{67.5} + \ldots \approx 4.86

df = (3 - 1) \times (2 - 1) = 2

The Pearson chi-squared statistic for the observed data with 2 degrees of freedom is therefore approximately 4.86, and the corresponding p-value is approximately 0.088; equivalently, 4.86 is below 5.99, the critical chi-squared value at the 5% level for 2 degrees of freedom. Since the p-value is larger than the stated significance level of 0.05, the null hypothesis cannot be rejected in favour of the alternative hypothesis: the data does not provide statistically significant evidence of a dependence between habitat type and beetle species.

Pearson's chi-squared test for independence is not appropriate for every scenario but, for ones where the data has been randomly sampled without bias and the population is much larger than each sample, Pearson's test is satisfactory.

Pearson's product moment correlation coefficient
As seen earlier in this unit, two continuous variables can be plotted against each other to see if there is any linear correlation. The strength of the linear correlation between two variables is expressed by a value called the coefficient of linear correlation (usually denoted r). This value can be determined using the following formula:

r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}

The outcome of this formula indicates the level of linear correlation in the data analysed:
• r = −1: perfect negative linear correlation
• r = 0: no linear correlation
• r = +1: perfect positive linear correlation.

A software sketch and a worked example follow.
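The coefficient is a single call in most statistics libraries. A minimal sketch (not from the source), assuming SciPy, using the data from the worked example below:

```python
# Minimal sketch: Pearson's r for the light / root mass data of Table 3.3.13.
from scipy.stats import pearsonr

light = [10, 20, 30, 40, 50, 60, 70]                     # light (dlm)
root_mass = [0.22, 0.40, 0.61, 0.85, 1.20, 1.45, 1.70]   # root mass (g)

r, p_value = pearsonr(light, root_mass)
print(round(r, 3))   # 0.996, a strong positive linear correlation
```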
Example
Earlier in this unit, the impact of incident light on root biomass was used as an example to demonstrate the use of linear regression. The same data will be used again to find the product moment correlation coefficient (Table 3.3.13). The data can be used to determine the quantities required to calculate the coefficient of linear correlation (Table 3.3.14) and the coefficient itself (Table 3.3.15).

Table 3.3.13: Root mass of plants after exposure to different amounts of light for the same duration.
Light (dlm)   | 10   | 20   | 30   | 40   | 50   | 60   | 70
Root mass (g) | 0.22 | 0.40 | 0.61 | 0.85 | 1.20 | 1.45 | 1.70

x̄ = (10 + 20 + 30 + 40 + 50 + 60 + 70)/7 = 280/7 = 40
ȳ = (0.22 + 0.40 + 0.61 + 0.85 + 1.20 + 1.45 + 1.70)/7 = 6.43/7 = 0.9186

Table 3.3.14: Root mass and light intensity data from Table 3.3.13 used to determine the quantities needed to calculate the coefficient of linear correlation.
x  | y    | x − x̄ | y − ȳ
10 | 0.22 | −30   | −0.699
20 | 0.40 | −20   | −0.519
30 | 0.61 | −10   | −0.309
40 | 0.85 | 0     | −0.069
50 | 1.20 | 10    | 0.281
60 | 1.45 | 20    | 0.531
70 | 1.70 | 30    | 0.781

Table 3.3.15: Calculation of the coefficient of linear correlation for the root mass and light intensity data from Table 3.3.13.
(x − x̄)(y − ȳ)           | (x − x̄)²         | (y − ȳ)²
−30 × −0.699 = 20.97     | (−30)² = 900     | (−0.699)² = 0.489
10.38                    | 400              | 0.269
3.09                     | 100              | 0.095
0                        | 0                | 0.005
2.81                     | 100              | 0.079
10.62                    | 400              | 0.282
23.43                    | 900              | 0.610
Σ(x − x̄)(y − ȳ) = 71.30  | Σ(x − x̄)² = 2800 | Σ(y − ȳ)² = 1.829

Therefore, the value of r is:

r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} = \frac{71.30}{\sqrt{2800 \times 1.829}} = 0.996

This value is close to +1, which suggests that there is a strong positive linear correlation between the two variables.

Use of the product moment correlation coefficient is very common and meaningful, provided that the sample size, means and standard deviations are reliable; however, the coefficient is particularly sensitive to data sets containing outlying values, that is, values notably different from the rest. Such values may be due to errors, but they may also be genuine features of the process being investigated, and the coefficient itself cannot distinguish between the two. Technology is preferable for determining the coefficient, for obvious reasons, but not all applications state the value directly; for example, Microsoft® Excel® displays r² rather than r in its regression and trendline output.

Checklist
At the end of this topic guide you should be able to carry out hypothesis testing using standard statistical tests and draw conclusions by using:
• Pearson's chi-squared test for independence
• Pearson's product moment correlation coefficient.

Link
The material covered in this section is prerequisite knowledge for Unit 10: Statistics for experimental design, where it is explored in considerably more depth and detail.

Further reading
Use the Internet to search for information about other statistical tests, such as the z-, F- and t-tests. There are many textbooks available on the subject and the following list is no more than suggested reading:
Boslaugh, S. (2012) Statistics in a Nutshell, O'Reilly Media
Currell, G. and Dowman, A. (2009) Essential Mathematics and Statistics for Science, Wiley
Miller, J. and Miller, J. (2010) Statistics and Chemometrics for Analytical Chemistry, Prentice Hall
Samuels, M. et al. (2010) Statistics for the Life Sciences, Pearson
van Emden, H. (2008) Statistics for Terrified Biologists, Wiley-Blackwell