LC•GC Europe Online Supplement: statistics and data analysis

Understanding the Structure of Scientific Data
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK

This is the first in a series of articles that aims to promote the better use of statistics by scientists. The series intends to show everyone from bench chemists to laboratory managers that the application of many statistical methods does not require the services of a 'statistician' or a 'mathematician' to convert chemical data into useful information. Each article will be a concise introduction to a small subset of methods. Wherever possible, diagrams will be used and equations kept to a minimum; for those wanting more theory, references to relevant statistical books and standards will be included. By the end of the series, the scientist should have an understanding of the most common statistical methods and be able to perform the tests while avoiding the pitfalls inherent in their misapplication.

In this article we look at the initial steps in data analysis (i.e., exploratory data analysis) and how to calculate the basic summary statistics (the mean and sample standard deviation). These two processes, which increase our understanding of the data structure, are vital if the correct selection of more advanced statistical methods, and the correct interpretation of their results, are to be achieved. From that base we progress to significance testing (t-tests and the F-test). These statistics allow a comparison between two sets of results in an objective and unbiased way. For example, significance tests are useful when comparing a new analytical method with an old method, or when comparing the current day's production with that of the previous day.

Exploratory Data Analysis
Exploratory data analysis is a term used to describe a group of techniques (largely graphical in nature) that sheds light on the structure of the data.
Without this knowledge the scientist, or anyone else, cannot be sure they are using the correct form of statistical evaluation. The statistics and graphs referred to in this first section are applicable to a single column of data (i.e., univariate data), such as the number of analyses performed in a laboratory each month.

For small amounts of data (<15 points), a blob plot (also known as a dot plot) can be used to explore how the data set is distributed (Figure 1). Blob plots are constructed simply by drawing a line, marking it off with a suitable scale and plotting the data along the axis. A stem-and-leaf plot is yet another method for examining patterns in a data set. These are awkward to describe and often perceived as old fashioned, especially given the modern graphical packages available today; for the sake of completeness they are described in Box 1. For larger data sets, frequency histograms (Figure 2(a)) and Box and Whisker plots (Figure 2(b)) may be better options for displaying the data distribution. Once the data set is entered or, as is more usual with modern instrumentation, electronically imported, most modern PC statistical packages can construct these graph types with a few clicks of the mouse.

All of these plots can give an indication of the presence or absence of outliers (1). The frequency histogram, stem-and-leaf plot and blob plot can also indicate the type of distribution the data belong to. Remember that if the data set is from a non-normal distribution (2) (Figure 2(a) and possibly Figure 1(a)), what looks like an outlier may in fact be a genuine piece of information. The outliers are the most extreme points on the right-hand side of Figures 1(a) and 2(a). Note: Outliers, outlier tests and robust methods will be the subject of a later article. Assuming there are no obvious outliers, we still have to do one more plot to make sure we understand the data structure.
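The Box and Whisker plot's rule of thumb, flagging points more than 1.5 × the interquartile range beyond the quartiles, can also be applied numerically. Here is a minimal sketch, not part of the original article, using invented data values; note that Python's default 'exclusive' quartile convention may differ slightly from other statistical packages:

```python
import statistics

def iqr_outlier_bounds(data):
    """Return the (lower, upper) fences of the 1.5 x IQR rule.

    Points outside these fences are flagged as potential outliers,
    mirroring the whiskers of a Box and Whisker plot.
    """
    q1, _, q3 = statistics.quantiles(data, n=4)  # lower quartile, median, upper quartile
    iqr = q3 - q1                                # interquartile range (middle 50% of data)
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# invented example data: one value sits well away from the rest
data = [17.3, 17.4, 17.2, 17.5, 17.3, 17.4, 19.9]
low, high = iqr_outlier_bounds(data)
suspects = [x for x in data if x < low or x > high]
print(suspects)
```

As the article cautions, a flagged point is only a candidate for investigation; in a skewed (non-normal) distribution it may be a genuine piece of information.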
The individual results should be plotted against a time index (i.e., the order in which the data were obtained). If any systematic trends are observed (Figures 3(a)–3(c)) then the reasons for this must be investigated. Normal statistical methods assume a random distribution about the mean with time (Figure 3(d)); if this is not the case, the interpretation of the statistics can be erroneous.

[Figure 1: Blob plots of the raw data.]
[Figure 2: (a) Frequency histogram (number of data points in each bar) and (b) Box and Whisker plot, showing the lower quartile, median and upper quartile, whiskers extending 1.5 × the interquartile range beyond the quartiles, and points beyond the whiskers marked as outliers. The interquartile range is the range that contains the middle 50% of the data when it is sorted into ascending order.]
[Figure 3: Time-indexed plots. The four panels show n = 7, mean = 6, standard deviation = 2.16; n = 9, mean = 6, standard deviation = 2.65; n = 9, mean = 6, standard deviation = 1.80; and n = 9, mean = 6, standard deviation = 2.06.]

Summary Statistics
Summary statistics are used to make sense of large amounts of data. Typically, the mean, sample standard deviation, range, confidence intervals, quantiles (1), and measures of skewness and of spread/peakedness of the distribution (kurtosis) are reported (2). The mean and sample standard deviation are the most widely used and are discussed below, together with how they relate to the confidence intervals for normally distributed data.

Box 1: Stem-and-leaf plot
A stem-and-leaf plot is another method of examining patterns in a data set. It shows the range, where the values are concentrated, and the symmetry. This type of plot is constructed by splitting each value into a stem (the leading digits; 0.1 to 0.6 in the plot below) and a leaf (the trailing digit): thus, 0.216 is represented as 2|1 and 0.350 by 3|5. Note that the decimal places are truncated, not rounded, in this type of plot. Reading the plot below, we can see that the data values range from 0.12 to 0.63. The column on the left contains the depth information (i.e., how many leaves lie on the lines closest to that end of the range); thus, there are 13 points which lie between 0.40 and 0.63. The line containing the middle value is indicated differently, with a count (the number of items in that line) enclosed in parentheses.

Units = 0.1   1|2 = 0.12   Count = 42

   5   1|22677
  14   2|112224578
 (15)  3|000011122333355
  13   4|0047889
   6   5|56669
   1   6|3

The Mean
The average or arithmetic mean (3) is generally the first statistic everyone is taught to calculate. It is easily found using a calculator or spreadsheet and simply involves summing the individual results (x1, x2, x3, ..., xn) and dividing by the number of results (n):

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Unfortunately, the mean is often reported as an estimate of the 'true value' (µ) of whatever is being measured without considering the underlying distribution. This is a mistake. Before any statistic is calculated, the raw data should be carefully scrutinized and plotted as described above; an outlying point can have a big effect on the mean (compare Figure 1(a) with 1(b)).

The Standard Deviation (3)
The standard deviation is a measure of the spread of data (dispersion) about the mean and can again be calculated using a calculator or spreadsheet. There is, however, a slight added complication: if you look at a typical scientific calculator you will notice there are two types of
standard deviation (denoted by the symbols σn and σn−1, or σ and s). The correct one to use depends upon how the problem is framed. For example, suppose each batch of a chemical contains 10 sub-units. You are asked to analyse each sub-unit, in a single batch, for mercury contamination and report the mean mercury content and standard deviation. Now, if the mean and standard deviation are to be used solely with this analysed batch, then the 10 results represent the whole population (i.e., all are tested) and the correct standard deviation to use is the one for a population:

$\sigma_n = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}}$

If, however, the intended use of the results is to estimate the mercury contamination for several batches of the chemical, the 10 results then represent a sample from the whole population and the correct standard deviation to use is that for a sample:

$s_{n-1} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$

If you are using a statistical package you should always check that the correct standard deviation is being calculated for your particular problem.

[Figure 4: The relationship between the normal distribution curve, the mean and standard deviation: 68% of results lie within ±1, 95% within ±2 and 99.7% within ±3 standard deviations of the mean.]

[Figure 5: Comparison of different data sets. (a) Probably not different; would 'pass' the t-test (tcrit > tcalculated). (b) Probably different; would 'fail' the t-test (tcrit < tcalculated). (c) Could be different, but not enough data to say for sure; would 'pass' the t-test (tcrit > tcalculated). (d) Practically identical means, but with so many data points there is a small but statistically significant ('real') difference; would 'fail' the t-test (tcrit < tcalculated). (e) The spreads in the data as measured by the variance are similar; would 'pass' the F-test (Fcrit > Fcalculated). (f) The spreads as measured by the variance are different; would 'fail' the F-test (Fcrit < Fcalculated), hence (i) gives more consistent results than (ii). (g) Could be a different spread, but not enough data to say for sure; would 'pass' the F-test (Fcrit > Fcalculated).]

Interpreting the mean and standard deviation
If the distribution is normal (i.e., when the data are plotted they approximate to the curve shown in Figure 4) then the mean is located at the centre of the distribution. Sixty-eight per cent of the results will be contained within ±1 standard deviation from the mean, 95% within ±2 standard deviations and 99.7% within ±3 standard deviations. Using these facts it is possible to estimate a standard deviation from a stated confidence interval and, vice versa, a confidence interval from a standard deviation. For example, if a mean value of 0.72 ±0.02 g/L at the 95% confidence level is quoted then it follows that the standard deviation = 0.02/2 or 0.01 g/L. If the same figure was quoted at the 99.7% confidence level the standard deviation would be 0.02/3 or 0.0067 g/L.

Significance Testing
Suppose, for example, we have the following two sets of results for lead content in water: 17.3, 17.3, 17.4, 17.4 and 18.5, 18.6, 18.5, 18.6. It is fairly clear, by simply looking at the data, that the two sets are different. In reaching this conclusion you have probably considered the amount of data, the average for each set and the spread in the results. In many situations, however, the difference between two sets of data is not so clear. The application of significance tests gives us a more systematic way of assessing the results, with the added advantage of allowing us to express our conclusion with a stated degree of confidence.

What does significance mean? In statistics the words 'significant' and 'significance' have specific meanings.
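As an illustrative sketch (not from the original article), the mean, the two standard deviations and the confidence-interval conversion described above can be reproduced with Python's standard statistics module; the data values are the Method 1 selenium results that appear later in Table 3:

```python
import statistics

results = [4.2, 4.5, 6.8, 7.2, 4.3]   # Method 1 selenium results (Table 3)

mean = statistics.mean(results)        # x-bar = (sum of x_i) / n
s = statistics.stdev(results)          # sample SD: divides by (n - 1)
sigma = statistics.pstdev(results)     # population SD: divides by n

print(round(mean, 2), round(s, 3), round(sigma, 3))

# converting a quoted 95% confidence interval (mean +/- 2 SD) back to an SD:
half_width = 0.02                      # 0.72 +/- 0.02 g/L at the 95% level
s_est = half_width / 2                 # 0.01 g/L
s_est_997 = half_width / 3             # if quoted at 99.7% (+/- 3 SD) instead
```

In a spreadsheet the same distinction appears as STDEV (sample) versus STDEVP (population); as the article notes, the sample form is nearly always the one wanted.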
A significant difference means a difference that is unlikely to have occurred by chance. A significance test shows up differences unlikely to occur because of purely random variation. As previously mentioned, whether one set of results is significantly different from another depends not only on the magnitude of the difference in the means but also on the amount of data available and its spread. For example, consider the blob plots shown in Figure 5. For the two data sets shown in Figure 5(a), the means for set (i) and set (ii) are numerically different but, from the limited amount of information available, they are from a statistical point of view the same. For Figure 5(b), the means for set (i) and set (ii) are probably different, but when fewer data points are available (Figure 5(c)) we cannot be sure with any degree of confidence that the means are different, even if they are a long way apart. With a large number of data points, even a very small difference can be significant (Figure 5(d)). Similarly, when we are interested in comparing the spread of results, for example, when we want to know if method (i) gives more consistent results than method (ii), we have to take note of the amount of information available (Figures 5(e)–(g)). Fortunately, tables are published that show how large a difference needs to be before it can be considered not to have occurred by chance: critical t-values for differences between means, and critical F-values for differences between the spread of results (4).

Note: Significance is a function of sample size. Comparing very large samples will nearly always lead to a significant difference, but a statistically significant result is not necessarily an important result. For example, in Figure 5(d) there is a statistically significant difference, but does it really matter in practice?

Null hypothesis (H0): A statement describing what is being tested (i.e., there is no difference between the two means [mean1 = mean2]).
Alternate hypothesis (H1): A statement describing the alternative to the null hypothesis (i.e., there is a difference between the means [see two-tailed] or mean1 is ≥ mean2 [see one-tailed]).
Critical value (tcrit or Fcrit): The value obtained from statistical tables or statistical packages at a given confidence level, against which the result of applying a significance test is compared.
One-tailed: A one-tailed test is performed if the analyst is only interested in the answer when the result is different in one direction, for example, (1) the new production method results in a higher yield, or (2) the amount of waste product is reduced (i.e., a limit value ≤, >, <, or ≥ is used in the alternate hypothesis). In these cases the calculation to determine the t-value is the same as that for the two-tailed t-test, but the critical value is different.
Two-tailed: A two-tailed t-test is performed if the analyst is interested in any change, for example, is method A different from method B (i.e., ≠ is used in the alternate hypothesis)? Under most circumstances two-tailed t-tests should be performed.
Population: A large group of items or measurements under investigation (e.g., 2500 lots from a single batch of a certified reference material).
Sample: A group of items or measurements taken from the population (e.g., 25 lots of a certified reference material taken from a batch containing 2500 lots).
Table 1: Definitions of statistical terms (jargon) used in significance testing.

What is a t-test?
A t-test is a statistical procedure that can be used to compare mean values. A lot of jargon surrounds these tests (see Table 1 for definitions of the terms used below) but they are relatively simple to apply using the built-in functions of a spreadsheet like Excel or a statistical software package.
Using a calculator is also an option, but you have to know the correct formula to apply (see Table 2) and have access to statistical tables to look up the so-called critical values (4). Three worked examples are shown in Box 2 (5) to illustrate how the different t-tests are carried out and how to interpret the results.

What is an F-test?
An F-test compares the spread of results in two data sets to determine if they could reasonably be considered to come from the same parent distribution. The test can, therefore, be used to answer questions such as: are two methods equally precise? The measure of spread used in the F-test is the variance, which is simply the square of the standard deviation. The variances are ratioed (i.e., the variance of one set of data is divided by the variance of the other) to get the test value:

$F = \frac{s_1^2}{s_2^2}$

This F value is then compared with a critical value that tells us how big the ratio needs to be to rule out the difference in spread occurring by chance. The Fcrit value is found from tables using (n1−1) and (n2−1) degrees of freedom, at the appropriate level of confidence. [Note: it is usual to arrange s1 and s2 so that F > 1.] If the standard deviations are to be considered to come from the same population then Fcrit > F. As an example we use the data in Example 2 (see Box 2):

$F = \frac{2.750^2}{1.471^2} = 3.49$

Fcrit = 9.605 for (5−1) and (5−1) degrees of freedom at the 97.5% confidence level. As Fcrit > Fcalculated, we can conclude that the spreads of results in the two data sets are not significantly different and it is, therefore, reasonable to combine the two standard deviations as we have done.

Using statistical software (what is a p-value?)
When you use statistical software packages and some spreadsheet functions, the results of performing a significance test are often summarized as a p-value.
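The F-test arithmetic above can be checked from the raw Example 2 data (Table 3 in Box 2). A sketch in Python; the critical value 9.605 is taken from tables, as quoted in the article, not computed:

```python
import statistics

method_1 = [4.2, 4.5, 6.8, 7.2, 4.3]  # selenium results from Table 3 (Box 2)
method_2 = [9.2, 4.0, 1.9, 5.2, 3.5]

s1 = statistics.stdev(method_1)       # sample SD, about 1.471
s2 = statistics.stdev(method_2)       # sample SD, about 2.750

# ratio the variances, arranged so that F > 1
F = max(s1, s2) ** 2 / min(s1, s2) ** 2
print(round(F, 2))

# critical value from tables: (4, 4) degrees of freedom, 97.5% confidence
F_crit = 9.605
spreads_differ = F > F_crit           # False: spreads are compatible
```

Since the spreads are not significantly different, the two standard deviations may be pooled, as the article goes on to do in Example 2.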
The p-value represents an inverse index of the reliability of the statistic (i.e., the probability of error in accepting the observed result as valid). Thus, if we are comparing two means to see if they are different, a p-value of 0.10 is equivalent to saying we are 90% certain that the means are different; 0.05 is equivalent to saying we are 95% certain; and 0.01 that we are 99% certain, i.e., the confidence level is (1−p) × 100%. It is usual when analysing chemical data (but somewhat arbitrary) to say that p-values ≤ 0.05 are statistically significant.

Some assumptions behind significance testing
In most statistical tests it is assumed that the sample correctly represents the population and that the population follows a normal distribution. Although these assumptions are never complied with precisely, in a large number of situations where laboratory data are being used they are not grossly violated.

Conclusions
• Always plot your data and understand the patterns in it before calculating any statistic, even the arithmetic mean.
• Make sure the correct standard deviation is calculated for your particular circumstance. This will nearly always be the sample standard deviation (sn−1).
• Significance tests are used to compare, in an unbiased way, the means or spread (variance) of two data sets.
• The tests are easily performed using statistical routines in spreadsheets and statistical packages.
• The p-value is a measure of confidence in the result obtained when applying a significance test.

Acknowledgement
The preparation of this paper was supported under a contract with the UK Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement Programme (VAM) (6).

References
(1) ISO 3534 part 1: Statistics Vocabulary and Symbols. Part 1: Probability and General Statistical Terms (1993).
(2) BS 2846 part 7: Tests for Departure from Normality (1984).
(3) BS 2846 part 4 (ISO 2854): Techniques of Estimation Relating to Means and Variances (1976).
(4) D.V. Lindley and W.F. Scott, New Cambridge Elementary Statistical Tables (ISBN: 0 521 48485 5), Cambridge University Press (1995).
(5) T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide (ISBN: 085 404 4426), Royal Society of Chemistry (1997).
(6) M. Sargent, VAM Bulletin, Issue 13, 4–5 (Laboratory of the Government Chemist, Teddington, UK), Autumn 1995.

Shaun Burke currently works in the Food Technology Department of RHM Technology Ltd, High Wycombe, Buckinghamshire, UK. However, these articles were produced while he was working at LGC, Teddington, Middlesex, UK (http://www.lgc.co.uk).

Bibliography
1. G.B. Wetherill, Elementary Statistical Methods, Chapman and Hall, London, UK.
2. J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK.
3. J.W. Tukey, Exploratory Data Analysis, Addison-Wesley.
4. T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide (ISBN: 085 404 4426), Royal Society of Chemistry, London, UK (1997).
Which t-test to use when comparing:

• The long-term average (population mean, µ) with a sample mean:
  $t = \frac{|\bar{x} - \mu|}{s/\sqrt{n}}$

• The difference between paired results (e.g., two analytical methods applied to the same samples). For a two-tailed test:
  $t = \frac{|\bar{d}|\sqrt{n}}{s_d}$
  For a one-tailed test the sign is important:
  $t = \frac{\bar{d}\sqrt{n}}{s_d}$

• The difference between independent sample means with equal variances:
  $t = \frac{|\bar{x}_1 - \bar{x}_2|}{s_c\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$

• The difference between independent sample means with unequal variances†:
  $t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$

where x̄ is the sample mean, µ is the population mean, s is the standard deviation for the sample, n is the number of items in the sample, |d̄| is the absolute mean difference between pairs, d̄ is the mean difference between pairs, sd is the sample standard deviation for the pairs, x̄1 and x̄2 are two independent sample means, n1 and n2 are the number of items making up each sample, and sc is the combined (pooled) standard deviation, found using

$s_c = \sqrt{\frac{s_1^2(n_1-1) + s_2^2(n_2-1)}{n_1+n_2-2}}$

where s1 and s2 are the sample standard deviations.

†Note: The degrees of freedom (υ) used for looking up the critical t-value for independent sample means with unequal variances is given by

$\frac{1}{\upsilon} = \frac{s_1^4}{k^2 n_1^2 (n_1-1)} + \frac{s_2^4}{k^2 n_2^2 (n_2-1)}$, where $k = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$

Table 2: Summary of statistical formulae.

Box 2
Example 1
A chemist is asked to validate a new, more economical method of derivatization before analysing a solution by a standard gas chromatography method. The long-term mean for the check samples using the old method is 22.7 µg/L. For the new method the mean is 23.5 µg/L, based on 10 results with a standard deviation of 0.9 µg/L. Is the new method equivalent to the old? To answer this question we use the t-test to compare the two mean values. We start by stating exactly what we are trying to decide, in the form of two alternative hypotheses: (i) the means could really be the same, or (ii) the means could really be different.
In statistical terminology this is written as:
• The null hypothesis (H0): new method mean = long-term check sample mean.
• The alternative hypothesis (H1): new method mean ≠ long-term check sample mean.

To test the null hypothesis we calculate the t-value as below. Note that the calculated t-value is the ratio of the difference between the means to a measure of the spread (standard deviation) and the amount of data available (n):

$t = \frac{|23.5 - 22.7|}{0.9/\sqrt{10}} = 2.81$

In the final step of the significance test we compare the calculated t-value with the critical t-value obtained from tables (4). To look up the critical value we need to know three pieces of information:
(i) Are we interested in the direction of the difference between the two means, or only that there is a difference? That is, are we performing a one-sided or a two-sided t-test (see Table 1)? In the case above it is the latter, therefore the two-sided critical value is used.
(ii) The degrees of freedom: this is simply the number of data points minus one (n−1).
(iii) How certain do we want to be about our conclusions? It is normal practice in chemistry to select the 95% confidence level (i.e., about 1 in 20 times we perform the t-test we could arrive at an erroneous conclusion). However, in some situations this is an unacceptable level of error, such as in medical research. In these cases the 99% or even the 99.9% confidence level can be chosen.

            Results                      x̄      s
Method 1    4.2  4.5  6.8  7.2  4.3      5.40   1.471
Method 2    9.2  4.0  1.9  5.2  3.5      4.76   2.750
Table 3: Results from two methods used to determine concentrations of selenium.

tcrit = 2.26 at the 95% confidence level for 9 degrees of freedom. As tcalculated > tcrit we can reject the null hypothesis and conclude that we are 95% certain that there is a significant difference between the new and old methods. [Note: This does not mean the new derivatization method should be abandoned. A judgement needs to be made on the economics and on whether the results are 'fit for purpose'.
The significance test is only one piece of information to be considered.]

Example 2 (5)
Two methods for determining the concentration of selenium are to be compared. The results from each method are shown in Table 3. Using the t-test for independent sample means, we define the null hypothesis H0 as x̄1 = x̄2; that is, there is no difference between the means of the two methods (the alternative hypothesis is H1: x̄1 ≠ x̄2). If the two methods have sample standard deviations that are not significantly different (see What is an F-test?), we can combine (or pool) the standard deviations to give sc; if the standard deviations are significantly different, the t-test for unequal variances should be used instead (Table 2).

$s_c = \sqrt{\frac{1.471^2(5-1) + 2.750^2(5-1)}{5+5-2}} = 2.205$

Evaluating the test statistic:

$t = \frac{|5.40 - 4.76|}{2.205\sqrt{\frac{1}{5}+\frac{1}{5}}} = \frac{0.64}{1.395} = 0.459$

The 95% critical value is 2.306 for 8 (n1 + n2 − 2) degrees of freedom. This exceeds the calculated value of 0.459, thus the null hypothesis (H0) cannot be rejected and we conclude there is no significant difference between the means of the results given by the two methods.

Example 3 (5)
Two methods are available for determining the concentration of vitamins in foodstuffs. To compare the methods, several different sample matrices are prepared using the same technique. Each sample preparation is then divided into two aliquots, and readings are obtained using the two methods, ideally commencing at the same time to lessen the possible effects of sample deterioration. The results are shown in Table 4. The null hypothesis is H0: d̄ = 0, against the alternative H1: d̄ ≠ 0. The test is two-tailed, as we are interested in both d̄ < 0 and d̄ > 0. The mean difference is |d̄| = 0.475 and the sample standard deviation of the paired differences is sd = 0.700. Evaluating the test statistic:

$t = \frac{0.475\sqrt{8}}{0.700} = 1.918$

The tabulated value of tcrit (with 7 degrees of freedom, at the 95% confidence limit) is 2.365.
Since the calculated value is less than the critical value, H0 cannot be rejected and it follows that there is no significant difference between the two techniques.

Matrix    A (mg/g)    B (mg/g)    Difference (d)
1         2.52        3.17        -0.65
2         3.13        5.00        -1.87
3         4.33        4.03         0.30
4         2.25        2.38        -0.13
5         2.79        3.68        -0.89
6         3.04        2.94         0.10
7         2.19        2.83        -0.64
8         2.16        2.18        -0.02
Table 4: Comparison of two methods used to determine the concentration of vitamins in foodstuffs.
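The three worked examples in Box 2 can be verified numerically. Here is a sketch using only the Python standard library; the critical values quoted in the comments come from the article's tables, not from the code:

```python
import math
import statistics

# Example 1: one-sample t-test of the new-method mean against the
# long-term (population) mean
t1 = abs(23.5 - 22.7) / (0.9 / math.sqrt(10))
print(round(t1, 2))  # compare with tcrit = 2.26 (9 df, 95%, two-tailed)

# Example 2: independent sample means with a pooled standard deviation
m1 = [4.2, 4.5, 6.8, 7.2, 4.3]          # Table 3, Method 1
m2 = [9.2, 4.0, 1.9, 5.2, 3.5]          # Table 3, Method 2
s1, s2 = statistics.stdev(m1), statistics.stdev(m2)
n1, n2 = len(m1), len(m2)
sc = math.sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))
t2 = abs(statistics.mean(m1) - statistics.mean(m2)) / (sc * math.sqrt(1/n1 + 1/n2))
print(round(t2, 3))  # compare with tcrit = 2.306 (8 df, 95%)

# Example 3: paired t-test on the Table 4 differences
a = [2.52, 3.13, 4.33, 2.25, 2.79, 3.04, 2.19, 2.16]
b = [3.17, 5.00, 4.03, 2.38, 3.68, 2.94, 2.83, 2.18]
d = [x - y for x, y in zip(a, b)]
t3 = abs(statistics.mean(d)) * math.sqrt(len(d)) / statistics.stdev(d)
print(round(t3, 2))  # compare with tcrit = 2.365 (7 df, 95%)
```

Only Example 1 exceeds its critical value, matching the article's conclusions: a significant difference for the new derivatization method, and none for Examples 2 and 3.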