MODULE 7: DATA PRESENTATION AND STATISTICAL METHODS
Trainee Manual

Timetable

09h00 – 11h00    Lecture material interspersed with Excel exercises 1 and 2.
11h00 – 11h15    Tea break.
11h15 – 12h30    Lecture material interspersed with Excel exercises 3 and 4.
12h30 – 13h30    Lunch.
13h30 – 15h00    Lecture material interspersed with Excel exercises 5 and 6.
15h00 – 15h15    Tea break.
15h15 – Close    Mastery test.

TABLE OF CONTENTS

1. Module Objectives
2. Introduction
3. Statistics that describe the central position in a data set
4. Describing the dispersion or spread of data about the central position statistics
5. The difference between standard deviation and standard error of the average
6. How to determine if a data series is normally distributed?
7. Using Excel to calculate descriptive statistics
7.1. EXERCISE ONE
8. Comparing the sample average against a water quality standard
8.1. EXERCISE TWO
8.2. EXERCISE THREE
9. Comparing samples from two different sites
9.1. EXERCISE FOUR
10. Comparing two samples that are not independent
10.1. EXERCISE FIVE
11. Data presentation and time series analysis
11.1. EXERCISE SIX
12. Reference Books

1. Module Objectives

In this module, you will be introduced to some basic statistics that are relevant to marine pollution control, in particular, to the interpretation of water sample measurements in relation to water quality guidelines and standards. The emphasis of the module is on (1) computing the necessary statistics, (2) interpreting the statistics obtained and (3) the presentation of data. The statistics are computed with functions that are preprogrammed within the Excel spreadsheet. The actual computational equations of the various statistics will not be given -- any basic statistics textbook can be consulted for them (see reference books).

2. Introduction

Good decision-makers base their decisions on the best available information on hand. Information is derived from data, and "best available" implies that there is some loss of information, probably because of sampling or measurement error during collection of the data. In other words, there is some uncertainty. All information (whether from books, journals etc.)
is ultimately derived from data, and the above can be summarized by the following formula:

DATA = INFORMATION + ERROR

It should be obvious that good data results in good information with little error. Poor data, on the other hand, results in a loss of information due to large errors. Information can be extracted from data in the following ways:

- The data can be plotted, by means of graphs, to determine trends. When trends or patterns are identified they can be used to predict future values or states, or
- The data can be analyzed i.e. some statistic, such as the average, can be computed which is able to summarize a large amount of data into information that is easy to understand, or
- The data can be modeled, usually using a simple mathematical model that helps explain how a system works.

Statistics is therefore a discipline that attempts to increase the information obtained from data by fully understanding how error is incorporated into data.

3. Statistics that describe the central position in a data set

When confronted with a large body of data, one is inclined to summarize the data into smaller bits of information that are easier to understand. This also facilitates comparison with other data sets. As a first step, one tries to describe the central features of the data. Statistics that describe the central position of a data set are the average and the median. There are other measures of central location, such as the mode, but the mode is seldom used and is not very useful here. We will concentrate on the average and the median.

The average is only used when the underlying frequency distribution of the sampled data follows a normal distribution i.e. the frequency distribution is bell-shaped. If the distribution is not symmetrical about the average i.e. the distribution is skewed either to the left or to the right or forms some other shape (the distribution is non-normal), then the median is used.

4. Describing the dispersion or spread of data about the central position statistics

Two statistics (the standard deviation and the variance) describe the spread of data points about the average. These two statistics are related in the following way: the standard deviation is equal to the square root of the variance. A small standard deviation indicates that the spread (or range) about the average is small, while a large standard deviation indicates that the spread is large. For normally distributed data, the standard deviation is used to calculate a range about the average (a confidence interval) that contains a given percentage of the data points. For example:

- AVERAGE ± 1 STANDARD DEVIATION -- within this range about 68% of the sample data points are included.
- AVERAGE ± 2 STANDARD DEVIATIONS -- within this range about 95% of the sample data points are included. Note that AVERAGE – 2 STANDARD DEVIATIONS describes the left (lower) limit of the 95% confidence interval and AVERAGE + 2 STANDARD DEVIATIONS describes the right (upper) limit.
- AVERAGE ± 3 STANDARD DEVIATIONS -- within this range more than 99% of the sample data points are included.

In the case of the median, the FIRST QUARTILE (25th percentile) and the THIRD QUARTILE (75th percentile) describe the spread of data points around the median. In fact, the median is the 50th percentile. The Xth percentile is the data value below which X% of the data points lie. For example, the first quartile indicates that 25% of the sample data points are below its value while 75% of the data points are above its value. Other percentiles can be interpreted in a similar fashion. The difference between the third and first quartile is referred to as the interquartile range.

5. The difference between standard deviation and standard error of the average

As previously explained, the standard deviation describes the variability or spread of data around the average.
On the other hand, the standard error indicates how precisely the average is estimated. If the standard error is a large value then the average is poorly estimated. The standard deviation and the standard error are related by the following equation:

$$\text{Standard error} = \frac{\text{Standard deviation}}{\sqrt{\text{No. of observations in sample}}}$$

It should be apparent that as the number of observations in the sample increases, the standard error becomes smaller. Hence, for a given standard deviation, a larger sample provides a more precise estimate of the average. Given an estimate of the average and the standard error of the average (both derived from a sample), one can say with 95% confidence that the true population average lies within the interval AVERAGE ± 2 STANDARD ERRORS.

6. How to determine if a data series is normally distributed?

Earlier it was mentioned that the average and standard deviation statistics should only be used to describe data sets that are normally distributed, while the median and its first and third quartiles should be used to describe non-normal distributions. How do we decide if a distribution is normally distributed? A normal distribution has a number of characteristics that can be tested for. In a normal distribution:

- The average and median are the same or very close together, and
- The skew statistic is close to or equal to zero. Skewness characterizes the degree of asymmetry of a distribution around its average. Positive skewness (a large positive skew statistic) indicates a distribution with an asymmetric tail extending toward the right. Negative skewness (a large negative skew statistic) indicates a distribution with an asymmetric tail extending toward the left.

A third means of testing whether a distribution is normal or not is to draw up a frequency histogram and to visually check if the curve is bell-shaped. A histogram can be drawn in Excel by clicking on the tools/data analysis menu and then choosing the histogram submenu.

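The two numerical checks listed above (average against median, and the skew statistic) can also be tried outside Excel. The following Python sketch is only an illustration, not part of the course material: the function names and the two small data sets are invented, and the skew calculation assumes the adjusted sample-skewness formula that Excel documents for =SKEW().

```python
import statistics


def skew(values):
    """Adjusted sample skewness (assumed to match the formula documented for Excel's =SKEW())."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, like =STDEV()
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in values)


def normality_summary(values):
    """Collect the statistics used for the two numerical normality checks."""
    return {
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "skew": skew(values),
    }


# Hypothetical counts, invented for illustration only.
symmetric = [2, 4, 4, 5, 5, 5, 6, 6, 8]
right_skewed = [1, 1, 2, 2, 3, 4, 6, 12, 40]

for name, sample in (("symmetric", symmetric), ("right_skewed", right_skewed)):
    print(name, normality_summary(sample))
```

In the symmetric sample the average and median coincide and the skew statistic is zero; in the right-skewed sample the average is pulled well above the median and the skew statistic is positive, which would point toward using the median and quartiles instead.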
One has to decide, prior to drawing the histogram, the number of size classes (bins) to be used. A general rule is that the number of size classes = square root of the number of observations. This value is then rounded up or down for convenience. The increment by which each size class is increased = ((the maximum value in the sample – the minimum value in the sample) / the square root of the number of observations). This value is also rounded up or down for convenience.

7. Using Excel to calculate descriptive statistics

The statistics described so far are usually referred to as descriptive statistics. The following table summarizes some of them and includes the Excel functions that calculate them.

| Statistic | Excel function | Uses and remarks |
| Average | =AVERAGE(array) [1] | Estimates central location in a normally distributed sample |
| Standard deviation | =STDEV(array) | Describes variability about the average and can be used to calculate a range within which a certain percentage of the samples are included |
| Variance | =VAR(array) | Equals the standard deviation squared |
| Standard error of the average | =STDEV(array)/SQRT(COUNT(array)) [2] | Used to calculate confidence range for population average |
| Median | =MEDIAN(array) | The 50th percentile |
| First Quartile | =QUARTILE(array, 1) [3] or =PERCENTILE(array, 0.25) | The 25th percentile |
| Third Quartile | =QUARTILE(array, 3) or =PERCENTILE(array, 0.75) | The 75th percentile |
| Minimum value | =MIN(array) | Calculates the minimum value in a sample |
| Maximum value | =MAX(array) | Calculates the maximum value in a sample |
| Number of size classes | =SQRT(COUNT(array)) | To determine the number of size classes to include when drawing a histogram. Can be rounded up or down |
| Size class increment | =(MAX(array)-MIN(array))/SQRT(COUNT(array)) | To determine the size class increment. Can be rounded up or down |
| Number of data points | =COUNT(array) | Counts the number of observations in a sample |
| Skew | =SKEW(array) | To test the symmetry of a sample |

[1] Array refers to the cell address range that stores the sample data points. For example, if observations are entered in column A in rows 1 to 100, then the array address is A1:A100.
[2] This formula consists of a number of nested Excel functions: =STDEV(), =COUNT() and =SQRT(), with the latter function being the square root function.
[3] There are two different ways of calculating the quartiles in Excel.

7.1. EXERCISE ONE

Open the Excel file, MODULE 7 STATISTICS.XLS and go to the sheet named EXERCISE ONE. In the column headed E. coli a hundred data points are given. Each observation is from a water sample and each value represents the number of E. coli bacteria counted in 100 ml of water. The water samples come from a mariculture farm that is growing mussels for human consumption. The 100 water samples were drawn randomly i.e. 100 days in a year were chosen randomly and a single water sample was taken on each selected day. The water sample was drawn from the same area of the pond each time.

1. Determine by means of a histogram whether the bacterial counts are normally distributed. Indicate how you determined the number of size classes and the class increment.
2. Confirm your choice of distribution in 1 above by estimating the following statistics: the average, the median and the skewness. Give reasons for your choice of distribution (normal, skewed left or right).
3. Use the data set to practice entering the Excel functions given in the table above.

8. Comparing the sample average against a water quality standard

8.1. EXERCISE TWO

At times, one would like to know if a sample average is below a certain standard value, taking into account that there is variability about the sample average. For example, according to the South African water quality guidelines for coastal marine waters, the maximum acceptable concentration of mercury in seawater is 0.3 µg/l.
In EXERCISE TWO of your Excel spreadsheet the mercury concentrations of 20 samples are given in µg/l. The samples were taken from a 40 km stretch of beach on the same day. One would like to know: is the average of these samples below or equal to the standard value, at a confidence level of 95%? The statement that it is below or equal to the standard is referred to as the null hypothesis. The alternate hypothesis is that the sample average is greater than the standard value at a confidence level of 95%.

To answer the question we make use of the Student's t-test. Essentially the test involves calculating a t-statistic (using the given data) and then comparing the t-statistic to a t-critical value. The latter is calculated at a particular confidence level (usually 95%). From this comparison we can either reject or accept the null hypothesis. The t-statistic is calculated as follows:

$$t\text{-statistic} = \frac{\text{Sample average} - \text{Standard value}}{\sqrt{\text{Variance of the sample} / \text{Number of data points}}}$$

In Excel the t-statistic formula is written as:

=(AVERAGE(A2:A21)-0.3)/SQRT(VAR(A2:A21)/20)

where the cell range A2:A21 contains the sample mercury concentrations.

STEP 1
State the null hypothesis and the alternate hypothesis.
Null hypothesis: The sample average is less than or equal to 0.3.
Alternate hypothesis: The sample average is greater than 0.3.

STEP 2
Check to see if the sample is normally distributed. The Student's t-test is only applicable to a normally distributed sample.

STEP 3
Calculate the t-statistic in Excel. The answer should equal -1.05.

STEP 4
Calculate the t-critical value. To calculate the t-critical value we use a function in Excel called the =TINV() function. This function requires two input values termed the probability and the degrees of freedom. The function is therefore written as =TINV(X, Y) where X is the probability and Y is the degrees of freedom.
The probability is calculated as follows:

Probability = (1 - Confidence level as a fraction)*2

The probability at a confidence level of 95% is therefore (1-0.95)*2 = 0.1. Note that the above probability equation ONLY APPLIES TO A ONE-TAIL TEST; the doubling is needed because =TINV() returns a two-tailed critical value. If a probability for a two-tailed test is required then the probability = (1 - Confidence level as a fraction). The degrees of freedom = (number of data points – 1) i.e. df = 20-1 = 19. The final formula should look like this: =TINV(0.1,19) and should equal 1.73.

STEP 5
Compare the t-statistic with the t-critical value. If the t-statistic is greater than or equal to the t-critical value, we reject the null hypothesis. In this case the t-statistic (-1.05) is less than the t-critical value of 1.73, so we cannot reject the null hypothesis. We therefore conclude that, at the 95% confidence level, there is no evidence that the sample average is greater than the standard value of 0.3.

8.2. EXERCISE THREE

Repeat the above Student's t-test for a confidence level of 99%.

9. Comparing samples from two different sites

Suppose one collects data from two different sites and wants to compare the two sites to see if there is a statistical difference between the average values for these sites. Again, a Student's t-test is employed to make such a comparison. In this case, the test is based on a number of assumptions that need to be tested:

- The observations in each sample must follow a normal distribution.
- The observations from each sample must be independent of each other.

Consideration must also be given to the variances of the samples. If the variances are equal, we apply the Student's t-test called "t-test: two-sample assuming equal variances". If the variances are not equal, we apply the test called "t-test: two-sample assuming unequal variances". Both of these statistical tests are found in the tools/data analysis submenu in Excel.

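For readers who want to cross-check the Excel results, the t-statistics discussed so far can be sketched in Python. This is a minimal illustration under stated assumptions, not the course's method: the function names and the pH values are hypothetical, the one-sample function mirrors the Excel formula given in EXERCISE TWO, and the two-sample functions use the standard textbook pooled-variance and unequal-variance (Welch) forms, which are assumed to correspond to the two Excel tools named above. Critical values are not computed here; they would still come from =TINV() or the Excel output.

```python
import math
import statistics


def one_sample_t(values, standard):
    """t-statistic for a sample average against a fixed standard value."""
    n = len(values)
    return (statistics.mean(values) - standard) / math.sqrt(statistics.variance(values) / n)


def pooled_t(a, b):
    """Two-sample t-statistic assuming equal variances; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Two-sample t-statistic assuming unequal variances; returns (t, approximate df)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se_sq = va / na + vb / nb  # squared standard error of the difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_sq)
    df = se_sq ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical pH readings from two sites, invented for illustration only.
site_a = [7.1, 7.4, 7.2, 7.6, 7.3, 7.5]
site_b = [7.8, 7.6, 7.9, 7.7, 8.0]

print("pooled:", pooled_t(site_a, site_b))
print("welch: ", welch_t(site_a, site_b))
```

For a two-tailed comparison, the null hypothesis of equal averages would be rejected when the absolute value of the t-statistic exceeds the two-tailed t-critical value for the reported degrees of freedom.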
In the following exercise, a statistical test called an F-test will be employed to check if the variances of two samples are equal. This test is also found under the tools/data analysis submenu in Excel, where it is called the "F-test two-sample for variances".

9.1. EXERCISE FOUR

In EXERCISE FOUR, pH values are given. They come from two rivers in KwaZulu-Natal: the Umgeni and the Umfolozi. The pH was measured with an electronic pH meter. The pH values were measured from December 2000 to January 2001, on days chosen randomly over this period. Notice that an unequal number of samples was taken from the two rivers. The following steps check if the variances of the two samples are equal at a 95% confidence level.

STEP 1
Check if both samples come from a normal distribution. Now state the null hypothesis and the alternate hypothesis.
Null hypothesis: The variances are equal.
Alternate hypothesis: The variances are not equal.

STEP 2
Click on the tools/data analysis menu and move to the "F-test two-sample for variances" submenu. Fill in the input values as follows. Note that Alpha refers to the probability level, which is equal to (1 - confidence level as a fraction). Now click OK.

STEP 3
The output should look like the following diagram.

STEP 4
Now compare the F-statistic to the F-critical value. The F-statistic is less than the F-critical value and hence we cannot reject the null hypothesis. We therefore conclude that the two samples have the same variances.

We continue the exercise to see if there is a significant difference, at a confidence level of 95%, between the sample averages.

STEP 1
State the null hypothesis and the alternate hypothesis.
Null hypothesis: The average pH of the Umgeni = the average pH of the Umfolozi.
Alternate hypothesis: The average pH of the Umgeni ≠ the average pH of the Umfolozi.
STEP 2
Open the submenu "t-test: two-sample assuming equal variances" found under the tools/data analysis menu. Fill in the input values as follows. Now click OK.

STEP 3
The output should look like the following diagram.

STEP 4
Now compare the t-statistic with the two-tailed t-critical value. If the absolute value of the t-statistic is greater than the t-critical value, reject the null hypothesis; otherwise the null hypothesis cannot be rejected.

10. Comparing two samples that are not independent

As stated previously, when comparing two averages it is assumed that the two samples are independent. There are situations when this assumption may not be valid. A case in point is the measurement of a pollutant before and after a clean-up. To test if there is a significant difference between such samples, a "t-test: paired two-sample for means" is applied. This test is also found under the tools/data analysis menu in Excel.

10.1. EXERCISE FIVE

A sewage pipeline was damaged by a storm and raw sewage has leaked into an estuary. During the contamination, E. coli counts were made on samples from ten different sites in the estuary. These data are found in column A, in the sheet headed EXERCISE FIVE. Subsequently, the pipe was repaired and after a few weeks a further ten water samples were taken from exactly the same sites. These samples were also sent for analysis of E. coli. These counts are given in column B. Determine at a 95% confidence level whether there is a significant difference between these two samples.

11. Data presentation and time series analysis

Water quality data can be presented either in tabular or graphical form. Tables are used to convey quantitative information, such as the exact values of observations. Graphs, on the other hand, convey the general behavior of a data set, highlighting patterns and trends. Trends and patterns can be temporal or seasonal.
In the following exercise we will use graphical means to (1) identify seasonal patterns and (2) present the data in a form that easily conveys this information to the reader. It is important to note that time series analysis is a very complicated statistical procedure and is beyond this introductory course.

11.1. EXERCISE SIX

In EXERCISE SIX, the date and concentration of dissolved oxygen (in mg/l) are given. The readings were taken from the Umgeni River over a period of time. We suspect that these data are strongly seasonal, but a visual examination of the raw values does not reveal this (see below). To reveal the seasonal patterns we need to plot the data in a type of graph called an XY chart, where time is plotted on the X-axis and the dissolved oxygen concentration is plotted on the Y-axis. To plot such a graph we "block-off" the data to be plotted (see below).

Then click on insert and choose the submenu charts. It should look like the following:

The next step would look like this:

Fill in the following:

Your final chart or graph should look like this:

[Chart: "Time series analysis". Y-axis: Dissolved oxygen (mg/l). X-axis: Date (dd-mm-yy).]

Now modify your chart until it resembles the following chart:

To modify any aspect of your chart, click on the particular aspect that needs modification and then right-click to perform the modification.

The chart shows a curve with peaks and troughs. To check if the peaks occur over a particular season, hold the cursor over a peak data point and the X and Y co-ordinates of that point will be shown. Do the same for the troughs. Can you draw any general conclusions?

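As a complement to reading peaks and troughs off the chart, the same seasonal question can be approached numerically by averaging the readings for each calendar month. The Python sketch below is only an illustration under stated assumptions: the function name monthly_means and all of the readings are invented, and they are not the EXERCISE SIX data.

```python
from collections import defaultdict
from datetime import date
import statistics


def monthly_means(readings):
    """Average the readings for each calendar month (1 = January ... 12 = December)."""
    by_month = defaultdict(list)
    for day, value in readings:
        by_month[day.month].append(value)
    return {month: statistics.mean(values) for month, values in sorted(by_month.items())}


# Hypothetical (date, dissolved oxygen in mg/l) pairs, invented for illustration only.
readings = [
    (date(1993, 1, 31), 5.1),
    (date(1993, 7, 15), 9.8),
    (date(1994, 1, 20), 4.7),
    (date(1994, 7, 10), 10.2),
    (date(1995, 1, 25), 5.3),
    (date(1995, 7, 5), 9.5),
]

for month, mean_value in monthly_means(readings).items():
    print(f"month {month:2d}: mean dissolved oxygen = {mean_value:.2f} mg/l")
```

If the monthly averages differ consistently between, say, mid-summer and mid-winter months across several years, that supports a seasonal pattern; the chart remains the better tool for communicating it.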
Some tips when drawing graphs

- Content: the chart should contain the minimum amount of detail needed to convey the desired information. Complex charts obscure important features. Use separate charts for complicated subjects.
- Size of chart: try to fit a chart into a page or less. Charts that extend over two pages are difficult to read.
- Title and/or legend: every chart must have a title. The legend must have enough detail to interpret the chart without reading the surrounding text.
- Axes numbers: these should be large and easy to read.
- Axes labels: all axes must have labels with the unit of measurement.
- Scale: the chosen scale must be appropriate for the size of the chart and should facilitate interpolation if required. Gridlines are helpful when conveying quantitative information. If multiple charts are used, they should preferably have the same scale for easy comparison.

12. Reference Books

Phillips, L.P., Jr. 1992. How to think about statistics. Revised edition.
Zar, J.H. 1984. Biostatistical Analysis. 2nd edition.