SSACgnp.TD883.AOF1.1

Something is Askew at Mammoth Cave National Park
How are nutrient data distributed and how can we best communicate the central tendency?

Core Quantitative Issues: Descriptive Statistics and Distribution
Supporting Quantitative Literacy Topics: Arithmetic Mean and Standard Deviation; Geometric Mean and Multiplicative Standard Deviation; Graphical Representation of Data; Logarithms
Core Geoscience Subject: Nutrients in Surface Waters

Amie O. West
Department of Geology, University of South Florida, Tampa, FL 33620

© 2012 University of South Florida Libraries. All rights reserved. This material is based upon work supported by the National Science Foundation under Grant Number NSF DUE-0836566. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Slide 2: Getting started

After completing this module you should be able to:
• Generate a frequency histogram in Excel.
• Understand skewness.
• Understand the characteristics of the normal distribution.
• Log-transform data.
• Compute the geometric mean.
• Compute and apply the multiplicative standard deviation.
• Understand the characteristics of the lognormal distribution.

And you should also know where Mammoth Cave National Park is: Kentucky.

Slide 3: The setting – Mammoth Cave National Park

Mammoth Cave National Park is the most extensive known cave system in the world. It was recognized as a World Heritage Site in 1981 and as an International Biosphere Reserve in 1990. The cave, which has been forming in stages over the last 10 million years, contains almost every known type of cave formation and is the most biodiverse cave system known in the world. The relative stability of the cave environment helps preserve both its features and its organisms; however, this makes them more sensitive to perturbations such as changes in the flow and/or chemistry of the air and water.
These perturbations are often triggered by anthropogenic activities at the surface.

[Images: sinkhole plain, soda straw, Green River, Frozen Niagara]

Slide 4: Geologic setting

Mammoth Cave formed in the last 10 million years in Mississippian-age limestone (deposited 360 to 320 million years ago). This limestone is capped by the Pennsylvanian-age Big Clifty Sandstone (320 to 300 million years ago). Because sandstones are more resistant to dissolution than limestone, the Big Clifty Sandstone protected much of the underlying limestone from dissolving; however, erosion took its toll on the Big Clifty Sandstone, and over the last 10 million years water made its way into the limestone and dissolved it. Since the layered rocks of the region were (and still are) tilted to the northwest, water worked its way along the limestone layers to form the large passages that you traverse on most of the Mammoth Cave tours.

Slide 5: Hydrologic setting

Mammoth Cave National Park has been sculpted by water over the last ten million years. Today, the Mammoth Cave karst aquifer is highly transmissive, meaning it responds quickly to rainfall events, so the chemical characteristics of the groundwater are also influenced by rainfall. Because much of the watershed that contributes to the park’s groundwater and surface rivers and streams lies outside park borders, nutrient concentrations in the park can be rather variable. Land uses in the areas surrounding Mammoth Cave National Park range from residential to industrial to agricultural. Groundwater emerges at many seeps and springs in the park and flows into area rivers and streams. The Green and Nolin Rivers, partially within Mammoth Cave National Park, are among the most biodiverse in the United States. Changes in nutrient concentration in these waters could significantly alter the biological characteristics of Mammoth Cave National Park.

Slide 6: Water Quality

There are many water resources inside Mammoth Cave National Park.
The two data sets provided here are nitrate-nitrogen and total phosphorus concentrations in surface waters within the park. These two nutrients are essential for life, but in excess they can disrupt the balance of the ecosystem. One concern with these nutrients is their use in fertilizers, both agricultural and residential. Eutrophication occurs when high levels of one or both of these nutrients increase algae growth and deplete oxygen in the water, which can have serious detrimental effects on biota. Collecting water-quality data can help park officials understand the baseline concentrations in the park’s waters and monitor for any effects of land-use changes inside and outside the park.

As of February 2011, Mammoth Cave National Park waters were not impaired. This is good news for two reasons. First, it suggests that establishing the park is having the desired effect of preserving our natural resources. Second, by describing the data from unimpaired waters, we will be able to recognize when pollutants begin to be introduced and, we hope, quickly institute remediation.

Slide 7: The Problem

Nutrient data collections can often be very large and difficult to interpret at first glance. Frequency histograms and descriptive statistics can communicate these data effectively so they can be used to identify contamination sources, to compare studies with other locations, or to develop environmental policies. A frequency histogram depicts a data set without making the viewer look at each and every data point. Descriptive statistics aim to tell a viewer where most of the data occur and how likely it is that any measurement will fall outside the central tendency.

[Spreadsheet legend: cells are color-coded as either a cell with a given value or a cell with a formula.]

Click on the Excel icon to the right and save the file immediately to your computer. The spreadsheet contains phosphorus and nitrogen-nitrate concentration data collected in surface waters of Mammoth Cave National Park.
Note: You might see “NULL” in some cells in your spreadsheet. This is normal, as logging devices sometimes malfunction and skip measurements.

Slide 8: Creating a frequency histogram

To create a histogram in Excel, you must first bin your data, so you need to decide how wide each bin should be. For our phosphorus data we can set the bin size to 0.02 mg/L, which gives a good picture of the frequency distribution. Create a frequency column next to the bin column. The FREQUENCY command counts how many values in the data set fall within each bin.

Note: The FREQUENCY command cannot just be dragged down to fill the rest of the column as is usually done in Excel. It must be entered as an array formula:

=FREQUENCY(A:A,C:C)

Highlight the cells in which you want the frequency values, in this case D2 through D127. Then click in the formula bar at the top of the spreadsheet and press the Control, Shift, and Enter keys at the same time (Command+Return on a Mac). For more help with creating a frequency histogram click here.

(Return to Slide 25)

Slide 9: Creating a frequency histogram (cont’d)

Now you want to create your chart. Highlight the bin values in column C and the frequency values in column D and insert a scatter chart.

Note: There is no automatic process for creating a histogram in Excel without installing the Analysis ToolPak add-in, so we will force it. Use the following steps.

Step 1: Double click on one of the markers in the chart to open the format data series window.
Step 2: Choose the Error Bars option on the left.
Step 3: Click on the Y-Error Bars tab on the top and choose Minus.
Step 4: Choose the percentage option and set it to 100%.
Step 5: Finally, on the left, choose Marker Style and select no marker. Then click OK.

Slide 10: A picture is worth 1000 words

This frequency histogram is a powerful image and can tell you a lot about the data. Just by looking, you can see where most of the measurements lie and that higher concentrations sometimes occur.
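The binning and counting steps described for Excel can also be sketched outside the spreadsheet. The minimal Python sketch below mimics FREQUENCY under the assumption that each bin counts the values greater than the previous bin value and less than or equal to its own, with one extra slot for values above the last bin. The function name frequency and the sample concentrations are hypothetical, chosen only for illustration; they are not the park data.

```python
from bisect import bisect_left

def frequency(values, bin_edges):
    """Count values per bin in the style of Excel's FREQUENCY.

    Slot i counts values v with bin_edges[i-1] < v <= bin_edges[i];
    the final extra slot counts values above the last edge.
    bin_edges must be sorted in ascending order.
    Non-numeric entries such as "NULL" are skipped.
    """
    counts = [0] * (len(bin_edges) + 1)
    for v in values:
        if not isinstance(v, (int, float)):
            continue  # skipped measurement, e.g. "NULL"
        counts[bisect_left(bin_edges, v)] += 1
    return counts

# Hypothetical concentrations (mg/L), including one skipped measurement.
sample = [0.03, 0.05, "NULL", 0.05, 0.07, 0.25]

# Bin upper bounds with the 0.02 bin size used in the module.
edges = [round(0.02 * i, 2) for i in range(1, 6)]  # 0.02, 0.04, ..., 0.10

counts = frequency(sample, edges)
for upper, count in zip(edges, counts):
    print(f"<= {upper:.2f}: {count}")
print(f" > {edges[-1]:.2f}: {counts[-1]}")
```

Plotting the counts against the bin values gives the same picture as the chart built from columns C and D.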
Don’t forget to label your axes!

Slide 11: Descriptive statistics

Now that we have seen what the data look like in a chart, we need to be able to communicate what they mean in numbers and words. Descriptive statistics are used all the time, in everything from test grades to income to how likely you are to get the flu. You are probably already very familiar with two of them: the median and the average (or arithmetic mean). Calculate these statistics in your phosphorus spreadsheet.

The median is the value that lies in the middle of the distribution. Half of the data are greater than the median, and half of the data are less than the median.

=MEDIAN(A2:A180)

The arithmetic mean is the center of mass and is calculated by the equation below. Imagine the data on a seesaw: the fulcrum must be at the arithmetic mean for the seesaw to balance.

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

=AVERAGE(A2:A180)

[Figure: frequency histogram of the phosphorus data with the median and the arithmetic mean marked. Axes: Phosphorus Concentration (mg/L), Frequency.]

Slide 12: Descriptive statistics (cont’d)

Another statistic that you may be used to using is the standard deviation. This value gives a distance above or below the arithmetic mean within which, in many cases, most of the data should fall. The standard deviation can be determined by the following equation.

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{n - 1}}$$

where
n is the number of observations,
x_i is the observed concentration,
\bar{x} is the arithmetic mean.

Calculate the standard deviation in your phosphorus spreadsheet. Luckily, there is a built-in Excel command to do this.

=STDEV(A2:A180)

Calculate the lower and the upper bounds of one standard deviation from the arithmetic mean. You would report these statistics as 0.42 ± 0.55 (mg/L).

Slide 13: The Problem

You may be wondering why the average and median values are so far apart and which one you should use to describe your data. First let us discuss the median.
The median is robust; that is, one or two values in a data set will not change it much, even if they are very large or very small. The arithmetic mean is another story: one very large or very small value can change it significantly. The standard deviation is also sensitive to high or low values. This sensitivity can sometimes make the statistics nearly meaningless as descriptors of the central region of the data set. To demonstrate this, we can create something like a number line that shows where the arithmetic mean and one standard deviation tell us most of our data might lie.

[Number line: -0.13 (mean minus one standard deviation), 0.42 (arithmetic mean), 0.97 (mean plus one standard deviation)]

But we know that we cannot observe negative nutrient concentrations. If we discard the negative part of the range, our seesaw becomes unbalanced, because the few high concentration values inflate the standard deviation. This happens because our data are not normally distributed.

Slide 14: The normal distribution

The normal distribution describes data that are symmetric about the median and the arithmetic mean (which are equal). This is the Gaussian curve (or bell curve) that you may have seen before. For normally distributed data, about 68% of the data will be within one standard deviation of the mean, 95% will be within two standard deviations, and 99.7% will be within three standard deviations of the mean. The standard deviations for normally distributed data will look like this.

[Figure: Assumed distribution of standard deviation about the mean. Axis: % of Data. Bands from minus 3 to plus 3 standard deviations about the arithmetic mean enclose 68%, 95%, and 99.7% of the data.]

(Return to Slide 22) (Return to Slide 23)

Slide 15: The normal distribution (cont’d)

The distribution of standard deviations about the mean for our phosphorus data looks like this. You can see that our distribution looks nothing like the normal distribution on the previous slide, and it is certainly not symmetric. Our data are skewed.
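To see how a single large value pulls the arithmetic mean and the standard deviation away from the bulk of the data, the comparison made on these slides can be sketched in Python. The values below are hypothetical, right-skewed numbers (not the park measurements), and the helper name share_within is an assumption made for illustration.

```python
from statistics import mean, median, stdev

# Hypothetical right-skewed concentrations (mg/L); not the park data.
sample = [1, 1, 2, 2, 3, 3, 4, 20]

def share_within(values, k):
    """Fraction of values within k sample standard deviations of the mean."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1), like Excel's STDEV
    lower, upper = center - k * spread, center + k * spread
    return sum(lower <= v <= upper for v in values) / len(values)

print("arithmetic mean:", mean(sample))
print("median:", median(sample))
print("mean minus one standard deviation:", mean(sample) - stdev(sample))
for k in (1, 2, 3):
    print(f"share within {k} standard deviation(s):", share_within(sample, k))
```

For normally distributed data the three shares would be about 68%, 95%, and 99.7%. In this sketch the skewed sample gives 87.5%, 87.5%, and 100%, the mean (4.5) sits well above the median (2.5), and the lower one-standard-deviation bound is negative, which mirrors the problem seen with the phosphorus data.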
[Figure: Actual distribution of standard deviation about the mean. Axis: % of Data. Bands from minus 3 to plus 3 standard deviations enclose 89%, 89%, and 97.9% of the data.]

(Return to Slide 22)

Slide 16: Skewness

Skewness is a statistic that describes the asymmetry of the data. Data that fit a normal distribution will look like the classic bell curve and will have a skewness of zero. Nutrient data are very often right-skewed, which means the histogram has a longer tail on the right. The skewness is calculated by the following equation, which is based on the third moment about the mean. This equation would be tricky to type into a single Excel cell. Thankfully, Excel has a command for calculating skewness.

$$\text{skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{3}$$

where
n is the number of observations,
x_i is the observed concentration,
\bar{x} is the arithmetic mean,
s is the standard deviation.

Slide 17: Skewness (cont’d)

Question 1: What can you say about the distribution of the phosphorus concentration data just by looking at the histogram?

Calculate the skewness of your phosphorus data set using the built-in Excel function. A positive skewness value indicates right-skewed data. A negative skewness value means the data are left-skewed.

=SKEW(A2:A180)

Slide 18: Log-transformation

There must be another way to describe our data, right? Yes. The geometric mean and multiplicative standard deviation are useful ways of describing skewed data sets. They are not that different from the average and standard deviation with which you are already familiar; they are simply computed on log-transformed data. Using Excel, this becomes a simple process.

Log-transformed data are the logarithms of the original data and can create a more symmetric histogram. Remember, the problem with our arithmetic mean and standard deviation was that they didn’t represent our right-skewed data very well.

Create a column that will calculate the logarithm (base 10) of each data value in your phosphorus spreadsheet. What about those “NULL” cells?
If you try to take the log of those you will have errors all over the place. Use an Excel logic function to remedy this. For more about the logic function click here.

=IFERROR(LOG(A2),"")

Slide 19: Log-transformation (cont’d)

Create a frequency histogram of your log-transformed data. Calculate the median, arithmetic mean, standard deviation, and skewness.

[Figure: frequency histogram of the log-transformed phosphorus data with the median and the arithmetic mean marked. Axes: Log10 of Phosphorus Concentration (mg/L), Frequency.]

We can see that our histogram of the log-transformed data is more symmetric than the previous one, and our median and mean values are closer together and located more toward the center of the data. Our lower skewness value confirms this. These numbers can now be used to calculate more descriptive statistics.

Slide 20: More descriptive statistics

Now what? We have a more symmetric histogram and some statistics, but what can we do with these numbers? How do we make them mean something? To make these values easier to interpret we need to perform back-transformation, that is, undo our transformation. Since we took the log10 of the data, we need to raise 10 to the power of the mean and of the standard deviation of the transformed data. The results are the geometric mean and the multiplicative standard deviation of the data. Excel has a command to calculate the geometric mean. For the equations for the geometric mean and multiplicative standard deviation, see Slide 24.

Calculate the geometric mean with back-transformation and with the Excel command. Confirm that they are equal. Then calculate the multiplicative standard deviation using back-transformation.

=GEOMEAN(A2:A180)
=10^J23
=10^H23

You would report these statistics as 0.23 ×⁄ 2.84 (mg/L).

Slide 21: More descriptive statistics (cont’d)

One thing we must remember when reporting the geometric mean and multiplicative standard deviation is the operator, ×⁄ rather than ±.
This means we will divide the geometric mean by the multiplicative standard deviation to determine the lower bound, and multiply to determine the upper bound. Calculate the lower and the upper bounds of one multiplicative standard deviation from the geometric mean.

=G33/I33
=G33*I33

[Number line: 0.08 (geometric mean divided by the multiplicative standard deviation), 0.23 (geometric mean), 0.67 (geometric mean multiplied by the multiplicative standard deviation)]

This gives us an asymmetric bracket around the data that lie within one multiplicative standard deviation of the geometric mean. An asymmetric bracket for asymmetric data: that makes sense! Notice we do not venture into negative values. We have balanced our seesaw!

Don’t throw the baby out with the bathwater! The arithmetic mean is not meaningless. We use it to calculate the total load of the nutrient when we are given a volume of water.

Slide 22: A better fit

If you will recall the images from Slide 14 and Slide 15, we showed what the standard deviations should look like and what they actually look like for our distribution. Below is what they look like when we use the geometric mean and the multiplicative standard deviation. While it may not be completely symmetric, it is a much better fit to the assumed distribution.

[Figure: Actual distribution of multiplicative standard deviation about the geometric mean. Axis: % of Data. Bands from minus 3 to plus 3 multiplicative standard deviations enclose 70%, 96%, and 100% of the data.]

Slide 23: One more thing

If our data were lognormal, that is, if the logarithms of the data were normally distributed, the chart on the previous slide would be exactly the same as our distribution chart on Slide 14. The arithmetic mean and median of the log-transformed data would be equal, the skewness of the log-transformed data would be zero, and the histogram of the log-transformed data would look exactly like the classic bell curve (shown below in blue).
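The back-transformation described in these slides, together with the lognormal check (mean of the logs equal to their median, skewness of the logs equal to zero), can be sketched in Python. This is a minimal sketch on hypothetical values, not the park data. The names log_summary and sample_skewness are assumptions made for illustration, and sample_skewness is written in the same form as the skewness equation given earlier in the module.

```python
import math
from statistics import geometric_mean, mean, median, stdev

def log_summary(values):
    """Return (geometric mean, multiplicative sd, lower bound, upper bound)."""
    # Skip non-numeric entries such as "NULL" before taking logarithms.
    logs = [math.log10(v) for v in values
            if isinstance(v, (int, float)) and v > 0]
    gm = 10 ** mean(logs)    # back-transform the mean of the logs
    msd = 10 ** stdev(logs)  # back-transform the standard deviation of the logs
    return gm, msd, gm / msd, gm * msd

def sample_skewness(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n, center, spread = len(values), mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - center) / spread) ** 3
                                         for v in values)

# Hypothetical concentrations (mg/L) whose logarithms are symmetric.
sample = [0.1, "NULL", 1.0, 10.0]
numeric = [v for v in sample if isinstance(v, (int, float))]
logs = [math.log10(v) for v in numeric]

gm, msd, lower, upper = log_summary(sample)
print("geometric mean (back-transformed):", gm)
print("geometric mean (library function):", geometric_mean(numeric))
print("multiplicative standard deviation:", msd)
print("bounds (divide, multiply):", lower, upper)
print("mean, median, skewness of logs:", mean(logs), median(logs),
      sample_skewness(logs))
```

In this sketch the two geometric means agree, the bounds are obtained by dividing and multiplying rather than subtracting and adding, and the lower bound stays positive. Because the logarithms of the sample are perfectly symmetric, their mean equals their median and their skewness is zero, which is the idealized lognormal case described on this slide.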
[Figure: frequency histogram of the log-transformed phosphorus data with the classic bell curve overlaid in blue. Axes: Log10 of Phosphorus Concentration (mg/L), Frequency.]

Slide 24: Geometric mean and multiplicative standard deviation

The geometric mean can be calculated from the original data by the following equation.

$$x^{*} = \sqrt[n]{\prod_{i=1}^{n} x_i}$$

where
n is the number of observations,
x_i is the observed concentration.

The multiplicative standard deviation, written here as s^{*}, can be calculated with the following equation.

$$s^{*} = 10^{\sqrt{\dfrac{\sum_{i=1}^{n} \left( \log_{10} x_i - \log_{10} x^{*} \right)^{2}}{n - 1}}}$$

where
n is the number of observations,
x_i is the observed concentration,
x^{*} is the geometric mean.

With Excel it is often much simpler to transform the data and then calculate the geometric mean and multiplicative standard deviation. The geometric mean is simply the back-transformation of the average of the log-transformed data. The multiplicative standard deviation is the back-transformation of the standard deviation of the log-transformed data. That is a mouthful. If you use log10 to transform your data, then you will raise 10 to the power of the arithmetic mean of the transformed data to find the geometric mean, and 10 to the power of the standard deviation of the transformed data to find the multiplicative standard deviation.

(Return to Slide 20)

Slide 25: End-of-module assignment

1. Calculate the mean, median, and standard deviation for the nitrogen-nitrate data set.
2. Calculate the range of one standard deviation about the mean.
3. Create a frequency histogram for the nitrogen-nitrate data set.

(Return to Slide 8)

[Figure: frequency histogram of the nitrogen-nitrate data. Axes: Nitrogen-Nitrate Concentration (mg/L), Frequency.]

Slide 26: End-of-module assignment (cont’d)

1. Transform the nitrogen-nitrate data using the log10.
2. Calculate the mean, median, and standard deviation for the transformed data.
3. Calculate the geometric mean and multiplicative standard deviation of the nitrogen-nitrate data set.
4. Calculate the range of one multiplicative standard deviation about the geometric mean.
5. Create a frequency histogram for the log10 of the nitrogen-nitrate data set.

(Return to Slide 8)

[Figure: frequency histogram of the log-transformed nitrogen-nitrate data. Axes: Log10 of Nitrogen-Nitrate Concentration (mg/L), Frequency.]

Slide 27: End-of-module assignment (cont’d)

1. Describe the characteristics of skewed data sets.
2. Briefly discuss the difference between the arithmetic mean and the geometric mean.
3. Describe the benefits of the geometric mean in skewed data.
4. Why should we care about how water quality data are summarized?

[Images: Blue Spring, Green River]

Slide 28: References

Slides 2, 3 & 4 – images from NPS
Slide 5 – images from Amie O. West; sources: NPS, U.S. Geological Survey
Slide 6 – image and source: NPS Hydrographic and Impairment Statistics Database
Slide 7 – data from Cumberland Piedmont Network
Slide 16 – image from Bowman’s Website
Slide 27 – images from NPS