SSACgnp.TD883.AOF1.1
Something is Askew at
Mammoth Cave National Park
How are nutrient data distributed and how can we best
communicate the central tendency?
Core Quantitative Issues
Descriptive Statistics and Distribution
Supporting Quantitative Literacy Topic
Arithmetic Mean and Standard Deviation
Geometric Mean and Multiplicative Standard Deviation
Graphical representation of data
Logarithms
Core Geoscience Subject
Nutrients in Surface Waters
Amie O. West
Department of Geology, University of South Florida, Tampa, FL 33620
© 2012 University of South Florida Libraries. All rights reserved.
This material is based upon work supported by the National Science Foundation under Grant Number NSF DUE-0836566.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect
the views of the National Science Foundation.
1
Getting started
After completing this module you should be able to:
• Generate a frequency histogram in Excel.
• Understand skewness.
• Understand the characteristics of the normal distribution.
• Log-transform data.
• Compute the geometric mean.
• Compute and apply the multiplicative standard deviation.
• Understand the characteristics of the lognormal distribution.
[Map: Kentucky]
And you should also know where
Mammoth Cave National Park is.
2
The setting – Mammoth Cave National Park
Mammoth Cave National Park contains the most extensive known cave system in the world. It was
recognized as a World Heritage Site in 1981 and as an International Biosphere Reserve in
1990. The cave, which has been forming in stages over the last 10 million years, contains
almost every known type of cave formation and is the most biodiverse cave system known in
the world. The relative stability of the cave environment helps preserve both its features and its
organisms; however, this makes them more sensitive to perturbations such as changes in the
flow and/or chemistry of the air and water. These perturbations are often triggered by
anthropogenic activities at the surface.
[Photos: sinkhole plain, soda straw, Green River, Frozen Niagara]
3
Geologic setting
Mammoth Cave formed in the last 10 million years in Mississippian-age limestone (deposited 360 to
320 million years ago). This limestone is capped by the Pennsylvanian-age Big Clifty Sandstone (320
to 300 million years ago). Because sandstones are more resistant to dissolution than limestone, the
Big Clifty Sandstone protected much of the underlying limestone from dissolving; however, erosion
took its toll on the Big Clifty Sandstone, and over the last 10 million years, water made its way into the
limestone to dissolve it. Since the layered rocks of the region were (and still are) tilted to the
northwest, water worked its way along the limestone layers to form the large passages through which
you traverse on most of the Mammoth Cave tours.
4
Hydrologic setting
Mammoth Cave National Park has been sculpted by water over the last ten million years. Today, the
Mammoth Cave karst aquifer is highly transmissive: it responds quickly to rainfall events, so the
chemical characteristics of the groundwater are also influenced by rainfall. Because
much of the watershed that contributes to the park’s groundwater and surface rivers and streams lies
outside park borders, nutrient concentrations in the park can be quite variable. Land uses in the
areas surrounding Mammoth Cave National Park range from residential to industrial to agricultural.
Groundwater emerges at many seeps and springs in the park and flows into area rivers and streams.
The Green and Nolin Rivers, partially within Mammoth Cave National Park, are among the most
biodiverse in the United States. Changes in nutrient concentration in these waters could significantly
alter the biological characteristics of Mammoth Cave National Park.
5
Water Quality
There are many water resources inside Mammoth Cave National Park. The two data sets
provided here are nitrate-nitrogen and total phosphorus concentrations in surface waters
within the park. These two nutrients are essential for life, but in excess they can disrupt the
balance of the ecosystem. One concern with these nutrients is their use in fertilizers, both
agricultural and residential. Eutrophication occurs when high levels of one or both of these
nutrients contribute to increased algae growth and the depletion of oxygen in the water. This
can have serious detrimental effects on biota.
Collecting water-quality data can help park officials
understand the baseline concentrations in the park’s
waters and monitor for any effects of land-use changes
inside and outside the park. As of February 2011,
Mammoth Cave National Park waters were not impaired.
This is good news for two reasons. First, we can consider
the establishment of the park as having the desired
effect, to preserve our natural resources. Second, by
describing the data from unimpaired waters we will be
able to recognize if pollutants begin to be introduced and
hopefully be able to quickly institute remediation.
6
The Problem
Nutrient data collections can often be very large and difficult to interpret at first glance.
Frequency histograms and descriptive statistics can communicate these data effectively so they
can be used to identify contamination sources, compare studies with other locations, or to
develop environmental policies.
A frequency histogram can help us depict a data set without making the viewer look at each
and every data point.
Descriptive statistics aim to tell a viewer where most of the data occur and how likely it is that
any measurement will result in a value outside the central tendency.
[Spreadsheet legend: one cell color marks a cell with a given value, another marks a cell with a formula]
Click on the Excel icon to the right and save the file
immediately to your computer. The spreadsheet
contains phosphorus and nitrogen-nitrate
concentration data collected in surface waters of
Mammoth Cave National Park.
Note: You might see “NULL” in some
cells in your spreadsheet. This is
normal, as logging devices sometimes
malfunction and skip measurements.
7
Creating a frequency histogram
To create a histogram in Excel, you must first bin your data. You will need to decide how
wide each bin should be. For our phosphorus data we can set the bin width to 0.02 mg/L. This will give
us a good picture of the frequency distribution.
Create a frequency column next to the bin
column. The FREQUENCY command counts
how many data values fall within each bin.
Note: The FREQUENCY command cannot just
be dragged down to fill the rest of the
column as is usually done in Excel. It must be entered as an array formula.
=FREQUENCY(A:A,C:C)
Control+Shift+Enter
(Command+Return on a Mac)
Next, highlight the cells in which you want the frequency values, in this case D2 through D127.
Then click in the formula bar at the top of the spreadsheet, type the formula, and press the
Control, Shift, and Enter keys at the same time.
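For readers who want to check the binning step outside Excel, it can be sketched in Python. This is a minimal sketch, assuming FREQUENCY's documented behavior (each bin value is an inclusive upper edge, and one extra count collects values above the last bin). The function name and the sample concentrations are made up for illustration and are not the park data.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Count values per bin, mimicking Excel's FREQUENCY.

    Each entry in `bins` is an inclusive upper edge. The returned list has
    one more entry than `bins`; the last entry counts values above the
    highest edge. Non-numeric entries such as "NULL" are skipped.
    """
    edges = sorted(bins)
    counts = [0] * (len(edges) + 1)
    for value in data:
        if not isinstance(value, (int, float)):
            continue  # skip "NULL" and other text
        counts[bisect_left(edges, value)] += 1
    return counts

# Hypothetical concentrations (mg/L), not the Mammoth Cave data.
sample = [0.01, 0.02, 0.03, 0.05, 0.30, "NULL"]
bin_edges = [0.02, 0.04, 0.06]
counts = frequency(sample, bin_edges)
print(counts)  # [2, 1, 1, 1]
```

The last count (values above 0.06) corresponds to the long right tail that the histogram will make visible.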
8
Creating a frequency histogram (cont’d)
Now you want to create your chart. Highlight the bin values in column C and the frequency
values in column D and insert a scatter chart.
Note: There is no automatic process for creating a histogram in Excel without installing the
Analysis ToolPak add-in, so we will force it. Use the following steps.
Step 1: Double click on one of the markers in the chart to open the format data series window.
Step 2: Choose the Error Bars option on the left.
Step 3: Click on the Y-Error Bars tab on the top and choose Minus.
Step 4: Choose the percentage option and set it to 100%.
Step 5: Finally, on the left, choose Marker Style and select no marker. Then click OK.
9
A picture is worth 1000 words
This frequency histogram is a powerful image and can tell you a lot about the data. Just by
looking, you can see where most of the measurements lie and that higher concentrations
sometimes occur.
Don’t forget to label
your axes!
10
Descriptive statistics
Now that we have seen what the data look like in a chart, we need to be able to communicate
what they mean in numbers and words. Descriptive statistics are used all the time, in everything
from test grades to income to how likely you are to get the flu. You are probably already very
familiar with two of them: the median and the average (or arithmetic mean).
Calculate these statistics in your phosphorus spreadsheet.
The median is the value that lies in the middle of the distribution. Half of the data
are greater than the median, and half are less than the median.
The arithmetic mean is the center of mass and is calculated by the equation below. Imagine
the data on a seesaw; the fulcrum must be at the arithmetic mean in order to balance.
The median
=MEDIAN(A2:A180)
The arithmetic mean
=AVERAGE(A2:A180)
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
[Figure: frequency histogram of the phosphorus data. x-axis: Phosphorus Concentration (mg/L); y-axis: Frequency]
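The same two statistics can be sketched in Python with the standard library. This is only an illustration: the five concentrations are hypothetical, chosen so that one large value sits far to the right of the rest.

```python
from statistics import mean, median

# Hypothetical right-skewed concentrations (mg/L), not the park data.
sample = [0.1, 0.2, 0.2, 0.3, 2.2]

sample_median = median(sample)  # middle value of the sorted data
sample_mean = mean(sample)      # sum of the values divided by n

print(sample_median)          # 0.2
print(round(sample_mean, 2))  # 0.6
```

In this made-up sample the single large value pulls the arithmetic mean to three times the median, which is the kind of gap to look for in the phosphorus spreadsheet.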
11
Descriptive statistics (cont’d)
Another statistic that you may be used to using is the standard deviation. This value gives a
distance above or below the arithmetic mean in which, in many cases, most of the data should
fall. The standard deviation can be determined by the following equation.
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Where
$n$ is the number of observations
$x_i$ is the observed concentration
$\bar{x}$ is the arithmetic mean
Calculate the standard deviation in your
phosphorus spreadsheet. Luckily, there is
a built-in Excel command to do this.
=STDEV(A2:A180)
Calculate the lower and the upper bounds of one standard deviation from the arithmetic mean.
You would report these statistics as
0.42 ± 0.55 (mg/L).
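A minimal Python sketch of the same calculation follows. The helper name `one_sd_bounds` is hypothetical; it relies on `statistics.stdev`, which uses the n - 1 denominator shown in the equation. The last lines simply redo the arithmetic with the rounded values reported above.

```python
from statistics import mean, stdev

def one_sd_bounds(data):
    """Return (lower, upper): the mean minus and plus one sample standard deviation."""
    centre = mean(data)
    spread = stdev(data)  # divides by n - 1, as in the equation above
    return centre - spread, centre + spread

# Tiny check on made-up numbers: mean 2, standard deviation 1.
print(one_sd_bounds([1, 2, 3]))  # (1.0, 3.0)

# Bounds from the rounded values reported on the slide: 0.42 ± 0.55 mg/L.
lower = 0.42 - 0.55
upper = 0.42 + 0.55
print(round(lower, 2), round(upper, 2))  # -0.13 0.97
```

Note that the lower bound is negative, even though a concentration cannot be.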
12
The Problem
You may be wondering why the average and median values are so far apart and which one you
should use to describe your data. First let us discuss the median. The median is robust, that is
one or two values in a data set will not change it much, even if they are very large or very small.
However, the arithmetic mean is another story. One very large or very small value could change
it significantly. The standard deviation is also sensitive to high or low values. This sensitivity can
sometimes make the statistics nearly meaningless as descriptors of the central region of the
data set.
To demonstrate this, we can create something like a number line that represents where the
arithmetic mean and one standard deviation tell us most of our data might exist.
[Number line: -0.13 (mean minus one standard deviation), 0.42 (arithmetic mean), 0.97 (mean plus one standard deviation)]
But we know that we cannot observe negative nutrient concentrations. If we ignore the negative
part of that range, the picture becomes an unbalanced seesaw, because the few high
concentration values inflate our standard deviation.
This is because our data are not normally distributed.
13
The normal distribution
The normal distribution describes data that are symmetric about the median and the
arithmetic mean (which are equal). This is the Gaussian curve (or bell curve) that you may have
seen before. When reporting the mean and standard deviation of normally distributed data,
about 68% of the data will be within one standard deviation of the mean, 95% will be within two
standard deviations, and 99.7% will be within three standard deviations of the mean. The
standard deviations for normally distributed data will look like this.
[Figure: Assumed distribution of standard deviation about the mean. Percentage of data within 1, 2, and 3 standard deviations of the arithmetic mean: 68%, 95%, and 99.7%.]
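The 68%, 95%, and 99.7% figures can be reproduced from the normal distribution itself. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `coverage` is an assumption made for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * coverage(k), 1))
# 1 68.3
# 2 95.4
# 3 99.7
```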
14
The normal distribution (cont’d)
The distribution of standard deviations about the mean for our phosphorus data looks like this.
You can see that our distribution looks nothing like the normal distribution on the previous slide,
and it is certainly not symmetric. Our data are skewed.
[Figure: Actual distribution of standard deviation about the mean for the phosphorus data. Percentage of data within 1, 2, and 3 standard deviations of the arithmetic mean: 89%, 89%, and 97.9%.]
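Percentages like these can be counted directly from a data set. The following is a rough sketch with a hypothetical function name and a made-up right-skewed sample, not the park data; it counts the share of values within k standard deviations of the arithmetic mean.

```python
from statistics import mean, stdev

def fraction_within(data, k):
    """Fraction of values within k sample standard deviations of the arithmetic mean."""
    centre = mean(data)
    spread = stdev(data)
    lower, upper = centre - k * spread, centre + k * spread
    inside = [x for x in data if lower <= x <= upper]
    return len(inside) / len(data)

# Hypothetical right-skewed sample (mg/L), not the park data.
sample = [0.1, 0.1, 0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 3.0]
for k in (1, 2, 3):
    print(k, fraction_within(sample, k))
# 1 0.9
# 2 0.9
# 3 1.0
```

As with the phosphorus data, widening the interval from one to two standard deviations captures no additional values, because the inflated standard deviation already covers everything except the extreme tail.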
15
Skewness
Skewness is a statistic that describes the asymmetry of the data. Data that fit a normal
distribution will look like the classic bell curve and will have a skewness of zero. Nutrient data
are very often right-skewed, which means the histogram has a longer tail on the right.
The skewness is calculated by the following equation, which is based on the third moment about the mean.
This equation would be tricky to type into a single Excel cell. Thankfully, Excel has a command
for calculating skewness.
$$\text{skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$
Where
$n$ is the number of observations
$x_i$ is the observed concentration
$\bar{x}$ is the arithmetic mean
$s$ is the standard deviation
16
Skewness (cont’d)
Question 1: What can you say about the distribution of the phosphorus concentration
data just by looking at the histogram?
Calculate the skewness of your phosphorus data set using the built-in Excel function.
A positive skewness value
indicates right-skewed data. A
negative skewness value means
the data are left-skewed.
=SKEW(A2:A180)
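The skewness equation from the previous slide can also be written out step by step. This is a sketch under the assumption that it matches the adjusted formula Excel documents for SKEW; the function name and both samples are hypothetical.

```python
from statistics import mean, stdev

def sample_skewness(data):
    """Adjusted sample skewness: n / ((n-1)(n-2)) times the sum of cubed z-scores."""
    n = len(data)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    centre = mean(data)
    spread = stdev(data)  # sample standard deviation (n - 1 denominator)
    cubed = sum(((x - centre) / spread) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(abs(sample_skewness(symmetric)) < 1e-12)  # True: essentially zero
print(sample_skewness(right_skewed) > 0)        # True: long tail on the right
```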
17
Log-transformation
There must be another way to describe our data, right? Yes.
The geometric mean and multiplicative standard deviation are useful ways of describing
skewed data sets. They are not that different from the average and standard deviation with
which you are already familiar. They are simply performed on log-transformed data.
Using Excel, this becomes a simple process.
Log-transformed data are the logarithms of the original data and can create a more
symmetric histogram. If you’ll remember, the problem with our arithmetic mean and standard
deviation was that they didn’t represent our right-skewed data very well.
Create a column that will calculate the
logarithm (base 10) of each data value
in your phosphorus spreadsheet.
What about those “NULL” cells? If you try
to take the log of those you will have
errors all over the place. Use an Excel
logic function to remedy this.
=IFERROR(LOG(A2),"")
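The same guard can be sketched in Python. The helper `log10_or_none` is hypothetical; it returns `None` where the Excel formula would leave a blank, which covers both "NULL" text and values whose logarithm is undefined (zero or negative).

```python
import math

def log10_or_none(value):
    """Return log10(value), or None when the cell is not a positive number."""
    if isinstance(value, (int, float)) and value > 0:
        return math.log10(value)
    return None  # plays the role of the blank cell in the Excel formula

# Hypothetical column (mg/L) with a skipped measurement, not the park data.
column = [0.01, 0.1, "NULL", 1.0, 10.0]
logged = [log10_or_none(v) for v in column]
print([None if v is None else round(v, 6) for v in logged])
# [-2.0, -1.0, None, 0.0, 1.0]
```

Any later statistic should be computed only on the entries that are not `None`, just as Excel's functions skip blank cells.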
18
Log-transformation (cont’d)
Create a frequency histogram of your log-transformed data.
Calculate the median, arithmetic mean, standard deviation, and skewness.
[Figure: frequency histogram of the log-transformed phosphorus data, with the median and the arithmetic mean labeled. x-axis: Log10 of Phosphorus Concentration (mg/L); y-axis: Frequency]
We can see that our histogram of the log-transformed data is more symmetric than the
previous one and our median and mean values are closer together and located more toward
the center of the data. Our lower skewness value confirms this. These numbers can now be
used to calculate more descriptive statistics.
19
More descriptive statistics
Now what? We have a more symmetric histogram and some statistics, but what can we do with
these numbers? How do we make them mean something?
In order to make these values make more sense we need to perform back-transformation, that
is, undo our transformation. Since we took the log10 of the data, we need to raise 10 to the
power of the arithmetic mean and of the standard deviation of the transformed data. The results
are the geometric mean and the multiplicative standard deviation of the data. Excel has a
command to calculate the geometric mean. The equations for the geometric mean and
multiplicative standard deviation are given on slide 24.
Calculate the geometric mean with back-transformation and the Excel command. Confirm that
they are equal. Then calculate the multiplicative standard deviation using back-transformation.
=GEOMEAN(A2:A180)
=10^J23
=10^H23
You would report these statistics as
0.23 ×⁄ 2.84 (mg/L).
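The back-transformation can be checked in Python as well. This sketch assumes strictly positive data and uses a hypothetical helper, `geometric_stats`; the comparison is against the standard library's `statistics.geometric_mean`, which plays the role of GEOMEAN here.

```python
import math
from statistics import geometric_mean, mean, stdev

def geometric_stats(data):
    """Return (geometric mean, multiplicative standard deviation) via log10."""
    logs = [math.log10(x) for x in data]
    geo_mean = 10 ** mean(logs)  # back-transformed mean of the logs
    mult_sd = 10 ** stdev(logs)  # back-transformed standard deviation of the logs
    return geo_mean, mult_sd

# Hypothetical positive concentrations (mg/L), not the park data.
sample = [0.1, 0.2, 0.4, 0.8]
geo_mean, mult_sd = geometric_stats(sample)
print(math.isclose(geo_mean, geometric_mean(sample)))  # True
print(mult_sd > 1)                                     # True
```

The multiplicative standard deviation is a ratio, so it is always at least 1 and is applied by multiplying and dividing rather than adding and subtracting.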
20
More descriptive statistics (cont’d)
One thing we must remember when
reporting the geometric mean and
multiplicative standard deviation is the
operator, ×⁄ rather than ±.
This means we will divide the geometric
mean by the multiplicative standard
deviation to determine the lower bound,
and multiply to determine the upper bound.
Calculate the lower and the upper bounds of one multiplicative standard deviation from the
geometric mean.
This gives us an asymmetric bracket around the data that lie within one multiplicative standard
deviation of the geometric mean. An asymmetric bracket for asymmetric data: that makes
sense! Notice we do not venture into negative values. We have balanced our seesaw!
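[Number line: 0.08 (geometric mean divided by the multiplicative standard deviation), 0.23 (geometric mean), 0.67 (geometric mean multiplied by the multiplicative standard deviation)]
=G33/I33
=G33*I33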
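The divide-and-multiply rule can be sketched as a small Python function. The name `multiplicative_bounds` is hypothetical, and the inputs are the rounded values reported on the previous slide. With those rounded inputs the upper bound comes out as 0.65 rather than 0.67, presumably because the spreadsheet works from unrounded values.

```python
def multiplicative_bounds(geo_mean, mult_sd):
    """Return (lower, upper): the geometric mean divided and multiplied by the multiplicative SD."""
    if geo_mean <= 0 or mult_sd < 1:
        raise ValueError("expects a positive geometric mean and a multiplicative SD of at least 1")
    return geo_mean / mult_sd, geo_mean * mult_sd

# Rounded values reported on the slide: 0.23 x/ 2.84 mg/L.
lower, upper = multiplicative_bounds(0.23, 2.84)
print(round(lower, 2), round(upper, 2))  # 0.08 0.65
```

Both bounds stay positive for any positive geometric mean, which is why this interval never reaches into impossible negative concentrations.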
Don’t throw the baby out with the
bathwater! The arithmetic mean is not
meaningless. We use it to calculate the
total load of the nutrient when we are
given a volume of water.
21
A better fit
If you will recall the images from slide 14 and slide 15, we showed what the standard deviations
should look like and what they actually look like for our distribution. Below is what they look like
by using the geometric mean and the multiplicative standard deviation. While it may not be
completely symmetric, it is a much better fit to the assumed distribution.
[Figure: Actual distribution of multiplicative standard deviation about the geometric mean. Percentage of data within 1, 2, and 3 multiplicative standard deviations of the geometric mean: 70%, 96%, and 100%.]
22
One more thing
If our data were lognormal, that is, the logarithms of the data are normally distributed, the chart
on the previous slide would be exactly the same as our distribution chart on slide 14. The
arithmetic mean and median of the log-transformed data would be equal, the skewness of the
log-transformed data would be zero, and the histogram of the log-transformed data would look
exactly like the classic bell curve (shown below in blue).
[Figure: frequency histogram of the log-transformed phosphorus data with the classic bell curve overlaid in blue. x-axis: Log10 of Phosphorus Concentration (mg/L); y-axis: Frequency]
23
Geometric mean and multiplicative standard deviation
The geometric mean can be calculated from
the original data by the following equation.
$$\bar{x}^* = \sqrt[n]{\prod_{i=1}^{n} x_i}$$
Where
$n$ is the number of observations
$x_i$ is the observed concentration
The multiplicative standard deviation can
be calculated with the following equation.
$$s^* = 10^{\sqrt{\frac{\sum_{i=1}^{n} \left( \log_{10} x_i - \log_{10} \bar{x}^* \right)^2}{n - 1}}}$$
Where
$n$ is the number of observations
$x_i$ is the observed concentration
$\bar{x}^*$ is the geometric mean
With Excel it is often much simpler to transform the data first and then calculate the geometric
mean and multiplicative standard deviation. The geometric mean is simply the back-transformation
of the average of the log-transformed data. The multiplicative standard deviation is the
back-transformation of the standard deviation of the log-transformed data. That is a mouthful. If
you use log10 to transform your data, then you will raise 10 to the power of the arithmetic
mean of the transformed data to find the geometric mean, and 10 to the power of the
standard deviation of the transformed data to find the multiplicative standard deviation.
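Why the two routes to the geometric mean agree can be seen in one line. The short derivation below uses $\bar{x}^*$ for the geometric mean (notation assumed for this illustration) and the rule that the logarithm of a product is the sum of the logarithms.

```latex
\log_{10} \bar{x}^*
  = \log_{10} \left( \prod_{i=1}^{n} x_i \right)^{1/n}
  = \frac{1}{n} \sum_{i=1}^{n} \log_{10} x_i
\quad \Longrightarrow \quad
\bar{x}^* = 10^{\frac{1}{n} \sum_{i=1}^{n} \log_{10} x_i}
```

The exponent on the right is the arithmetic mean of the log-transformed data, so raising 10 to that mean recovers the geometric mean.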
24
End-of-module assignment
1. Calculate the mean, median, and standard deviation for the nitrogen-nitrate data set.
2. Calculate the range of one standard deviation about the mean.
3. Create a frequency histogram for the nitrogen-nitrate data set.
[Figure: frequency histogram of the nitrogen-nitrate data. x-axis: Nitrogen-Nitrate Concentration (mg/L); y-axis: Frequency]
25
End-of-module assignment (cont’d)
1. Transform the nitrogen-nitrate data using the log10.
2. Calculate the mean, median, and standard deviation for the transformed data.
3. Calculate the geometric mean and multiplicative standard deviation of the nitrogen-nitrate
data set.
4. Calculate the range of one multiplicative standard deviation about the geometric mean.
5. Create a frequency histogram for the log10 of the nitrogen-nitrate data set.
[Figure: frequency histogram of the log-transformed nitrogen-nitrate data. x-axis: Log10 of Nitrogen-Nitrate Concentration (mg/L); y-axis: Frequency]
26
End-of-module assignment (cont’d)
1. Describe the characteristics of skewed data sets.
2. Briefly discuss the difference between the arithmetic mean and the geometric mean.
3. Describe the benefits of the geometric mean in skewed data.
4. Why should we care about how water quality data are summarized?
[Photos: Blue Spring, Green River]
27
References
Slides 2, 3 & 4 – images from NPS
Slide 5 – images from Amie O. West
sources: NPS, U.S. Geological Survey
Slide 6 – image and source: NPS Hydrographic and Impairment Statistics Database
Slide 7 – data from Cumberland Piedmont Network
Slide 16 – image from Bowman’s Website
Slide 27 – images from NPS
28