Application of Statistical
Techniques to Interpretation
of Water Monitoring Data
Eric Smith, Golde Holtzman,
and Carl Zipper
Outline
I. Water quality data: program design (CEZ, 15 min)
II. Characteristics of water-quality data (CEZ, 15 min)
III. Describing water quality (GIH, 30 min)
IV. Data analysis for making decisions
A. Compliance with numerical standards (EPS, 45 min)
Dinner Break
B. Locational / temporal comparisons (“cause and effect”) (EPS, 45 min)
C. Detection of water-quality trends (GIH, 60 min)
III. Describing water quality
(GIH, 30 min)
• Rivers and streams are an essential component of
the biosphere
• Rivers are alive
• Life is characterized by variation
• Statistics is the science of variation
• Statistical Thinking/Statistical Perspective
• Thinking in terms of variation
• Thinking in terms of distribution
The present problem is multivariate
• WATER QUALITY as a function of
• TIME, under the influence of co-variates like
• FLOW, at multiple
• LOCATIONS
[Figure: WQ variable versus time. Axes: Water Variable, Time in Years]
[Figure: Bear Creek below Town of Wise STP. pH (6.5 to 9) versus DATE, 1973/12/14 to 1993/12/14]
[Figure: Univariate WQ Variable. Axes: Water Quality, Time]
[Figure: Univariate Perspective, Real Data (pH below STP). pH axis from 6.5 to 9]
The three most important pieces of
information in a sample:
• Central Location
– Mean, Median, Mode
• Dispersion
– Range, Standard Deviation, Inter-Quartile Range
• Shape
– Symmetry, skewness, kurtosis
Central Location: Sample Mean
• (Sum of all observations) / (sample size)
• Center of gravity of the distribution
• depends on each observation
• therefore sensitive to outliers
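As a minimal sketch of the definition above, the following Python computes a sample mean, checks the center-of-gravity property (deviations from the mean sum to zero), and shows how one extreme value shifts the result. The pH readings and the name `sample_mean` are hypothetical, chosen only for illustration.

```python
def sample_mean(values):
    """(Sum of all observations) / (sample size)."""
    return sum(values) / len(values)

# Hypothetical pH readings, for illustration only (not the monitoring data).
readings = [7.2, 7.4, 7.6, 7.8, 8.0]
mean = sample_mean(readings)

# Center of gravity: deviations from the mean balance out (sum to about 0).
deviation_sum = sum(y - mean for y in readings)

# Every observation enters the sum, so one extreme value shifts the mean.
with_outlier = readings[:-1] + [12.0]
shifted_mean = sample_mean(with_outlier)

print(round(mean, 2), round(deviation_sum, 10), round(shifted_mean, 2))
```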
Central Location: Sample Median
• Center of the ordered array
• I.e., the (½)(n + 1)th observation in the ordered array.
– If sample size n is odd, then the median is the middle value in the ordered array.
– If sample size n is even, then the median is the average of the two middle values in the ordered array.
Example A: 1, 1, 0, 2, 3
Order: 0, 1, 1, 2, 3
n = 5, odd
(½)(n + 1) = 3
Median = 1
Example B: 1, 1, 0, 2, 3, 6
Order: 0, 1, 1, 2, 3, 6
n = 6, even
(½)(n + 1) = 3.5
Median = (1 + 2)/2 = 1.5
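The odd/even rule can be sketched in a few lines of Python and checked against Examples A and B. The function name `sample_median` is an illustrative choice, not something from the talk.

```python
def sample_median(values):
    """Center of the ordered array, i.e. the (n + 1)/2-th ordered observation."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: (n + 1)/2 is a whole number, so take the middle value.
        return ordered[mid]
    # Even n: (n + 1)/2 falls halfway between two ranks, so average them.
    return (ordered[mid - 1] + ordered[mid]) / 2

example_a = [1, 1, 0, 2, 3]
example_b = [1, 1, 0, 2, 3, 6]
print(sample_median(example_a), sample_median(example_b))
```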
Central Location: Sample Median
• Center of the ordered array
• depends on the magnitude of the central
observations only
• therefore NOT sensitive to outliers
Central Location: Mean vs. Median
• Mean is influenced by outliers
• Median is robust against (resistant to) outliers
• Mean “moves” toward outliers
• Median represents bulk of observations almost always
Comparison of mean and median tells us about outliers
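A small illustration of this contrast, using Python's standard `statistics` module: the first sample is the ordered array 0, 1, 1, 2, 3, and the second is a hypothetical variant in which the largest value has been replaced by an outlier of 30.

```python
import statistics

clean = [0, 1, 1, 2, 3]
contaminated = [0, 1, 1, 2, 30]   # hypothetical: largest value replaced by an outlier

for data in (clean, contaminated):
    # The mean moves toward the outlier; the median stays with the bulk of the data.
    print(statistics.mean(data), statistics.median(data))
```

A large gap between the two statistics, as in the second sample, is a cue to look for outliers.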
Dispersion
• Range
• Standard Deviation
• Inter-quartile Range
Dispersion: Range
• Maximum - Minimum
• Easy to calculate
• Easy to interpret
• Depends on sample size (biased)
• Therefore not good for statistical inference
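The dependence on sample size can be seen by tracking the range as observations accumulate: each new observation can widen the range but never narrow it. The sketch below uses hypothetical pH readings and an illustrative helper, `sample_range`.

```python
def sample_range(values):
    """Maximum - Minimum."""
    return max(values) - min(values)

# Hypothetical pH readings in the order they were collected.
readings = [7.6, 7.4, 7.9, 7.1, 8.3, 7.5, 6.8, 7.7]

# Range of the first k readings, for k = 1 .. n.
running = [sample_range(readings[:k]) for k in range(1, len(readings) + 1)]
print([round(r, 1) for r in running])
```

Because a small sample tends to miss the extremes, its range tends to understate the spread that a larger sample would show.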
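Dispersion: Standard Deviation
$$ SD = \sqrt{\frac{\sum (Y - \bar{Y})^2}{n - 1}} $$
[Figure: two example samples on a number line with deviations from the mean marked. Deviations of −1 and +1 give SD = 1; deviations of −2 and +2 give SD = 2]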
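The standard deviation with the n − 1 divisor can be sketched as follows. The two samples are assumptions: they are not recoverable with certainty from the slide, and were chosen here because they have deviations of ±1 and ±2 about the mean and reproduce SD = 1 and SD = 2.

```python
import math

def sample_sd(values):
    """Square root of the sum of squared deviations divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((y - mean) ** 2 for y in values) / (n - 1))

# Assumed illustrative samples (deviations -1, 0, +1 and -2, 0, +2).
narrow = [0, 1, 2]
wide = [1, 3, 5]
print(sample_sd(narrow), sample_sd(wide))
```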
Dispersion: Properties of SD
• SD ≥ 0 for all data
• SD = 0 if and only if all observations are the same (no variation)
• For a normal distribution,
– 68% expected within ± 1 SD of the mean,
– 95% expected within ± 2 SD,
– 99.7% expected within ± 3 SD
• For any distribution, most observations lie within ± 3 SD of the mean: at least 8/9 (about 89%) by Chebyshev's inequality
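The normal-distribution percentages can be checked numerically with the error function; the three-SD coverage evaluates to about 99.7%. For distributions in general, Chebyshev's inequality gives a guaranteed lower bound on the fraction within k SD. The function names below are illustrative.

```python
import math

def normal_coverage(k):
    """Probability that a normal observation lies within k SD of its mean."""
    return math.erf(k / math.sqrt(2))

def chebyshev_bound(k):
    """Lower bound on the fraction within k SD, for any distribution (useful for k > 1)."""
    return 1 - 1 / k ** 2

for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))
print(round(chebyshev_bound(3), 4))
```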
Interpretation of SD
[Figure: distribution of pH, axis from 6.5 to 9, annotated with n = 200, Mean = 7.6, Median = 7.6, SD = 0.41]
Quantiles, Five Number Summary, Boxplot
Maximum   4th quartile   100th percentile   1.00 quantile
          3rd quartile   75th percentile    0.75 quantile
Median    2nd quartile   50th percentile    0.50 quantile
          1st quartile   25th percentile    0.25 quantile
Minimum   0th quartile   0th percentile     0.00 quantile
Quantile Location and Quantiles
Example:
0, −3.1, −0.4, 0, 2.2, 5.1, 3.8, 3.8, 3.9, 2.3
n = 10
Value   Rank
5.1     10
3.9     9
3.8     8
3.8     7
2.3     6
2.2     5
0       4
0       3
−0.4    2
−3.1    1
Quantile      Quantile Location            Quartile
0.75 = 3/4    0.75(n + 1) = 8.25 ≈ 8.3     Q3 = (3.8 + 3.9)/2 = 3.85
0.50 = 2/4    0.50(n + 1) = 5.5            Q2 = (2.2 + 2.3)/2 = 2.25
0.25 = 1/4    0.25(n + 1) = 2.75 ≈ 2.8     Q1 = (−0.4 + 0)/2 = −0.2
Minimum = −3.1
5-Number Summary and Boxplot
Min      Q1       Q2      Q3      Max
−3.10    −0.20    2.25    3.85    5.10
Median = Q2 = 2.25
IQR = Q3 − Q1 = 3.85 − (−0.20) = 4.05
Range = Max − Min = 5.10 − (−3.10) = 8.20
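The worked example can be reproduced with a short sketch of the q(n + 1) location rule, in which a fractional location is resolved by averaging the two neighbouring ranks. Clamping the location to the range 1..n, so that q = 0 and q = 1 return the minimum and maximum, is an assumption made here, and the function name `quantile` is illustrative. Statistical packages often interpolate between ranks instead, so their quartiles may differ slightly.

```python
import math

def quantile(values, q):
    """Quantile by the q(n + 1) location rule, averaging the neighbouring ranks."""
    ordered = sorted(values)
    n = len(ordered)
    # Quantile location on the 1-based rank scale, clamped to [1, n] (assumption)
    # so that q = 0 gives the minimum and q = 1 gives the maximum.
    location = min(max(q * (n + 1), 1), n)
    lower = ordered[math.floor(location) - 1]
    upper = ordered[math.ceil(location) - 1]
    # A fractional location falls between two ranks: average them.
    return (lower + upper) / 2

data = [0, -3.1, -0.4, 0, 2.2, 5.1, 3.8, 3.8, 3.9, 2.3]
summary = [quantile(data, q) for q in (0.0, 0.25, 0.50, 0.75, 1.0)]
minimum, q1, q2, q3, maximum = summary
iqr = q3 - q1
data_range = maximum - minimum
print([round(v, 2) for v in summary], round(iqr, 2), round(data_range, 2))
```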
Dispersion: IQR
Inter-Quartile Range
• (3rd Quartile) - (1st Quartile)
• Robust against outliers
Interpretation of IQR
[Figure: distribution of pH, axis from 6.5 to 9, annotated with n = 200, Mean = 7.6, Median = 7.6, SD = 0.41, IQR = 0.54]
For a Normal distribution, Median ± 2 IQR includes 99.3% of observations
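The 99.3% figure can be checked for a standard normal distribution with `statistics.NormalDist` (Python 3.8 or later). This is only a numerical check of the normal-theory statement, not an analysis of the pH data.

```python
from statistics import NormalDist

z = NormalDist()                           # standard normal: mean 0, SD 1
iqr = z.inv_cdf(0.75) - z.inv_cdf(0.25)    # IQR in units of SD
coverage = z.cdf(2 * iqr) - z.cdf(-2 * iqr)
print(round(iqr, 3), round(coverage, 4))
```

Under normality the IQR is about 1.349 SD, so an SD of 0.41 would imply an IQR near 0.55, close to the 0.54 observed for these data.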
Shape: Symmetry and Skewness
• “Symmetry” means bilateral symmetry, skewness = 0
• Mean = Median (approximately)
• Positive Skewness (asymmetric “tail” in positive direction)
• Mean > Median (typically)
• Negative Skewness (asymmetric “tail” in negative direction)
• Mean < Median (typically)
Comparison of mean and median tells us about shape
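The link between the sign of the skewness and the mean/median comparison can be illustrated with a small sketch. The talk does not give a skewness formula; the moment coefficient used below (mean cubed deviation divided by the cubed population SD) is one common choice and is an assumption here, as are the three hypothetical samples.

```python
import statistics

def sample_skewness(values):
    """Moment coefficient of skewness: mean cubed deviation / (population SD)**3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((y - mean) ** 2 for y in values) / n
    m3 = sum((y - mean) ** 3 for y in values) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 1, 2, 2, 3, 10]           # hypothetical, tail in positive direction
left_tailed = [-y for y in right_tailed]     # mirror image, tail in negative direction
symmetric = [1, 2, 3, 4, 5]

for data in (right_tailed, left_tailed, symmetric):
    print(round(sample_skewness(data), 2),
          round(statistics.mean(data), 2),
          statistics.median(data))
```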
[Figure: Bear Creek below Town of Wise STP. pH axis from 6.5 to 9]
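Outlier Box Plot
[Figure: outlier box plot, axis from 6.5 to 9, with labeled parts from top to bottom: Outliers; Whisker; 75th %-tile = 3rd Quartile; Median; 25th %-tile = 1st Quartile; Whisker. The height of the box is the IQR]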
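The slide does not state the rule that separates whiskers from outliers. A common convention, assumed here, places fences 1.5 IQR beyond the quartiles and flags anything outside them. The sketch below applies that convention to the worked quantile example (Q1 = −0.20, Q3 = 3.85); the function name and the multiplier `k` are illustrative.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k IQRs beyond the quartiles (k = 1.5 assumed)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles from the worked quantile example earlier in the talk.
q1, q3 = -0.20, 3.85
lower, upper = outlier_fences(q1, q3)

data = [0, -3.1, -0.4, 0, 2.2, 5.1, 3.8, 3.8, 3.9, 2.3]
outliers = [y for y in data if y < lower or y > upper]
# Whiskers reach to the most extreme observations that are not outliers.
inside = [y for y in data if lower <= y <= upper]
whiskers = (min(inside), max(inside))
print(round(lower, 3), round(upper, 3), outliers, whiskers)
```

With these fences no observation in the example is flagged, so the whiskers run to the minimum and maximum.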
[Figure: Wise, VA, below STP. Plots of pH and TKN (mg/l)]
[Figure: Wise, VA below STP. Plots of DO (% satur) and BOD (mg/l)]
[Figure: Wise, VA below STP. Plots of Tot Phosphorus (mg/l) and Fecal Coliforms]