Application of Statistical Techniques to Interpretation of Water Monitoring Data
Eric Smith, Golde Holtzman, and Carl Zipper

Outline
I. Water quality data: program design (CEZ, 15 min)
II. Characteristics of water-quality data (CEZ, 15 min)
III. Describing water quality (GIH, 30 min)
IV. Data analysis for making decisions
   A. Compliance with numerical standards (EPS, 45 min)
   Dinner Break
   B. Locational / temporal comparisons (“cause and effect”) (EPS, 45 min)
   C. Detection of water-quality trends (GIH, 60 min)

III. Describing water quality (GIH, 30 min)
• Rivers and streams are an essential component of the biosphere
• Rivers are alive
• Life is characterized by variation
• Statistics is the science of variation
• Statistical Thinking/Statistical Perspective
   – Thinking in terms of variation
   – Thinking in terms of distribution

The present problem is multivariate
• WATER QUALITY as a function of
• TIME, under the influence of co-variates like
• FLOW, at multiple
• LOCATIONS

[Figure: WQ variable versus time (time in years)]

[Figure: Bear Creek below Town of Wise STP. pH (6.5 to 9) versus date, 1973/12/14 to 1993/12/14]

[Figure: Univariate perspective. The WQ variable is viewed as a distribution, without the time axis]

[Figure: Univariate Perspective, Real Data (pH below STP). pH axis 6.5 to 9]

The three most important pieces of information in a sample:
• Central Location – Mean, Median, Mode
• Dispersion – Range, Standard Deviation, Inter-Quartile Range
• Shape – Symmetry, skewness, kurtosis

Central Location: Sample Mean
• (Sum of all observations) / (sample size)
• Center of gravity of the distribution
• Depends on each observation
• Therefore sensitive to outliers

Central Location: Sample Median
• Center of the ordered array
• I.e., the (½)(n + 1)th observation in the ordered array
• If sample size n is odd, then the median is the middle value in the ordered array.
• If sample size n is even, then the median is the average of the two middle values in the ordered array.

Example A: 1, 1, 0, 2, 3
   Order: 0, 1, 1, 2, 3
   n = 5, odd
   (½)(n + 1) = 3
   Median = 1

Example B: 1, 1, 0, 2, 3, 6
   Order: 0, 1, 1, 2, 3, 6
   n = 6, even
   (½)(n + 1) = 3.5
   Median = (1 + 2)/2 = 1.5

Central Location: Sample Median
• Center of the ordered array
• Depends on the magnitude of the central observations only
• Therefore NOT sensitive to outliers

Central Location: Mean vs. Median
• Mean is influenced by outliers
• Median is robust against (resistant to) outliers
• Mean “moves” toward outliers
• Median represents bulk of observations almost always
Comparison of mean and median tells us about outliers

Dispersion
• Range
• Standard Deviation
• Inter-Quartile Range

Dispersion: Range
• Maximum − Minimum
• Easy to calculate
• Easy to interpret
• Depends on sample size (biased)
• Therefore not good for statistical inference

Dispersion: Standard Deviation
SD = \sqrt{ \sum (Y - \bar{Y})^2 / (n - 1) }
• Observations 0, 1, 2: deviations from the mean are −1, 0, +1, so SD = 1
• Observations 1, 3, 5: deviations from the mean are −2, 0, +2, so SD = 2

Dispersion: Properties of SD
• SD ≥ 0 for all data
• SD = 0 if and only if all observations are the same (no variation)
• For a normal distribution,
   – 68% expected within 1 SD of the mean,
   – 95% expected within 2 SD,
   – 99.7% expected within 3 SD
• For any distribution, nearly all observations lie within 3 SD of the mean (at least 8/9, about 89%, by Chebyshev's inequality)

Interpretation of SD
n = 200, Mean = 7.6, Median = 7.6, SD = 0.41
[Figure: distribution of pH, axis 6.5 to 9]

Quantiles, Five Number Summary, Boxplot
   Maximum   4th quartile   100th percentile   1.00 quantile
             3rd quartile    75th percentile   0.75 quantile
   Median    2nd quartile    50th percentile   0.50 quantile
             1st quartile    25th percentile   0.25 quantile
   Minimum   0th quartile     0th percentile   0.00 quantile

Quantile Location and Quantiles
Example: 0, −3.1, −0.4, 0, 2.2, 5.1, 3.8, 3.8, 3.9, 2.3; n = 10

   Value   Rank
    5.1     10
    3.9      9
    3.8      8
    3.8      7
    2.3      6
    2.2      5
    0        4
    0        3
   −0.4      2
   −3.1      1

   Quantile      Quantile Location          Quartile
   0.75 = 3/4    0.75(n + 1) = 8.25 ≈ 8.3   Q3 = (3.8 + 3.9)/2 = 3.85
   0.50 = 2/4    0.50(n + 1) = 5.5          Q2 = (2.2 + 2.3)/2 = 2.25
   0.25 = 1/4    0.25(n + 1) = 2.75 ≈ 2.8   Q1 = (−0.4 + 0)/2 = −0.2
   Minimum = −3.1

5-Number Summary and Boxplot
   Min      Q1       Q2      Q3      Max
   −3.10    −0.20    2.25    3.85    5.10

   Median = Q2 = 2.25
   IQR = Q3 − Q1 = 3.85 − (−0.20) = 4.05
   Range = Max − Min = 5.10 − (−3.10) = 8.20

Dispersion: IQR (Inter-Quartile Range)
• (3rd Quartile) − (1st Quartile)
• Robust against outliers

Interpretation of IQR
n = 200, Mean = 7.6, Median = 7.6, SD = 0.41, IQR = 0.54
[Figure: distribution of pH, axis 6.5 to 9]
For a Normal distribution, Median ± 2 IQR includes 99.3% of observations

Shape: Symmetry and Skewness
• “Symmetry” means bilateral symmetry, skewness = 0
   – Mean = Median (approximately)
• Positive skewness (asymmetric “tail” in positive direction)
   – Mean > Median
• Negative skewness (asymmetric “tail” in negative direction)
   – Mean < Median
Comparison of mean and median tells us about shape

[Figure: Bear Creek below Town of Wise STP. Distribution of pH, axis 6.5 to 9]

Outlier Box Plot
[Figure: box plot of pH (axis 6.5 to 9) with labeled parts, top to bottom: Outliers; Whisker; 75th %-tile = 3rd Quartile; Median; 25th %-tile = 1st Quartile; Whisker. The box from 1st to 3rd quartile spans the IQR]

[Figure: Wise, VA, below STP. Box plots of pH and TKN (mg/l)]

[Figure: Wise, VA, below STP. Box plots of DO (% saturation) and BOD (mg/l)]

[Figure: Wise, VA, below STP. Box plots of Tot Phosphorus (mg/l) and Fecal Coliforms]
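
The worked examples in these slides (median, standard deviation, quantile location, five-number summary, IQR, range) can be checked numerically. The following is a minimal Python sketch, not the presenters' own code. The function names `quantile` and `five_number_summary` are hypothetical helpers introduced here for illustration. The sketch assumes the rule the quantile example appears to use: locate the quantile at rank q(n + 1) in the ordered array and, when that rank is not a whole number, average the two neighboring ordered values.

```python
import math
import statistics


def quantile(values, q):
    """Quantile using the q(n + 1) location rule (assumed from the slide example).

    If the location falls between two ranks, the two neighboring
    ordered values are averaged. Locations outside 1..n are clamped,
    so q = 0 gives the minimum and q = 1 gives the maximum.
    """
    ordered = sorted(values)
    n = len(ordered)
    location = q * (n + 1)                        # 1-based rank, may be fractional
    lower = min(max(math.floor(location), 1), n)  # rank at or below the location
    upper = min(max(math.ceil(location), 1), n)   # rank at or above the location
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


def five_number_summary(values):
    """Return (Min, Q1, Q2, Q3, Max) using the rule above."""
    return tuple(quantile(values, q) for q in (0.0, 0.25, 0.50, 0.75, 1.0))


# Data from the quantile-location example (n = 10).
data = [0, -3.1, -0.4, 0, 2.2, 5.1, 3.8, 3.8, 3.9, 2.3]
minimum, q1, q2, q3, maximum = five_number_summary(data)
iqr = q3 - q1                   # robust measure of dispersion
value_range = maximum - minimum

# Median examples A (odd n) and B (even n).
median_a = statistics.median([1, 1, 0, 2, 3])
median_b = statistics.median([1, 1, 0, 2, 3, 6])

# Standard deviation examples (sample SD, divisor n - 1).
sd_small = statistics.stdev([0, 1, 2])
sd_large = statistics.stdev([1, 3, 5])

print(f"Min={minimum:.2f} Q1={q1:.2f} Q2={q2:.2f} Q3={q3:.2f} Max={maximum:.2f}")
print(f"IQR={iqr:.2f} Range={value_range:.2f}")
print(f"Median A={median_a} Median B={median_b}")
print(f"SD(0,1,2)={sd_small} SD(1,3,5)={sd_large}")
```

Statistical packages use several different quantile definitions, so quartiles computed by other software may differ slightly from the values in the slide example.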