Concepts in Probability, Statistics and Stochastic Modeling
• Loucks et al., 2005, Chapter 7

Learning Objective
• Be able to use probability and statistics to quantify uncertainty and natural variability in physical quantities

How to Express a Distribution
• Cumulative distribution
• Probability density
• Probability plot
• Equation
Which method conveys the information best to you?

[Figure: Carl Friedrich Gauß, immortalized]

A random variable X is a variable whose outcomes (values) are governed by the laws of chance.

Probability density function
[Figure: f(x) versus x, with x_1 and x_2 marked]
$P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f(x)\,dx$

Cumulative distribution function
[Figure: F(x) versus x]
$F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du$, so that $f(x) = \frac{dF}{dx}$

Continuous and Discrete Random Variables
[Figure] From: Loucks, D. P., E. van Beek, J. R. Stedinger, J. P. M. Dijkman and M. T. Villars, (2005), Water Resources Systems Planning and Management: An Introduction to Methods, Models and Applications, UNESCO, Paris, 676 p, http://hdl.handle.net/1813/2804

Generating a random variable from a given distribution
[Figure: a value U on the F(u) axis of the uniform CDF is mapped through F(x) to a value X on the x axis]
1. Generate U from a uniform distribution between 0 and 1
2. Solve for $X = F^{-1}(U)$
$F^{-1}(U)$ is randomly distributed with CDF F(x).
Basis: $P(X < x) = P(F^{-1}(U) < x) = P(U < F(x)) = F(x)$

Generating a Pseudo random number
• There is a lot of lore about this. Refer to: Press, W. H., B. P. Flannery, S. A. Teukolsky and W. T. Vetterling, (1988), Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, New York, 735 p.
• Congruential method: $r_{next} = \text{remainder of } (a\,r_{prev} + c)/m$, i.e. $r_{next} = (a\,r_{prev} + c) \bmod m$
• Each r is an integer random number between 0 and m-1. Dividing by (m-1) gives a number between 0 and 1. The sequence repeats after at most m numbers. Numerical Recipes gives "good" choices for a, c and m.
• R has built-in functions to generate random numbers: runif for uniform random numbers, and others for other distributions, e.g. rnorm, rgamma.
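The congruential generator and the two inverse-transform steps can be sketched in Python (used here only for illustration; the slides themselves point to R's runif). This is a minimal sketch under assumptions that are not in the slides: the constants a = 1664525, c = 1013904223 and m = 2^32 are one commonly cited choice, the exponential distribution is picked because its inverse CDF has a closed form, and all function names are hypothetical. The integers are divided by m rather than m-1 so that U stays strictly below 1 and the logarithm is always defined.

```python
import math


def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Yield integers in [0, m-1] using r_next = (a * r_prev + c) mod m.

    The default constants are an assumed, commonly cited choice.
    """
    r = seed
    while True:
        r = (a * r + c) % m
        yield r


def uniform_stream(seed, m=2**32):
    """Scale the congruential integers to U in [0, 1) by dividing by m."""
    for r in lcg(seed, m=m):
        yield r / m


def exponential_inverse_cdf(u, rate):
    """Inverse of F(x) = 1 - exp(-rate * x), i.e. X = F^{-1}(U)."""
    return -math.log(1.0 - u) / rate


def sample_exponential(n, rate, seed=12345):
    """Step 1: draw U ~ Uniform(0, 1). Step 2: return X = F^{-1}(U)."""
    us = uniform_stream(seed)
    return [exponential_inverse_cdf(next(us), rate) for _ in range(n)]


samples = sample_exponential(20000, rate=0.5)
sample_mean = sum(samples) / len(samples)
print("sample mean:", sample_mean, "theoretical mean:", 1 / 0.5)
```

The sample mean should land close to the theoretical mean 1/rate. Any distribution with an invertible CDF can be substituted for `exponential_inverse_cdf`.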
Moments of Random Variables
• Mean: population $\mu = \int x f(x)\,dx$; sample $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$
• Expectation: population $E(X) = \int x f(x)\,dx = \mu$; sample $\hat{E}(X) = \frac{1}{N}\sum_{i=1}^{N} X_i$
• Expectation operator: population $E(g(X)) = \int g(x) f(x)\,dx$; sample $\hat{E}(g(X)) = \frac{1}{N}\sum_{i=1}^{N} g(X_i)$
• Variance: population $\sigma^2 = \int (x-\mu)^2 f(x)\,dx = E([X - E(X)]^2)$; sample $S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (X_i - \bar{X})^2$
• Skewness: population $\gamma = \frac{1}{\sigma^3}\int (x-\mu)^3 f(x)\,dx = E([X - E(X)]^3)/\sigma^3$; sample $\hat{\gamma} = \frac{\frac{1}{N}\sum_{i=1}^{N} (X_i - \bar{X})^3}{S^3}$

L-Moments
$\lambda_2 = \frac{1}{2} E[X_{(2|2)} - X_{(1|2)}]$, where $X_{(i|n)}$ is the i-th smallest value in a sample of size n.
[Slides: Probability weighted moments; L-moment estimators; L-Moment Diagrams. Figures from Loucks et al. (2005) and from Salas, J. D., J. W. Delleur, V. Yevjevich and W. L. Lane, (1980), Applied Modeling of Hydrologic Time Series, Water Resources Publications, Littleton, Colorado, 484 p.]

Fitting a probability distribution to data
Hillsborough River at Zephyr Hills, September flows: $\bar{x}$ = 8621 mgal, S = 8194 mgal, n = 31
[Figure: histogram of the flows, density versus flow in mgal, 0 to 35000]

Method of Moments
• Use the sample moments as the estimates of the population moments: $\hat{E}(X) = \bar{x}$; $\widehat{Var}(X) = S^2$

Method of Moments: Gamma distribution
$f(x) = \frac{\beta^{\alpha} x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}$
$\hat{\alpha} = \bar{x}^2 / S^2 = (8621/8194)^2 \approx 1.1$
$\hat{\beta} = \bar{x} / S^2 = 8621/8194^2 \approx 1.3 \times 10^{-4}$ per mgal
[Figure: histogram with the fitted gamma density]

Method of Moments: Log-Normal distribution
$f(x) = \frac{1}{x\,\sigma_y \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln(x) - \mu_y}{\sigma_y}\right)^2\right]$
$CV = S/\bar{x}$
$\hat{\sigma}_y^2 = \ln(CV^2 + 1) = 0.643$
$\hat{\mu}_y = \ln(\bar{x}) - \frac{1}{2}\hat{\sigma}_y^2 = \ln(8621) - 0.643/2 \approx 8.74$
[Figure: histogram with the fitted log-normal density]

Method of Maximum Likelihood
• "Back into" the estimate by assuming that the parameters we are trying to estimate from the data are known.
• How likely are the sample values we have, given a certain set of parameter values?
• We can express this as the joint density of the random sample given the parameter value $\theta$:
$f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f_X(x_i \mid \theta)$
• After we obtain the data (random sample), we use the joint density to define the likelihood function:
$L(\theta \mid x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i \mid \theta)$

Likelihood
$L = \prod_{i=1}^{n} f_X(x_i \mid \theta)$
• ln(L) = -312 (for log-normal)
• ln(L) = -311 (for gamma)
[Figure: histogram of the flows with both fitted densities, density versus flow in mgal]

Normalization
• Much statistical theory relies on the central limit theorem and so applies to normal distributions
• Where the data are not normally distributed, normalizing transformations are used
– Log
– Box-Cox (log is a special case of Box-Cox)

Box-Cox Normalization
The Box-Cox family of transformations includes the logarithmic transformation as a special case ($\lambda = 0$). It is defined as:
$z = (x^{\lambda} - 1)/\lambda$ for $\lambda \ne 0$
$z = \ln(x)$ for $\lambda = 0$
where z is the transformed data, x is the original data and $\lambda$ is the transformation parameter.

Box-Cox Normalization
So… the log looked OK ($\lambda = 0$). Is that what we really want? Let's skip the derivations for now and look at the answer for our three proposed methods.

Determining Transformation Parameters
• Trial and error: apply a series of trial lambda values and evaluate a statistic for each.
• PPCC (Filliben's statistic): R² of the best-fit line of the Q-Q plot
• Kolmogorov-Smirnov (KS) test (any distribution): p-value
• Shapiro-Wilk test for normality: p-value

Quantiles
• Rank the data: $x_1 \le x_2 \le \ldots \le x_n$
• $p_i = \frac{i}{n+1}$ estimates $prob(X \le x_i)$
• The $p_i$ are mapped to quantiles through the CDF F(y) of a theoretical distribution, e.g.
Standard Normal.
[Figure: CDF F(y) of the theoretical distribution, with $p_i$ mapped to $q_i$ on the y axis]
$q_i$ is the distribution-specific theoretical quantile associated with the ranked data value $x_i$.

Quantile-Quantile Plots
[Figure, two panels: Normal Q-Q plot for raw flows (sample quantiles $x_i$ versus theoretical quantiles $q_i$); Normal Q-Q plot for log-transformed flows (sample quantiles $\ln(x_i)$ versus theoretical quantiles $q_i$)]
A transformation is needed to make the raw flows normally distributed.

Example: Determining Transformation Parameters
• Alafia River historical monthly flows
• Evaluate using all three criteria
• Test a range of lambda values from -2 to 2 by 0.1 for Filliben's statistic and KS
• Test a range of lambda values from -1 to 1 by 0.1 for Shapiro-Wilk (errors occur for larger lambda values).

Box-Cox Normality Plot for Monthly September Flows on Alafia R. Using PPCC
[Figure: Filliben's statistic versus Box-Cox lambda value, -2 to 2]
Optimal lambda = -0.14. This is close to 0.

Kolmogorov-Smirnov Test
• Specifically, it computes the largest difference between the target CDF $F_X(x)$ and the observed CDF $F^*(x)$.
• The test statistic $D_2$ is:
$D_2 = \max_{i=1,\ldots,n} \left| F^*(X^{(i)}) - F_X(X^{(i)}) \right| = \max_{i=1,\ldots,n} \left| \frac{i}{n} - F_X(X^{(i)}) \right|$
where $X^{(i)}$ is the i-th smallest observed value in the random sample of size n, so that $F^*(X^{(i)}) = i/n$.

Box-Cox Normality Plot for Monthly September Flows on Alafia R. Using Kolmogorov-Smirnov (KS) Statistic
[Figure: KS p-value versus Box-Cox lambda value, -2 to 2]
Optimal lambda = -0.39. This is not as close to 0.

Shapiro-Wilk test reference: http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/wilkshap.htm

Box-Cox Normality Plot for Monthly September Flows on Alafia R. Using Shapiro-Wilk Statistic
[Figure: Shapiro-Wilk p-value versus Box-Cox lambda value, -1 to 1]
Optimal lambda = -0.14. This is close to 0, and the same as PPCC.
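The trial-and-error search over lambda can be sketched in Python with the standard library only. This is an illustrative sketch, not the procedure used to produce the plots above: a synthetic log-normal sample stands in for the Alafia River flows (which are not reproduced in the slides), the plotting positions are $p_i = i/(n+1)$, the PPCC is computed as the Pearson correlation between the sorted transformed data and standard normal quantiles, and the KS criterion is the distance $\max_i |i/n - F(z^{(i)})|$ against a normal distribution fitted to the transformed data, rather than a p-value. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1)/lam for lam != 0, ln(x) for lam == 0."""
    if abs(lam) < 1e-12:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def ppcc(data, lam):
    """Correlation between ranked transformed data and normal quantiles q_i."""
    z = sorted(box_cox(x, lam) for x in data)
    n = len(z)
    std_normal = NormalDist()
    q = [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return pearson_r(z, q)


def ks_distance(data, lam):
    """max_i |i/n - F(z_(i))| with F a normal CDF fitted to the transformed data."""
    z = sorted(box_cox(x, lam) for x in data)
    n = len(z)
    fitted = NormalDist(mean(z), stdev(z))
    return max(abs(i / n - fitted.cdf(zi)) for i, zi in enumerate(z, start=1))


# Synthetic stand-in data (assumption): log-normal, so lambda near 0 is expected.
random.seed(1)
flows = [random.lognormvariate(7.0, 0.8) for _ in range(200)]

# Trial lambda values from -2 to 2 by 0.1, as in the example above.
lambdas = [round(-2.0 + 0.1 * k, 1) for k in range(41)]

best_ppcc_lambda = max(lambdas, key=lambda lam: ppcc(flows, lam))
best_ks_lambda = min(lambdas, key=lambda lam: ks_distance(flows, lam))
print("lambda maximizing PPCC:", best_ppcc_lambda)
print("lambda minimizing KS distance:", best_ks_lambda)
```

The two criteria need not select the same lambda, which mirrors the difference between the PPCC and KS results reported above.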