Delivering Integrated, Sustainable, Water Resources Solutions
Institute for Water Resources 2010
Choosing a Probability Distribution
Charles Yoe, Ph.D.
“Building Strong”

Probability x Consequence
• Quantitative risk assessment requires you to use probability
• Sometimes you will estimate the probability of an event
• Sometimes you will use distributions to
  – Describe data
  – Model variability
  – Represent our uncertainty
• What distribution do you use?

Probability: Language of Random Variables
• Constant
• Variables
• Some things vary predictably
• Some things vary unpredictably
• Random variables
• It can be something known but not known by us

Checklist for Choosing a Distribution From Some Data
1. Can you use your data?
2. Understand your variable
   a) Source of data
   b) Continuous/discrete
   c) Bounded/unbounded
   d) Meaningful parameters
      – Do you know them? (1st or 2nd order)
   e) Univariate/multivariate
3. Look at your data: plot it
4. Use theory
5. Calculate statistics
6. Use previous experience
7. Distribution fitting
8. Expert opinion
9. Sensitivity analysis

First!
• Do you have data?
• If so, do you need a distribution or can you just use your data?
• The answer depends on the question(s) you are trying to answer as well as on your data

Use Data
• If your data are representative of the population germane to your problem, use them
• One problem could be bounding the data
  – What are the true min & max?
• Any dataset can be converted into a
  – Cumulative distribution function
  – General density function

Fitting Empirical Distribution to Data
• If continuous & reasonably extensive
• May have to estimate minimum & maximum
• Rank data x(i) in ascending order
• Calculate the percentile for each value
• Use data and percentiles to create cumulative distribution function

Index  Data value  Cumulative probability F(x) = i/19
0      0           0
1      0.9         0.053
2      3.6         0.105
3      5.0         0.158
4      6.0         0.211
5      11.7        0.263
6      16.2        0.316
7      16.5        0.368
8      22.2        0.421
9      22.7        0.474
10     23.2        0.526
11     24.5        0.579
12     24.9        0.632
13     25.8        0.684
14     33.3        0.737
15     33.4        0.789
16     34.7        0.842
17     40.2        0.895
18     44.2        0.947
19     60.0        1

When You Can’t Use Your Data
• Given the wide variety of distributions it is not always easy to select the most appropriate one
  – Results can be very sensitive to distribution choice
• Using a wrong assumption in a model can produce incorrect results => poor decisions => undesirable outcomes

Understand Your Data
• What is the source of the data?
  – Experiments
  – Observation
  – Surveys
  – Computer databases
  – Literature searches
  – Simulations
  – Test case
Understand your variable: the source of the data may affect your decision to use it or not.

Type of Variable?
• Barges in a tow
• Houses in floodplain
• People at a meeting
• Results of a diagnostic test
• Casualties per year
• Relocations and acquisitions

• Average number of barges per tow
• Weight of an adult striped bass
• Sensitivity or specificity of a diagnostic test
• Transit time
• Expected annual damages
• Duration of a storm
• Shoreline eroded
• Sediment loads

• Is your variable discrete or continuous?
• Do not overlook this!
  – Discrete distributions: take one of a set of identifiable values, each of which has a calculable probability of occurrence
  – Continuous distributions: a variable that can take any value within a defined range

What Values Are Possible?
• Is your variable bounded or unbounded?
  – Bounded: value confined to lie between two determined values
  – Unbounded: value theoretically extends from minus infinity to plus infinity
  – Partially bounded: constrained at one end (truncated distributions)
• Use a distribution that matches

Continuous Distribution Examples
• Unbounded
  – Normal
  – t
  – Logistic
• Left Bounded
  – Chi-square
  – Exponential
  – Gamma
  – Lognormal
  – Weibull
• Bounded
  – Beta
  – Cumulative
  – General/histogram
  – Pert
  – Uniform
  – Triangle

Discrete Distribution Examples
• Unbounded
  – None
• Left Bounded
  – Poisson
  – Negative binomial
  – Geometric
• Bounded
  – Binomial
  – Hypergeometric
  – Discrete
  – Discrete Uniform

Are There Parameters?
• Does your variable have parameters that are meaningful?
  – Parametric: shape is determined by the mathematics describing a conceptual probability model
    • Require a greater knowledge of the underlying
  – Non-parametric: empirical distributions for which the mathematics is defined by the shape required
    • Intuitively easy to understand
    • Flexible and therefore useful

Choose Parametric Distribution If
• Theory supports choice
• Distribution proven accurate for modelling your specific variable (without theory)
• Distribution matches any observed data well
• Need distribution with tail extending beyond the observed minimum or maximum

Choose Non-Parametric Distribution If
• Theory is lacking
• There is no commonly used model
• Data are severely limited
• Knowledge is limited to general beliefs and some evidence

Parametric and Non-Parametric
• Parametric
  – Normal
  – Lognormal
  – Exponential
  – Poisson
  – Binomial
  – Gamma
• Non-parametric
  – Uniform
  – Pert
  – Triangular
  – Cumulative

Do You Know the Parameters?
• A probability distribution with precisely known parameters, e.g. N(100,10), is called a 1st order distribution
• A probability distribution with some uncertainty about its parameters, e.g. N(m,s), is called a 2nd order distribution
• Example: Risknormal(risktriang(90,100,103),riskuniform(8,11))

Is It Dependent on Other Variables?
• Univariate and multivariate distributions
  – Univariate: describes a single parameter or variable that is not probabilistically linked to any other in the model
  – Multivariate: describes several parameters that are probabilistically linked in some way
• Engineering relationships are often multivariate

Continuing Checklist for Choosing a Distribution
3. Look at your data: plot it
4. Use theory
5. Calculate statistics
6. Use previous experience
7. Distribution fitting
8. Expert opinion
9. Sensitivity analysis

Plot: Old Faithful Eruptions
• What do your data look like?
• You could calculate the mean & SD and assume it's normal
• Beware, danger lurks
• Always plot your data

Which Distribution?
• Examine your plot
• Look for distinctive shapes of specific distributions
  – Single peaks
  – Symmetry
  – Positive skew
  – Negative values
  – Gamma, Weibull, beta are useful and flexible forms

Theory-Based Choice
• Most compelling reason for choice
• Formal theory
  – Central limit theorem
• Theoretical knowledge of the variable
  – Behavior
  – Math: range
• Informal theory
  – Sums normal, products lognormal
  – Study specific
  – Your best documented thoughts on subject

Calculate Statistics
• Summary statistics may provide clues
• Normal
  – Low coefficient of variation
  – Equal mean and median
• Exponential has positive skew
  – Equal mean and standard deviation
• Consider outliers

Outliers
• Extreme observations can drastically influence a probability model
• No prescriptive method for addressing them
• If the observation is an error, remove it
• If not, what is the data point telling you?
  – What about your world-view is inconsistent with this result?
  – Should you reconsider your perspective?
  – What possible explanations have you not yet considered?

Outliers (cont)
• Your explanation must be correct, not merely plausible
  – Consensus is a poor measure of truth
• If you must keep it and can't explain it
  – Use conventional practices and live with skewed consequences
  – Choose methods less sensitive to such extreme observations (Gumbel, Weibull)

Previous Experience
• Have you dealt with this issue successfully before? Have others?
• What did other analyses or risk assessments use?
• What does the literature reveal?
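
The clues listed on the Calculate Statistics slide can be checked with a few lines of code. The sketch below is a minimal illustration and not part of the original slides: the function name summary_clues is hypothetical, the sample is the 20 values tabulated on the Fitting Empirical Distribution to Data slide (treating the two end points as observations is an assumption), and skewness is computed as the simple moment ratio m3 / m2^1.5.

```python
import statistics

# The 20 values from the empirical-distribution table earlier in the deck.
# Assumption: the end points 0 and 60.0 are treated as ordinary observations.
SAMPLE = [0, 0.9, 3.6, 5.0, 6.0, 11.7, 16.2, 16.5, 22.2, 22.7,
          23.2, 24.5, 24.9, 25.8, 33.3, 33.4, 34.7, 40.2, 44.2, 60.0]


def summary_clues(values):
    """Return summary statistics that may hint at a distribution's shape."""
    n = len(values)
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    # Coefficient of variation: spread relative to the mean.
    cv = sd / mean if mean != 0 else float("nan")
    # Moment-based skewness: third central moment over second to the 3/2.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {"n": n, "mean": mean, "median": median,
            "sd": sd, "cv": cv, "skew": skew}


clues = summary_clues(SAMPLE)
for name, value in clues.items():
    print(f"{name:>6}: {value:.3f}")
```

For this sample the mean (22.45) and median (22.95) are close, which is a clue toward symmetry, but the coefficient of variation is not small. The statistics are clues to weigh alongside the plot and theory, not a decision rule.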

Goodness of Fit
• Provides statistical evidence to test the hypothesis that your data could have come from a specific distribution
• H0: these data come from an “x” distribution
• A small test statistic and a large p-value mean you do not reject H0
• It is another piece of evidence, not a determining factor

GOF Tests
• Chi-Square Test
  – Most common: discrete & continuous
  – Data are divided into a number of cells, each cell with at least five observations
  – Usually 50 observations or more
• Kolmogorov-Smirnov Test
  – More suitable for small samples than Chi-Square
  – Better fit for means than tails
• Anderson-Darling Test
  – Weights differences between theoretical and empirical distributions at their tails greater than at their midranges
  – Desirable when a better fit at the extreme tails of the distribution is desired

Kolmogorov-Smirnov Statistic
[Figure: cumulative plot of the data against a fitted Normal(25.2290, 4.9645) curve, with 5.0% / 90.0% / 5.0% bands delimited at 17.06 and 33.39]
• Blue = data
• Red = true/hypothetical
• Find biggest difference between the two
• K-S statistic is largest difference consistent with your
  – n
  – α

Defining Distributions w/ Expert Opinion
• Data never collected
• Data too expensive or impossible
• Past data irrelevant
• Opinion needed to fill holes in sparse data
• New area of inquiry, unique situation that never existed

What Experts Estimate
• The distribution itself
  – Judgment about distribution of value in population
  – E.g. population is normal
• Parameters of the distribution
  – E.g. mean is x and standard deviation is y

Modeling Techniques
• Disaggregation (Reduction)
• Subjective Probability Elicitation
• PDF or CDF
• Parametric or Non-parametric distributions

Elicitation Techniques Needed
• Literature shows we do not assess subjective probabilities well
• In part due to heuristics we use
  – Representativeness
  – Availability
  – Anchoring and adjustment
• There are methods to counteract our heuristics and to elicit our expert knowledge

Sensitivity Analysis
• Unsure which is the best distribution?
• Try several
  – If no difference you are free to use any one
  – Significant differences mean doing more work

Take Away Points
• Choosing the best distribution is where most new risk assessors feel least comfortable.
• Choice of distribution matters.
• Distributions come from data and expert opinion.
• Distribution fitting should never be the basis for distribution choice.

Questions?
Charles Yoe, Ph.D.
[email protected]
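
As a supplement to the Kolmogorov-Smirnov slide, the "biggest difference" between the data and a hypothesised distribution can be sketched in code. This is an illustration under stated assumptions, not the author's procedure: the names ks_statistic and exponential_cdf are hypothetical, the sample is the 20 values from the empirical-distribution table, the empirical CDF uses the standard i/n steps rather than the i/19 convention in that table, and the candidate parameters are estimated from the same sample (which makes published K-S critical values only approximate).

```python
import math
from statistics import NormalDist, mean, stdev

# The 20 values from the empirical-distribution table earlier in the deck.
SAMPLE = [0, 0.9, 3.6, 5.0, 6.0, 11.7, 16.2, 16.5, 22.2, 22.7,
          23.2, 24.5, 24.9, 25.8, 33.3, 33.4, 34.7, 40.2, 44.2, 60.0]


def ks_statistic(values, cdf):
    """Largest vertical gap between the empirical CDF and a candidate CDF."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x,
        # so the gap is checked on both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def exponential_cdf(x, mean_value):
    """CDF of an exponential distribution with the given mean."""
    return 1.0 - math.exp(-x / mean_value) if x > 0 else 0.0


# Assumption: parameters are estimated from the sample itself.
m, s = mean(SAMPLE), stdev(SAMPLE)
candidates = {
    "normal": NormalDist(m, s).cdf,
    "exponential": lambda x: exponential_cdf(x, m),
}

results = {name: ks_statistic(SAMPLE, cdf) for name, cdf in candidates.items()}
for name, d in results.items():
    print(f"{name:>12}: D = {d:.3f}")
```

Each D would then be compared with a critical value for the sample size n and significance level α taken from published tables. In keeping with the take-away points, the comparison is one more piece of evidence alongside theory, plots and experience, and trying more than one candidate in this way is a small instance of the sensitivity analysis the deck recommends.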