Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Answer to #1 L i K n BD foc Koc source-receptor distance hydraulic gradient hydraulic conductivity effective soil porosity soil bulk density fraction organic carbon organic partition coefficient Porosity and conductivity would likely be positively related Porosity and density would likely be inversely related Thus, conductivity and density should be inversely related too Fraction organic carbon and the partition coefficient would likely be positively related Answer to #2 L = [80,120] m i = [0.0003,0.0008] m per m K = [300,3000] m per yr n = [0.2,0.35] BD = [1500,1750] kg per m3 foc = [0.0001,0.005] Koc = [5 ,20] m3 per kg Kd = foc * Koc R = 1 + BD * Kd / n V = K * i / (n * R) T = L/V T [ 20.952380952, 408800] yr // source-receptor distance // hydraulic gradient // hydraulic conductivity // effective soil porosity // soil bulk density // fraction organic carbon // organic partition coefficient func TT() return L/(K * i / ($1 * (1 + BD * foc * Koc / $1))) b = left(n) e = TT(b) w = width(n)/100 for k=1 to 100 do begin s = [ (k-1), k] * w + b e = env(e, TT(s)) end e [ 31.480562448, 235352] yr Answer to #3 Iwater = [1.5, 2.5] liters per day Ifish = [0, 8] g per day B = [0.9, 2.1] liters per g W = [60, 90] kg // water intake // dietary ingestion of fish tissue // bioaccumulation factor // receptor biomass D = [0, 6] mg per kg per day // tolerable dose // forward equation // D = (Iwater * C) / W + (Ifish * B * C) / W // condense repeated parameters // D = C * (Iwater + Ifish * B) / W // solve for C by factoring the quotient (Iwater+Ifish*B)/W out of D C = factor((Iwater + Ifish * B) / W, D) C [ 0, 18.652849741] mg liters1 // put C back into original forward equation (and ensure correct units) d = (Iwater * C) / W + (Ifish * C * B) / W + 0 mg per kg per day d [ 0, 6] mg kg1 day1 Answer to #5 3.5105 3.499105, 3.504105] [3.50105, 1.35104] [3105, 1.4104] [2.9105, 4.1105] [ 2.5105, 1.905 104] 105 Vesely et al. (all independent) Mixed dependencies Frank to Fréchet All Fréchet Mixed dependence with intervals All Fréchet 104 Probability of tank rupturing under pumping 103 Plan of attack for #6 • Use the events from the slide “Snakes on a plane” • Write as Boolean expression – Represent ANDs as conjunctions & – Represent ORs as disjunctions V (plane V ship V raft V pet V zoo) & (pregnant V (malefemale & repro) V (female & parthenog)) & (dogs & cats & pigs) & (eggs V rodents) & (hides V quick V disperse) • Specify the temporal context, e.g., a year • Estimate marginal event probabilities – Maybe as submodels, e.g., plane = flights * stowaway per flight • Decide which events are independent Answer to #7 Assuming a = 0.29, b = 0.22, and r = 1.0, yields a * b + r * sqrt(a*(1-a)*b*(1-b)) 0.2517692528 Pearson AND [ max(0, a+b–1), min(a, b) ] [ 0, 0.22] Fréchet AND The probability assuming Pearson correlation r =1 is larger than the largest possible probability. This happens because the Pearson correlation for two events with these marginals cannot be as large as one (so we are making a false assumption which leads to the error). The extreme Pearson correlations for them are (-a*b)/sqrt(a*b*(1-a)*(1-b)) // when they don’t overlap -0.33941721344 (b-a*b)/sqrt(a*b*(1-a)*(1-b)) // when b’s inside a 0.83098697084 We can check this with this calculation r = 0.83098697084 a * b + r * sqrt(a*(1-a)*b*(1-b)) 0.22 Probability and Monte Carlo simulation Scott Ferson, [email protected] 18 September 2007, Stony Brook University, MAR 550, Challenger 165 Outline • • • • • Probability Ensembles and distributions Pseudorandom numbers and sampling Convolutions via Monte Carlo simulation Selecting distributions – – – • • Dependence Other issues – – • Multiple instantiation Model uncertainty Number of replications to use – – – • • Default distributions Fitting distributions (MoM, ML, regression) Empirical distributions Confidence interval for percentile Confidence interval for whole distribution Latin hypercube sampling to accelerate simulation Summary of Monte Carlo: steps, limits and pitfalls Related methods – – Second-order Monte Carlo Probability bounds analysis Traditional practice • Most assessments are deterministic • Some are deliberately conservative, but – degree of conservatism is opaque, unquantified, and can be inconsistent – difficult to characterize risks, except in extreme situations What’s needed An assessment should tell us • How likely the various consequences are • How reliable the conclusions are Probabilistic models • The same as deterministic models except that point values are replaced by probability distributions • Well, almost • • • • Ensembles Distribution shapes and tails Dependencies Backcalculation What is probability? • The frequentist view is that the probability of an event is its frequency given a long sequence of independent trials • The subjectivist view is that probability of an event is the degree of belief that a person has that the event will occur • Don’t confuse or mix these ideas Bayesian probability • Asserting A means you’ll pay $1 if not A • If the probability of A is P, then a Bayesian agrees to assert A for a fee of $(1-P), and assert not-A for a fee of $P • For Bayesians, belief implies action: bets • People will have different Ps for an A Frequentist probabilty • Let NA be the number of times event A occurs out of N trials • The probability of A is NA/N, as N • If the probability of event A is P(A) then event A will occur NP(A) times, on average, out of every N times Probability • Several ways to express how often the same value appears – Counts (most concrete conception) – Relative frequencies (finite populations) – Probabilities (even more abstract) • Probability distributions used to model both variability and incertitude Probability distributions • Distribution: bunch of values (ignoring order) • Random distribution: order unpredictable • Stationary distribution: unchanging shape Probability distribution display p(x) • Density distribution or mass distributions x • Cumulative distribution function (CDF, integral of density) 1 Pr(x<X) 0 • Complementary CDF (exceedance risk) x 1 Pr(x>X) 0 x Continuous distributions normal 1 uniform 0.8 0.6 exponential 0.4 lognormal 0.2 triangular 0 Weibull 0 beta Laplace many, many more Cumulative probability • • • • • • • • • 2 4 6 8 Values less than 0, or larger than 6, are impossible. Most values are around 3. 80% of values are less than 4. • • • • binominal Poisson discrete uniform reciprocal Cumulative probability Discrete distributions 1 0.8 0.6 0.4 0.2 0 0 2 4 Distributions can also be “mixed” (neither continuous nor discrete) 6 8 Notation and language • A “deviate” or “realization” is a particular real value of a random variable or a value drawn from a distribution • U(a, b) denotes a uniform distribution with minimum a and maximum b, or sometimes a deviate from it • N(, ) denotes a normal distribution with mean and standard deviation , or a deviate from it • A tilda ~ is read “is distributed as” • A distribution that describes the variation of a random variable is sometimes called its “law”, so if X ~ N(0,1) the law for X is normal Ensemble • Statisticians call it the “population” or “reference class” • E.g., dysfunction rates in prostate patients • 50% of attempts at sex fail, or • 50% of men are totally impotent • If the analyst can't tell you what ensemble the distribution represents, the analysis is probably meaningless Computational methods • Analytical approaches – Laplace and Mellin transforms – Delta method (just means and variances) • Discrete probability distributions • Monte Carlo simulation – Sampling values from input distributions – Computing function of inputs – Accumulating output histograms Pseudorandom numbers Sampling from a uniform • Computerized eenie-meanie-miney-moe • • • • Neumann’s middle-square method Linear congruential generators Mersenne twister (bulkier but faster and better) Blum-Blum-Shub • Smart to repeat simulations with different generators • The quantity a + (b − a)u varies uniformly over [a, b] if u varies uniformly over [0,1] Sampling continuous distributions 0.8 0.6 0.4 0.2 0 50 100 Value of random v ariable 0 150 Cumulative probability uniform deviate 1 Sampling discrete distributions 0.8 0.6 0.4 0.2 0 50 100 Value of random variable 0 150 Cumulative probability uniform deviate 1 Generating random deviates • Inverse transform sampling can make realizations from an arbitrary distribution using uniform deviates • Acceptance-rejection method can be used when the CDF has no closed form • Many distribution shapes have special algorithms that are faster or better Box-Muller algorithm N(0,1) ~ 2 ln(u) sin(2 v) where u, v ~ U(0,1), and u v • Specialized generator for normal deviates • Faster than inverting the cumulative normal distribution which has no closed form • General deviates N(, ) ~ N(0,1) + Rough and ready normals • Add 12 independent uniforms to 6 • Algorithm easy to remember • Not so great at extreme tails – Bounded inside [6, +6] Evans et al. describe many distributions and algorithms to generate samples from them Convolution (arithmetic on distributions) Convolution + X = Y X+Y Convolution is getting the output distribution from the input distributions. Monte Carlo is one way to do this. Integrated calculations X Sample realizations from each input distribution, then combine them all to compute a (scalar) output value. Repeat many times to accumulate a distribution of output values. Y W Z f(W,X,Y,Z) This is unlike interval analysis which solved operations sequentially Travel time (Lobascio) n BD foc Koc L T K i Parameter L source-receptor distance i hydraulic gradient K hydraulic conductivity n effective soil porosity BD soil bulk density foc fraction organic carbon Koc organic partition coefficient Units m m/m m/yr kg/m3 m3/kg Min 80 0.0003 300 0.2 1500 0.0001 5 Max 120 0.0008 3000 0.35 1750 0.005 20 Mean 100 0.00055 1000 0.25 1650 0.00255 10 Stdv 11.55 0.0001443 750 0.05 100 0.001415 3 Shape uniform uniform lognormal lognormal lognormal uniform normal Implementation in R rlognorm < function(many,a, b) return(rlnorm(many,log(a^2/sqrt(a^2+b^2)),sqrt(log((a^2+b^2)/a^2)))) many < 2000 L < runif(many, 80, 120) i < runif(many, 0.0003 ,0.0008) K < pmax(300, pmin(3000, rlognorm(many, 1000, 750))) n < pmax(0.2, pmin(0.35, rlognorm(many, 0.25, 0.05))) BD < pmax(1500, pmin(1750, rlognorm(many, 1650, 100))) foc < runif(many, 0.0001, 0.005) Koc < pmax(5, pmin(20, rnorm(many, 10, 3))) T < (n + BD * foc * Koc) * L / (K * i) summary(T) Min. 1st Qu. 100.4 3853.0 Median 8111.0 Mean 12460.0 3rd Qu. 16950.0 < is assignment rdist functions rlnorm’s arguments pmin and pmax Max. 108800.0 plot(sort(T), 1:many/many, t='s', xlab='T', ylab='Probability') D = T[T<500] plot(sort(D), 1:length(D)/many, t='s', xlab='T', ylab='Probability', ylim=c(0,1)) length(D)/many 0.0195 Result of the simulation 1 Probability 0.8 0.6 0.4 0.2 0 0 50,000 Travel time [years] 100,000 Interpreting the result • What is the minimum time? Maximum? » 100 years, 109,000 years • What happens when reps are increased? » Smaller min, larger max, approaching interval range • Why is the maximum time so much smaller than the 234,000 years? » Distribution highly skewed • If the random variables weren’t truncated, what would the min and max be? » Negative travel times (!), larger max time Detail 1 Probability 0.8 0.6 0.4 0.2 2% 0 0 100 200 300 400 Travel time [years] 500 Selecting input distributions How to specify distributions • • • • • Default distribution Expert opinion Fitted distribution (goodness of fit) Empirical distribution Maximum entropy Default distributions • • • • • • Physicists overuse normal distributions Risk analysts overuse uniform distributions Reliability engineers overuse Weibulls Ecologists overuse lognormals Statisticians overuse beta distributions Food safety overuses triangular distributions Cooper’s quip Arguments (frequently abused) • Sum of many independent comparable terms (normal) • Product of many independent factors (lognormal) • Largest of many values (Gumbel, Weibull, Fréchet) • Lifetimes with neither aging nor burn-in (exponential) • Symmetry (uniform) • Rare discrete events (Poisson) • Independent trials (binomial) Fitting distributions • Assumes you know the distribution family • Different methods give different results – Method of moments – Maximum likelihood – Regression methods Method of matching moments • Pick the distribution family – Mechanistic argument or mathematical convenience • Compute moments of the data • Select distribution from family with same moments – E.g., exponential (1), normal (2), triangular (3) • Other issues – Check the goodness of fit of the answer – Software can check many shapes at once – MoM not always available, lacks optimality properties Example 1 Second moment x 2 = 0.2334 n normal(0.439, 0.2334) sd 0.2 1 Cumulative probability 2 0.8 Probability density Data 0.653 0.178 0.263 0.424 0.284 0.438 0.471 0.852 0.480 0.375 0.148 0.185 0.320 0.642 0.247 0.784 0.643 0.261 0.636 0.487 1 Mean x = 0.439 n 0.6 1 0.4 0.2 0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Maximum likelihood (ML) • Find the distribution that maximizes the likelihood of having observed the data • Likelihood is the probability of seeing the data assuming the distribution actually has some form datum Maximum likelihood in practice • When data are sampled independently, just multiply the likelihoods • Formulas for many shapes known • But they can sometimes be scary • Maximum likelihood agrees with MoM for some distributions (e.g., normal and exponential) Graphical/regression methods • Rank the data • Transform ranks by probability distribution • Regress data against the transformed ranks x 10 8 6 4 2 0 - 1.5 The slope and intercept give the parameters of the fitted distribution -1 - 0.5 0 0.5 1 1.5 Normal score of datum rank Goodness of fit (GoF) • Goodness of fit measures – Chi squared – Anderson-Darling (focuses on tails) – Kolmogorov-Smirnov (largest vertical difference) – Specialized tests for particular families (N,L : Shapiro-Wilks, Shapiro-Frances, D’Agostino, Lillefors) • EPA says it’s less important than sensibleness – Recommend bypassing a better fitting distribution for one that is plausible under some mechanism Empirical distributions • Distribution characterized by observed data F(x) = 1/N #(Xi < x) • Assumes data were randomly collected • Doesn’t need a shape assumption Data 0.653 0.178 0.263 0.424 0.284 0.438 0.471 0.852 0.480 0.375 0.148 0.185 0.320 0.642 0.247 0.784 0.643 0.261 0.636 0.487 Empirical cumulative probability Example 1 0.8 0.6 0.4 0.2 0 0 0.2 0.4 0.6 0.8 Value of random v ariable 1 1.2 Drawbacks of EDFs • Need a fair amount of data to yield reliable estimates of distributions • Tails are likely to be poorly modeled • Data usually underestimate the range • Depends on “plotting position” Surrogacy • Data from a “related” species • Data on spatial variation used for temporal • QSAR, TEFs, allometries, regressions, etc. • Analyst should model this extrapolation • But it’s often not clear how to fix values Maximum entropy • Information theory n pi ln pi i 1 • All possibilities are equiprobable • Laplace's “principle of insufficient reason” • Specified by what is known • Mathematically more defensible than arbitrary assignment of distributions • Much cheaper than expert elicitation Maximum entropy solutions When you know {minimum, maximum} {mean, standard deviation} {minimum, maximum, mode} {minimum, mean} {min, max, mean, stddev} {minimum = 0, some quantile} {minimum > 0, some quantile} {minimum, maximum, mean} {mean, geometric mean} Use this shape uniform normal beta exponential beta exponential gamma beta gamma 1 1 1 mean, std min,max 0 0 2 4 6 8 10 0 -10 1 0 10 20 1 0 4 6 8 10 4 6 8 30 40 2 4 min,max, mean,std 0 20 50 40 50 6 8 10 1 min, mean 0 30 sample data, range 10 1 10 20 0 2 1 0 10 min, max, mean 0 2 0 1 integral, min,max Cumulative probability positive, quantile percentiles 0 2 4 6 8 10 2 4 6 8 10 Maximum entropy’s problem • Depends on the choice of scale • For instance, knowing the possible range for degradation rate yields one distribution, but knowing the possible range for half life yields a very different one even though the information is exactly the same How to specify distributions • Default distributions come right out of the book • Fitted or empirical distributions usually not enough data available • Extrapolations and surrogate data requires professional judgment • Elicitation from experts expensive, controversial when experts disagree • Maximum entropy criterion inconsistent through changes of scale All distributions should be reality-checked Example: mercury in wild mink Location: Bayou d’Inde, Louisiana Receptor: generic piscivorous small mammal Contaminant: mercury Exposure route: diet (fish and invertebrates) Based loosely on the assessment described in “Appendix I2: Assessment of Risks to Piscivorus [sic] Mammals in the Calcasieu Estuary”, Calcasieu Estuary Remedial Investigation/Feasibility Study (RI/FS): Baseline Ecological Risk Assessment (BERA), prepared October 2002 for the U.S. Environmental Protection Agency. See http://www.epa.gov/earth1r6/6sf/pdffiles/appendixi2.pdf. Total daily intake from diet Cfish | | Pfish Cinverts | | Pinverts TDI FMR | | AE GE AE | | GE fish fish inverts inverts FMR normalized free metabolic rate of the mink AEfish assimilation efficiency of dietary fish in the mink AEinverts assimilation efficiency of dietary invertebrates in the mink GEfish gross energy of fish tissue GEinverts gross energy of invertebrate tissue Cfish mercury concentration in fish tissue Cinverts mercury concentration in invertebrate tissue Pfish proportion of fish in the mink’s diet Pinverts proportion of invertebrates in the mink’s diet What’s known about the inputs FMR, Regression on mink body mass FMR = a BW b, where a and b are intervals BW, Empirically well studied Normal with known mean and dispersion AEfish, AEinverts, Some field data Mean and upper and lower values GEfish, GEinverts, Many measurements Normal with known mean and dispersion Cfish, Cinverts, Dictated by EPA policy 95% upper confidence limit on concentration Pfish, Pinverts, Assumed by the analyst Constants Input assignments BW ~ normal( 608, 66.9) gram a ~ uniform(0.354,0.47) b ~ uniform(0.836,0.888) FMR ~ a * BW ^ b AEfish ~ MEminmaxmean(0.77, 0.98, 0.91) AEinverts ~ MEminmaxmean(0.72, 0.96, 0.87) GEfish ~ normal(1200, 240) GEinverts~ normal(1050, 225) Cfish = 0.3 Cinverts = 0.06 Pfish = 0.9 Pinverts = 0.1 Kcal per kg per day Kcal per kg Kcal per kg mg per kg mg per kg Input distributions BW 400 0 600 800 FMR 0 100 200 GE2 C1 1000 2000 0.1 0.2 0.3 0.4 AE1 0.7 0.8 0.9 1 AE2 0.7 0.8 0.9 1 C2 0.02 0.04 0.06 GE1 P1 0.8 0.9 0 1000 2000 P2 1 0.1 0.12 Subscript 1 denotes fish, 2 denotes inverts Results 1 Exceedance risk mean median 95th percentile standard deviation 0.019 0.017 0.032 0.007 0 0 0.1 TDI, mg kg1 day1 0.2 Wrinkles • • • • • Number of replications Model uncertainty Backcalculation Multiple instantiations Dependence among variables Dependence Dependence • Most variables assumed independent • But some variables clearly aren’t – Density and porosity – Rainfall and temperature – Body size and skin surface area Dependence can be complex Y X Dependencies • • • • • • • • Independence (knowing X tells nothing about Y) Perfect dependence F(X) = G(Y) Opposite dependence F(X) = 1G(Y) Complete dependence Y = z(X) Linearly correlated Y = mX + b + Ranks linearly correlated G(Y) = mF(X) + b + Functional modeling Y = z(X) + Complex dependence (anything else!) Uncorrelatedness is not independence Perfect dependence uniform deviate 1 0.8 0.6 0.4 0.2 0 50 100 150 0 800 0 1600 Cumulative probability • May be better than assuming independence • Very easy to simulate Specified correlations Zi • Scheuer and Stoller (1962) • Pearson, not rank, correlation • Only for normal marginals i cij Y j , j 1 i 1,..., k σ11 σ1k CC T σ k1 σ kk Correlated deviates are weighted sums of independent standard normal deviates Yj ~ normal(0,1) where weights cij are the elements of a lower triangular matrix C solving = CCT (by Cholesky decomposition) where is the intended variance-covariance matrix (in which diagonals are variances off-diagonals are covariances) Bivariate case If Y1, Y2 ~ normal(0,1) are independent, then Z1 Y1σ1 μ1 Z 2 rY1 1 r Y2 σ 2 μ 2 2 will produce deviates such that Z1 ~ normal(1, 1) Z2 ~ normal(2, 2) corr(Z1, Z2) = r Modeling dependencies • Need to check that correlations are not infeasible (e.g., corr(A,B) = 0.9, corr(A,C)=0.9, corr(B,C)=0.9) • Monte Carlo simulations can model each of the special cases, but not complex dependence • Many dependence patterns yield the same correlation coefficient (most MC software arbitrarily selects one of these patterns) • It can be difficult to specify the dependence if empirical information is sparse But what if we don’t know r ? Cumulative probability 1 0.8 0.6 X ~ normal(5,1) Y ~ normal(10,2) various correlations 0.4 0.2 0 0 20 40 60 80 X Y 100 120 140 A ~ normal(2,1), B ~ lognormal(5,2) Cumulative probability A+B (1000 Trials with independence) A+B (1000 Trials with 0.7 correlation) 1.000 .750 .500 .250 .000 -5.00 0.00 5.00 10.00 15.00 Differences in tail risks could be much greater if the dependence between A and B is nonlinear, even if their correlation is very small. Each variable gets a single value • A variable can't be independent of itself, e.g., a multiple-route exposure model cairiair cwateriwater csoilisoil BW BW BW • All BW’s should be the same value within a single Monte Carlo replicate • Be wary of “libraries” of partial results Microevent modeling Monte Carlo simulation Simulate many anglers Simulate an angler Sample BW, ED Simulate ED years Simulate a year Sample EF Simulate EF meals Simulate a meal Sample Cfish, IR, LOSS slug = Cfish (1LOSS) IR CF acute = slug / BW total = total + slug chronic = total / (BW AT) Example: PCBs and duck hunters Location: Berkshire County, Massachusetts Receptor: Adult human hunters of waterfowl Contaminant: PCBs (polychorinated biphenyls) Exposure route: dietary consumption of contaminated waterfowl Loosely based on the assessment for non-cancer risks from PCB to adult hunters who consume contaminated waterfowl described in Human Health Risk Assessment: GE/Housatonic River Site: Rest of River, Volume IV, DCN: GE-031203-ABMP, April 2003, Weston Solutions (West Chester, Pennsylvania), Avatar Environmental (Exton, Pennsylvania), and Applied Biomathematics (Setauket, New York). Hazard quotient EF | | IR | | C | | 1 | | LOSS HQ AT | | BW | | RfD EF = EDF(n = 23) meals per year // exposure frequency, censored data IR = lognormal(188, 113) grams per meal // poultry ingestion from EPA’s EFH C = 9.73 mg per kg // exposure point (mean) concentration LOSS = 0 // loss due to cooking AT = 365.25 days per year // averaging time (not just units conversion) BW = mixture(BWfemale, BWmale) // Brainard and Burmaster (1992) BWmale = lognormal(171, 30) pounds // adult male n = 9,983 BWfemale = lognormal(145, 30) pounds // adult female n = 10,339 RfD = 0.00002 mg per kg per day // reference dose (EPA considers tolerable) Inputs 1 1 EF IR 0 0 0 10 20 30 40 meals per year 50 60 200 400 grams per meal 600 1 Exceedance probability 1 0 males females C BW 0 Always a point value, unless microevents are modeled 0 0 100 200 pounds 300 0 10 mg per kg 20 Results 1 HQ mean standard deviation median 95th percentile range 18.2 35.2 6.8 69 [0.01, 1385] 0 0 1000 2000 Backcalculation Backcalculation • Cleanup and remediation planning requires backcalculation • How can we untangle the expression A+B = C when we know A and C, and need B? Can’t just invert the equation conc intake dose = body mass dose body mass conc = intake 0.06 0.4 0.04 0.2 0.02 0.00 0 2 4 6 Planned dose 8 0.0 0 0.06 0.03 0.04 0.02 0.02 0.01 0.00 0 2 4 Intake 6 0.00 0 8 2 4 6 Body mass 8 2 4 6 Concentration 8 0.06 Large doses in the realized distribution (arising from the allowed distribution of concentrations) are more common than were originally planned 0.05 0.04 planned 0.03 0.02 realized 0.01 0 0 2 4 Dose 6 8 Normal approximation • If A+B=C, compute B as CA under the assumption that the correlation between A and C is r = sA/sC • Needs Pearson (not rank) correlation • Usually okay if multivariate normal Normal approximation with non-normal distributions 1 .0 0 .7 5 .5 0 .2 5 .0 1 .0 0 3 .0 0 5 .0 0 7 .0 0 9 .0 0 A=lognormal(5,1.5) 0 .0 0 4 .3 8 8 .7 5 13 .1 3 17 .5 0 C=triangular(0,12,15.5) 1 .0 0 realiz ed .7 5 .5 0 planned .2 5 .0 -5 .0 0 -0 .6 3 3 .7 5 8 .1 3 CA, with r =sA/sC 12 .5 0 0 5 10 C 15 20 Iterative (trial & error) approach • • • • • • Initialize B with CA This distribution is too wide Transform density p(x) to p(x)m Rescale so that area remains one Whenever m > 1 dispersion decreases Repeat until you get an acceptable fit 0.06 By trial and error, you may be able to find a distribution of concentrations that yields something close to the planned distribution of doses. This is not always possible however. 0.05 0.04 conc intake dose body mass 0.03 intake ln( conc ) ln( dose) ln body mass A B C 0.02 0.01 0 0 2 4 Dose 6 8 Model uncertainty Model uncertainty • Doubt about the structural form of the model model uncertainty Parameters Distribution shape Intervariable dependence Arithmetic equation Level of abstraction • Usually incertitude rather than variability • Usually considerable in ecosystems models • Often the elephant in the middle of the room Monte Carlo strategy • Introduce a new discrete variable • Let the value of the variable dictate which model will be used in each rep • Wiggle the value from rep to rep • Only works for short, explicit list of models (you have to list the models) • Many theorists object to this strategy Model uncertainty as a mixture PDF If u>0.5 then model=I else model=II II I 1 CDF I or II 0 How many replicates How many replications • • • • • • • More is always better Tails are especially hard to nail down Curse of dimensionality Latin hypercube sampling can help Repeat the simulation as a check best Confidence intervals on fractiles Kolmogorov-Smirnov limits on distributions 100 versus 10,000 replicates 1.000 0.02 100 Trials: close-up of left hand tail .750 .500 0.01 .250 0.00 .000 0.02 1.000 10,000 Trials: close-up of left hand tail .750 0.01 .500 .250 0.00 .000 0.00 125.00 250.00 375.00 Time to contamination (yr) 500.00 Confidence interval for a fractile The a100% confidence interval for the pth fractile can be estimated by [Yi, Yj ], where Yk is the (n k + 1)th largest value from the Monte Carlo simulation, i = floor(np b), j = ceiling(np + b), b = z1((1 a)/2)(np(1 p)) • Vary n to get the precision you desire • But remember this represents only sampling error, not measurement error Kolmogorov-Smirnov bounds Bounds on the distribution as a whole 100 replications replications 200 1000 replications replications 2000 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 95% of the time, the entire distribution will lie within the bounds 8 9 Latin hypercube sampling • Stratify the sampling of distributions – Divide each distribution into n equal-probability regions, where n = #reps – For each replicate, select one deviate from each distribution – Select each region only once • Often much better for moments, sometimes better for tails • May reduce the number of replicated needed Conclusions Contrast with interval analysis There are fundamentally two approaches to calculation under uncertainty: Bounding • Say where you know the answer isn’t • Tighten range Approximation • Say where you think the answer is • Minimize bias Monte Carlo simulation • How? – replace point estimates with distributions – repeatedly sample from each distribution and compute – tally answers in histogram • Why? – simple to implement – fairly simple to explain – summarizes entire distribution of risk • Why not? – requires a lot of empirical information (or guesses) – routine assumptions may be “non-protective” Why do probabilistic analysis? • • • • • The only way to get at likelihoods Enhances understanding of risks Improves decision making EPA guidance available NASA guidance in preparation Steps in a Monte Carlo PRA • Clarify the questions and gather data • Formulate the model (identify variables and their interrelationships) • Specify distributions for each input • Specify dependencies and correlations • Run simulation • Conduct sensitivity studies • Present results • Discuss limitations of the assessment Pitfalls • • • • • • • Muddled model Infeasible correlation matrices Inconsistent independence assumptions Multiply instantiated variables Too few replications No sensitivity studies Confounding of different populations Limitations of Monte Carlo • • • • • • • • Thought by some to be too difficult Needs precise, stationary distributions Nonlinear dependencies often ignored Unknown dependencies often ignored Assumes model is well understood Can be computationally expensive Backcalculations difficult Treats incertitude like variability Kinds of uncertainty • Variability – – – – Stochasticity Heterogeneity Spatial variation Individuality and genetics • Incertitude – Measurement imprecision – Ignorance – Model uncertainty • Vagueness, confusion, etc. Why distinguish them? • Empirical effort can reduce incertitude • Usually can't reduce variability • Managers need to distinguish the two kinds of uncertainty to plan remediation efforts effectively Two-dimensional simulation • • • • • Monte Carlo nested inside Monte Carlo Inner loop for variability Outer loop for uncertainty (incertitude) Squared number of replications Integrated sensitivity analysis Meta-distribution is the output Second-order tail risks 1 Exceedance risk (Complementary) cumulative version of the meta-distribution 0 0 Probability boxes • • • • Bounds about the cumulative distribution Rigorous (rather than approximate) Analytical (not from simulations) Can account for lack of knowledge about distribution shapes and dependencies 1 1 1 0 0 0 0 20 0 40 0 80 References Abbas, A.E. 2003. Entropy methods for univariate distributions in decision analysis. Bayesian Inference and Maximum Entropy Methods in Science and Engineering: 22nd International Workshop, editied by C.J. Williams, American Institute of Physics. Cullen, A.C., and H.C. Frey. (1999). Probabilistic Techniques in Exposure Assessment: A Handbook for Dealing with Variability and Uncertainty in Models and Inputs. Plenum Press: New York. Evans, M., N. Hastings, B. Peacock. 2000. Statistical Distributions. John Wiley and Sons, New York. Ferson, S., R.B. Nelsen, J. Hajagos, D.J. Berleant, J. Zhang, W.T. Tucker, L.R. Ginzburg and W.L. Oberkampf. 2004. Dependence in probabilistic modeling, Dempster-Shafer theory, and probability bounds analysis. Sandia National Laboratories, Albuquerque, NM. Hansen, F. (1997). Policy for Use of Probabilistic Analysis in Risk Assessment at the U.S. Environmental Protection Agency. EPA/ORD/NCEA May 15:1-4. Available on-line at http://www.epa.gov/osp/spc/probpol.htm. Jaynes, E.T. (edited by G. Larry Bretthorst). 2003. Probability Theory: The Logic of Science. Cambridge University Press. Lee, R.C. and W.E. Wright. 1994. Development of human exposure-factor distributions using maximum-entropy inference. Journal of Exposure Analysis and Environmental Epidemiology 4:329341. Morgan, M.G. and M. Henrion. (1990). Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. Cambridge University Press, Cambridge, UK. Scheuer, E.M. and D.S. Stoller. (1962). On the generation of normal random vectors. Technometrics 4:278-281. Solana, V. and N.C. Lind. 1990. Two principles for data based on probabilistic system analysis. Proceedings of ICOSSAR '89, 5th International Conferences on Structural Safety and Reliability. American Society of Civil Engineers, New York. Exercises for next time 1. How could you sample deviates from a normal and a lognormal distribution that have opposite dependence? 2. Why is the concentration generally modeled as a point value in PRA’s? Under what conditions should a distribution be used? 3. What are the implications of using a uniform distribution to model an interval range? 4. How could model uncertainty be characterized if we cannot list the possible models?