Kansrekening en steekproeftheorie (Probability and sampling theory)
Pieter van Gelder, TU Delft
IVW course, 16 September 2003

• The basics of probability theory, as the foundation for the course;
• Estimation of distribution parameters;
• Sampling theory, working both with and without prior information (Bayesian versus classical sampling);
• Dependencies between variables, and risks.

Inspection in Civil Engineering: Stochastic variables

Outline
• What is a stochastic variable?
• Probability distributions
• Fast characteristics
• Distribution types
• Two stochastic variables
• Closure

Stochastic variable
• A quantity that cannot be predicted exactly (uncertainty), due to:
– Natural variation
– Shortage of statistical data
– Schematizations
• Examples:
– Strength of concrete
– Water level above a tunnel
– Lifetime of a chisel
– Throw of a die

Relation to events
• Express uncertainty in terms of probability
• Probability theory is formulated in terms of events
• Connect a value of the variable to an event
• E.g. the probability that the stochastic variable X
– is less than x
– is greater than x
– is equal to x
– is in the interval [x, x+Δx]
– etc.

Probability distribution
• Probability distribution function = probability P(X ≤ x):
• FX(x) = P(X ≤ x)
[Figure: distribution function FX(x), increasing from 0 to 1 as a function of x]

Probability density
• The more familiar form of a probability 'distribution' is the probability density function.
[Figure: bell-shaped probability density over x]

Probability density
• Differentiation of F with respect to x:
• fX(x) = dFX(x)/dx
• f = probability density function
• fX(x) dx = P(x < X ≤ x+dx)
[Figures: the distribution function FX(x), showing P(X ≤ x), and the density fX(x), where the area between x and x+dx equals P(x < X ≤ x+dx)]

Discrete and continuous
• Discrete variable: probability mass function pX(x) with its (cumulative) probability distribution FX(x)
• Continuous variable: probability density fX(x) with its (cumulative) probability distribution FX(x)
[Figures: pX(x) and FX(x) for a discrete variable; fX(x) and FX(x) for a continuous variable]

Fast characteristics
• μX: mean, indication of location
• σX: standard deviation, indication of spread
[Figure: symmetric density with mean μX and spread σX marked]

Fast characteristics
• For a skewed density, the mean μX differs from the location of the maximum (the mode).
[Figure: skewed density with the mode and the mean μX marked]

Fast characteristics
• Mean (centre of gravity): μX = ∫ x fX(x) dx
• Variance: σX² = ∫ (x − μX)² fX(x) dx
• Standard deviation: σX = √(σX²)
• Coefficient of variation: VX = σX / μX

Normal distribution
[Figure: normal densities with mean μX and standard deviation σX marked]
• Completely determined by the mean and the standard deviation.

Normal distribution
• Probability density function:
  fX(x) = 1/(σ√(2π)) · exp(−½ ((x − μ)/σ)²)
• A standard normally distributed variable (often denoted by u) has
  μu = 0, σu = 1

Normal distribution
• Why so popular?
• Central limit theorem: the sum of many variables with arbitrary distributions is (almost) normally distributed.
• Convenient in structural reliability calculations.
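The central limit theorem is easy to see in simulation. Below is a minimal base-Matlab sketch (the choices n = 12 and M = 10000 are illustrative, and histogram with the 'pdf' normalization needs Matlab R2014b or later): it sums n Uniform(0,1) variables, standardizes the sums, and overlays the standard normal density.

  % CLT demo: standardized sums of uniforms versus the normal density.
  n = 12;                                 % uniforms per sum
  M = 10000;                              % number of simulated sums
  S = sum(rand(n, M));                    % M sums of n Uniform(0,1) draws
  u = (S - n/2) ./ sqrt(n/12);            % each uniform has mean 1/2, variance 1/12
  histogram(u, 'Normalization', 'pdf'); hold on
  x = -4:0.01:4;
  plot(x, exp(-x.^2 / 2) / sqrt(2*pi));   % standard normal pdf
  hold off

Even for n = 12 the histogram should already be hard to distinguish from the normal curve by eye.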
Two stochastic variables
• Joint probability density function fXY(x, y)
[Figures: surface plot of a joint probability density ('kansdichtheid') over the (x, y) plane, and the corresponding contour map of the probability density]

Two stochastic variables
• Relation to events:
  fXY(x, y) dx dy = P(x < X ≤ x+dx and y < Y ≤ y+dy)
  FXY(x, y) = P(X ≤ x and Y ≤ y)
[Figure: contour map with a small rectangle with sides dx and dy]

Example
• Health survey. Measurements of:
• Length [Figure: probability density of length, in 1/m, over roughly 1.2 to 2.6 m]
• Weight [Figure: probability density of weight, in 1/kg, over roughly 50 to 110 kg]

Logical contour map?
[Figure: circular contour map of weight (kg) against length (m)]

Dependency
[Figure: tilted contour map of weight against length: taller people tend to be heavier]

Fast characteristics
• Location: μX, μY, the means
• Spread: σX, σY, the standard deviations
• Dependency:
– covXY, the covariance
– ρXY = covXY / (σX σY), the correlation, between −1 and 1

Independent variables
• FXY(x, y) = FX(x) FY(y)
• fXY(x, y) = fX(x) fY(y)
• covXY = 0, so ρXY = covXY / (σX σY) = 0

Closure of the short introduction to stochastics
• What is a stochastic variable?
• Probability distributions
• Fast characteristics
• Distribution types
• Two stochastic variables

Parameter estimation methods
• Given a dataset x1, x2, …, xn
• Given a distribution type F(x | A, B, …)
• How to fit the unknown parameters A, B, … to the data?

List of estimation methods
• MoM (method of moments)
• ML (maximum likelihood)
• LS (least squares)
• Bayes

MoM
• Distribution moments = sample moments:
  ∫ x^k f(x) dx = (1/n) Σi xi^k
• Example: F(x) = 1 − exp[−(x − A)/B]
  The mean of this distribution is A + B and its standard deviation is B, so matching moments gives
  B_MoM = std(x), A_MoM = mean(x) − std(x)

Binomial distribution
• X ~ Bin(N, p)
• The binomial distribution gives the discrete probability of obtaining exactly n successes out of N Bernoulli trials (where the result of each Bernoulli trial is true with probability p and false with probability q = 1 − p). The binomial distribution is therefore given by
  fX(n) = C(N, n) p^n q^(N−n)
  where C(N, n) is the binomial coefficient.
• E(X) = Np; var(X) = Npq

MoM-estimator of p
• p_MoM = (Σ xi) / N
• Matlab simulation of the performance of the estimator:

  M = 1000; N = 10; p = 0.2;
  pMOM = zeros(1, M);
  for j = 1:M
      x = rand(1, N) < p;        % N Bernoulli(p) trials
      pMOM(j) = sum(x) / N;      % MoM estimate of p from this sample
  end
  hist(pMOM)

[Figure: histogram of pMOM, showing the performance of the p-estimation (N = 10; p = 0.2)]

Case study
• Web-traffic statistics: the number of pageviews on websites.

Statistics on usage of screen sizes
• Is it necessary to obtain the screen size from every user?
• Or is it sufficient to inspect the screen size of just N users, and still have a reliable percentage for the screen sizes in use?
• Assume that 41% of the complete population uses size 1024x768.
• Take inspection sample sizes N = 100, 1000, … and simulate the results by generating the usage from a binomial distribution, as in the sketch below.
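One way to carry out that simulation is sketched below (binornd is in the Statistics Toolbox; the assumed true share p = 0.41 and the number of repetitions are illustrative). It also prints the theoretical coefficient of variation derived next, for comparison.

  % Estimate the 1024x768 share from a sample of N users, repeated M times.
  p = 0.41;  M = 10000;                   % assumed true share; repetitions
  for N = [100 1000 10000 1000000]
      y = binornd(N, p, M, 1);            % 1024x768 users in each sample
      phat = y / N;                       % estimated share per sample
      fprintf('N = %7d: cov = %.4f (theory %.4f)\n', ...
              N, std(phat) / mean(phat), sqrt(1/p - 1) / sqrt(N));
  end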
• Theoretical analysis: cov = √(1/p − 1) · N^(−1/2)

Coefficient of variation (as a function of p and N)

  p         N = 100   N = 1000   N = 10 000   N = 10^6
  41.4%     11.75%    3.7%       1.2%         0.1%
  39.8%     12.3%     3.9%       1.3%         0.1%
  6.2%      38.9%     12.3%      3.9%         0.4%
  5.4%      41.8%     13.2%      4.2%         0.4%
  3.2%      55.0%     17.4%      5.5%         0.55%

Optimisation of the inspection sample size
• Assume the cost of getting the screen-size information from one user is A.
• Assume the cost of having a larger cov-value is B.
• TC(N) = A·N + B·√(1/p − 1)·N^(−1/2)
• The optimal sample size follows from TC′(N) = 0, i.e. A = ½·B·√(1/p − 1)·N^(−3/2), giving
  N* = (B/(2A))^(2/3) · (1/p − 1)^(1/3)
• For this choice of N, cov = (2A/B · (1/p − 1))^(1/3)

Case study: container inspection
• Allowable 'slip-through probability' p = 1/1,000 containers
• The population consists of 100,000 containers
• The inspection consists of checking 1,000 containers
• Suppose that 1 container from this sample is rejected
• Then p_MoM = 0.001, and std(p_MoM) = 0.001
• If std(p_MoM) < 0.0001 is required, the full population must be inspected: since std(p_MoM) = √(pq/N) ≈ √(p/N), reaching 0.0001 at p = 0.001 requires N > 100,000, the whole population.

Inspection of the full population (for small p-values)
• The inspection costs must be earned back from the fines collected
• Inspection costs: 100,000 × K
• Return without inspection: p × 100,000 × NI (Negative Impact)
• Return with inspection: p × 100,000 × fine − 100,000 × K
• Requiring p × 100,000 × fine − 100,000 × K > p × 100,000 × NI gives
  fine > K/p + NI

Bayesian analysis of a one-parameter model
I. The binomial distribution with a uniform prior
II. Posterior interpretation
III. The binomial distribution with a beta prior; conjugate priors and sufficient statistics

Review of the Bayesian set-up
From the Bayesian perspective, there are known and unknown quantities.
– The known quantity is the data, denoted D.
– The unknown quantities are the parameters (e.g. mean, variance, missing data), denoted θ.
To make inferences about the unknown quantities, we stipulate a joint probability function that describes how we believe these quantities behave in conjunction, p(θ and D). Using Bayes' rule, this joint probability function can be rearranged to make inferences about θ:
  p(θ | D) = p(θ) p(D | θ) / p(D)

Review of the Bayesian set-up (cont.)
  p(θ | D) = p(θ) p(D | θ) / p(D) = p(θ) L(θ | D) / ∫ p(θ) p(D | θ) dθ
• L(θ | D) is the likelihood function for θ.
• ∫ p(θ) p(D | θ) dθ is the normalizing constant or the prior predictive distribution. It is the normalizing constant because it ensures that the posterior distribution of θ integrates to one. It is the prior predictive distribution because it is not conditional on a previous observation of the data-generating process (prior) and because it is the distribution of an observable quantity (predictive).

Review of the Bayesian set-up (cont.)
This is often rewritten in more compact notation:
  p(θ | D) ∝ p(θ) L(θ | D)
i.e. posterior ∝ prior × likelihood.
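Before specializing to the binomial model below, the proportionality can be made concrete numerically: evaluate the prior and the likelihood on a grid of θ values and renormalize. A minimal base-Matlab sketch, where the grid and the data values n = 24, Y = 17 are illustrative (the same numbers return in the taxi example further on):

  % Posterior on a grid: posterior ~ prior x likelihood.
  theta = 0.001:0.001:0.999;              % grid over the parameter
  n = 24;  Y = 17;                        % illustrative binomial data
  prior = ones(size(theta));              % uniform prior on (0, 1)
  lik   = theta.^Y .* (1 - theta).^(n - Y);   % binomial likelihood kernel
  post  = prior .* lik;
  post  = post / trapz(theta, post);      % normalize so it integrates to one
  [~, i] = max(post);
  fprintf('posterior mode = %.3f\n', theta(i));   % analytically Y/n = 0.708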
Example: the binomial distribution
Suppose X1, X2, …, Xn are independent random draws from the same Bernoulli distribution with parameter π. Thus Xi ~ Bernoulli(π) for i ∈ {1, …, n}, or equivalently Y = Σ Xi ~ Binomial(n, π). The joint distribution of Y and π is the product of the conditional distribution of Y and the prior distribution of π. What distribution might be a reasonable choice for the prior distribution of π?

The binomial distribution (cont.)
If Y ~ Bin(n, π), a reasonable prior distribution for π must be bounded between zero and one. One option is the uniform distribution, π ~ Unif(0, 1):
  p(π | Y) ∝ fUnif(0,1)(π) · fBin(Y | π) = 1 · C(n, Y) π^Y (1−π)^(n−Y)
p(π | Y) is the posterior distribution of π.

The binomial distribution (cont.)
Let Y ~ Bin(n, π) and π ~ Unif(0, 1). Then
  p(π | Y) ∝ π^Y (1−π)^(n−Y)
The pdf of the beta distribution, which is known to be proper, is
  Beta(x | α, β) = [Γ(α+β) / (Γ(α)Γ(β))] x^(α−1) (1−x)^(β−1)   (0 ≤ x ≤ 1 and α, β > 0)
(Note: Γ(k) = (k−1)! [gamma function].)
Let x = π, α = Y + 1, β = n − Y + 1. Thus
  p(π | Y, n) = [Γ(n+2) / (Γ(Y+1)Γ(n−Y+1))] π^((Y+1)−1) (1−π)^((n−Y+1)−1)
so p(π | Y, n) ~ Beta(Y+1, n−Y+1). The factor Γ(n+2)/(Γ(Y+1)Γ(n−Y+1)) is the normalization constant that turns π^Y (1−π)^(n−Y) into a beta density.

Application: taxi licenses
An inspector from IVW examined n = 24 taxi drivers at Den Haag HS as to whether or not they hold a valid license. In this case, 17 drivers showed a valid license. Let Xi = 1 if driver i showed a valid license and Xi = 0 otherwise. Let Σi Xi = Y ~ Bin(24, π) and let π ~ Unif(0, 1). Based on the previous slide, p(π | Y, n) ~ Beta(Y+1, n−Y+1). Substituting n = 24 and Y = 17 into the posterior distribution:
  p(π | Y, n) = Beta(18, 8)

The posterior distribution
The posterior distribution summarizes all that we know after analyzing the data. How do we interpret the posterior distribution p(π | Y, n) = Beta(18, 8)? One option is graphically.
[Figure: the Beta(18, 8) posterior density over π from 0 to 1]

Posterior summaries
The full posterior contains too much information, especially in multi-parameter models. So we use summary statistics (e.g. mean, variance, HDR). Two methods for generating summary statistics:
1) Analytical solutions: use the well-known analytic expressions for the mean, variance, etc. of the various posterior distributions.
2) Numerical solutions: use a random number generator to draw a large number of values from the posterior distribution, then compute summary statistics from those random draws.

Analytic summaries of the posterior
Continuing our example, p(π | Y, n) ~ Beta(18, 8). If π ~ Beta(α, β), then analytically
  E(π) = α / (α + β)
  Var(π) = αβ / [(α + β)² (α + β + 1)]
  Mode(π) = (α − 1) / (α + β − 2)
Here:
  E(π) = 18 / (18 + 8) ≈ 0.69
  Var(π) = 18·8 / [(18 + 8)² (18 + 8 + 1)] ≈ 0.01
  Mode(π) = (18 − 1) / (18 + 8 − 2) ≈ 0.71

Numerical summaries of the posterior
To create numerical summaries from the posterior, you need a random number generator. To summarize p(π | Y, n) ~ Beta(18, 8):
• Draw a large number of random samples from a Beta(18, 8) distribution.
• Calculate the sample statistics from that set of random samples.
Output from Matlab: Mean(π) = .70, Median(π) = .70, Var(π) = .01
[Figure: histograms of the random draws from Beta(18, 8)]

Highest [posterior] density regions
(also known as Bayesian confidence or credible intervals)
Highest density regions (HDRs) are intervals containing a specified posterior probability. For the Beta(18, 8) posterior, the 95% highest posterior density region is [.51, .84].
[Figure: the Beta(18, 8) density with the 95% HDR marked]

Identification of the HDR
It is easiest to find the highest density region numerically. In Matlab, to find the 95% HDR:
• take 1000 draws from the posterior;
• sort the draws, then identify the thresholds for the 95% credible interval.
A runnable version is sketched below.
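A runnable version of those steps might look as follows (betarnd is in the Statistics Toolbox; the number of draws is illustrative):

  % Simulation-based summaries of the Beta(18, 8) posterior.
  draws = betarnd(18, 8, 100000, 1);      % random draws from the posterior
  fprintf('mean %.2f  median %.2f  var %.3f\n', ...
          mean(draws), median(draws), var(draws));
  s  = sort(draws);                       % sort the draws
  lo = s(round(0.025 * numel(s)));        % lower 2.5% threshold
  hi = s(round(0.975 * numel(s)));        % upper 97.5% threshold
  fprintf('95%% interval: [%.2f, %.2f]\n', lo, hi);

Strictly speaking, sorting and cutting off equal tails gives the central credible interval rather than the HDR; for a unimodal, fairly symmetric posterior such as Beta(18, 8) the two nearly coincide, which is presumably what the slide relies on, and the result should approximate the [.51, .84] region quoted above.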
Confidence intervals vs. Bayesian credible intervals
Differing interpretations. The Bayesian credible interval gives the probability, given the data, that the true value of θ lies in the interval. Technically,
  P(θ ∈ Interval | X) = ∫Interval p(θ | X) dθ
The frequentist α-percent confidence interval is the region of the sampling distribution of the estimator such that, given the observed data, one would expect (100 − α) percent of the future estimates of θ to fall outside that interval. Technically, the limits a and b satisfy
  ∫ from a to b of g(u | θ) du = α (with the remaining 1 − α outside [a, b])
These limits are functions of the data; u is a dummy variable of integration for the estimated value of θ.

Confidence intervals vs. Bayesian credible intervals (cont.)
But often the results appear similar. If Bayesians use "non-informative priors" and there is a large number of observations (often several dozen will do), HDRs and frequentist confidence intervals will coincide numerically.

Returning to the binomial distribution
If Y ~ Bin(n, π), the uniform prior is just one of an infinite number of possible prior distributions. What other distributions could we use? A reasonable alternative to the Unif(0, 1) distribution is the beta distribution. For the random variable π,
  Beta(π | α, β) = [Γ(α+β) / (Γ(α)Γ(β))] π^(α−1) (1−π)^(β−1)

Prior consequences
[Figure: plots of 4 different beta densities: Beta(5, 5), Beta(3, 10), Beta(10, 3), Beta(100, 30)]

The binomial distribution with beta prior
If Y ~ Bin(n, π) and π ~ Beta(α, β), then
  f(π | Y) = C(n, Y) π^Y (1−π)^(n−Y) · [Γ(α+β)/(Γ(α)Γ(β))] π^(α−1) (1−π)^(β−1)
             / ∫ from 0 to 1 of C(n, Y) π^Y (1−π)^(n−Y) · [Γ(α+β)/(Γ(α)Γ(β))] π^(α−1) (1−π)^(β−1) dπ
Let's focus on f(y), the denominator:
  f(y) = ∫ from 0 to 1 of C(n, Y) [Γ(α+β)/(Γ(α)Γ(β))] π^(Y+α−1) (1−π)^(n−Y+β−1) dπ

The posterior predictive distribution
  f(y) = [Γ(n+1)/(Γ(Y+1)Γ(n−Y+1))] · [Γ(α+β)/(Γ(α)Γ(β))] · ∫ from 0 to 1 of π^(Y+α−1) (1−π)^(n−Y+β−1) dπ
The integrand is the kernel of a beta distribution. Multiplying and dividing by the appropriate normalizing constant:
  f(y) = [Γ(n+1)/(Γ(Y+1)Γ(n−Y+1))] · [Γ(α+β)/(Γ(α)Γ(β))] · [Γ(Y+α)Γ(n−Y+β)/Γ(α+β+n)]
         · ∫ from 0 to 1 of [Γ(α+β+n)/(Γ(Y+α)Γ(n−Y+β))] π^(Y+α−1) (1−π)^(n−Y+β−1) dπ
Since the last factor is the integral of a beta pdf over the parameter space of π, it equals one. Thus
  f(y) = [Γ(n+1) Γ(α+β) Γ(Y+α) Γ(n−Y+β)] / [Γ(Y+1) Γ(n−Y+1) Γ(α) Γ(β) Γ(α+β+n)]
This is called a beta-binomial distribution.

The posterior of the binomial model with beta priors
With f(y) as above,
  f(π | y) = C(n, Y) π^Y (1−π)^(n−Y) · [Γ(α+β)/(Γ(α)Γ(β))] π^(α−1) (1−π)^(β−1) / f(y)
Simplifying this expression:
  f(π | y) = [Γ(α+β+n) / (Γ(Y+α)Γ(n−Y+β))] π^(Y+α−1) (1−π)^(n−Y+β−1)
This is a Beta(Y+α, n−Y+β) distribution. Beautifully, it worked out that the posterior distribution is a form of the prior distribution updated by the new data. In general, when this occurs we say the prior is conjugate.
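Because the prior is conjugate, updating is pure parameter arithmetic. The sketch below (betapdf is in the Statistics Toolbox; the Beta(5, 5) prior anticipates the example that follows) checks the closed-form posterior against the grid computation used earlier:

  % Conjugate check: Beta(a, b) prior + Y successes in n trials
  % should give a Beta(a+Y, b+n-Y) posterior.
  a = 5;  b = 5;                          % illustrative prior parameters
  n = 24; Y = 17;                         % data
  theta = 0.001:0.001:0.999;
  grid_post = betapdf(theta, a, b) .* theta.^Y .* (1 - theta).^(n - Y);
  grid_post = grid_post / trapz(theta, grid_post); % normalized numerically
  conj_post = betapdf(theta, a + Y, b + n - Y);    % closed-form posterior
  fprintf('max abs difference: %.2e\n', max(abs(grid_post - conj_post)));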
Continuing the earlier example: if 17 of the 24 taxi drivers have a valid license (so Y = 17 and n = 24, where Y is binomial) and you use a Beta(5, 5) prior, the posterior distribution is Beta(17+5, 24−17+5) = Beta(22, 12), with posterior mean .65 and posterior variance .01.
[Figure: the Beta(5, 5) prior and the Beta(22, 12) posterior]

Comparison of four different posterior distributions (in red) for the four different priors (black)
[Figure: four panels, each showing prior and posterior:
  prior Beta(5, 5), posterior Beta(22, 12);
  prior Beta(10, 3), posterior Beta(27, 10);
  prior Beta(3, 10), posterior Beta(20, 17);
  prior Beta(100, 30), posterior Beta(117, 37)]

Summary statistics of the findings for different priors

  Prior           Posterior        Prior mean   Prior var.   Posterior mean   Posterior var.
  Beta(1,1)       Beta(18,8)       .5           .08          .692             .008
  Beta(5,5)       Beta(22,12)      .5           .02          .647             .007
  Beta(3,10)      Beta(20,17)      .23          .01          .541             .007
  Beta(10,3)      Beta(27,10)      .77          .01          .730             .005
  Beta(100,30)    Beta(117,37)     .77          .001         .760             .001

Resume
• The basics of probability theory, as the foundation for the course;
• Estimation of distribution parameters;
• Sampling theory, working both with and without prior information (Bayesian versus classical sampling).