Probability and Sampling Theory
(Kansrekening en steekproeftheorie)
Pieter van Gelder
TU Delft
IVW course, 16 September 2003
The basics of probability theory as the foundation for the course;
Estimation of distribution parameters;
Sampling theory, working both with and without prior information
(Bayesian versus classical sampling);
Dependencies between variables and risks.
Inspection in Civil Engineering
Stochastic variables
Outline
• What is a stochastic variable?
• Probability distributions
• Fast characteristics
• Distribution types
• Two stochastic variables
• Closure
Stochastic variable
• Quantity that cannot be predicted exactly (uncertainty):
– Natural variation
– Shortage of statistical data
– Schematizations
• Examples:
– Strength of concrete
– Water level above a tunnel
– Lifetime of a chisel
– Throw of a die
Relation to events
• Express uncertainty in terms of probability
• Probability theory related to events
• Connect value of variable to event
• E.g. probability that stochastic variable X
– is less than x
– is greater than x
– is equal to x
– is in the interval [x, x+Δx]
– etc.
Probability distribution
• Probability distribution function = probability P(X ≤ x):
• FX(x) = P(X ≤ x)
[Figure: probability distribution function FX(x) plotted against x, rising from 0 to 1]
Probability density
• Familiar form of the probability 'distribution':
[Figure: bell-shaped curve of probability density against x]
• This is the probability density function
Probability density
• Differentiation of F with respect to x:
• fX(x) = dFX(x) / dx
• f = probability density function
• fX(x) dx = P(x < X ≤ x+dx)
[Figure: distribution function FX(x) with P(X ≤ x) indicated, and density fX(x) with P(x < X ≤ x+dx) shown as the area of a strip between x and x+dx]
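This relation between F and f can be checked numerically. A minimal Matlab sketch (the standard normal distribution is assumed here purely as an example; erf is base Matlab):
x = -4:0.01:4;
F = 0.5 * (1 + erf(x / sqrt(2)));    % standard normal FX(x)
f = diff(F) ./ diff(x);              % fX(x) = dFX(x)/dx by finite differences
plot(x(1:end-1), f)                  % recovers the bell-shaped density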
Discrete and continuous
discrete variable:
[Figure: probability mass function pX(x) (left) and stepwise cumulative distribution function FX(x) (right) for x = 0, 1, …, 6]
continuous variable:
[Figure: probability density fX(x) (left) and (cumulative) probability distribution FX(x) (right) for a continuous variable]
Fast characteristics
[Figure: probability density fX(x) with the mean μX and the standard deviation σX indicated]
μX: mean, indication of location
σX: standard deviation, indication of spread
Fast characteristics
[Figure: skewed probability density fX(x) with mean μX and standard deviation σX indicated]
Mean ≠ location of the maximum (mode)
Fast characteristics
• Mean (centre of gravity):
μX = ∫ x fX(x) dx   (integral from -∞ to ∞)
• Variance:
σX² = ∫ (x - μX)² fX(x) dx   (integral from -∞ to ∞)
• Standard deviation: σX
• Coefficient of variation: VX = σX / μX
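These fast characteristics can be approximated by numerical integration of a given density. A minimal Matlab sketch (a normal density with mean 2 and standard deviation 1 is assumed purely as an example):
x  = -8:0.001:12;
f  = exp(-0.5 * (x - 2).^2) / sqrt(2*pi);   % example density fX(x)
mX = trapz(x, x .* f);                      % mean (centre of gravity)
sX = sqrt(trapz(x, (x - mX).^2 .* f));      % standard deviation
VX = sX / mX                                % coefficient of variation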
Normal distribution
[Figure: normal probability density functions fX(x), symmetric around the mean μX, with standard deviation σX indicated]
Completely determined by mean and standard deviation
Normal distribution
• Probability density function:
fX(x) = 1 / (σ√(2π)) · exp( -½ ((x - μ)/σ)² )
• Standard normally distributed variable (often denoted by u):
μu = 0, σu = 1
Normal distribution
• Why so popular?
• Central limit theorem: the sum of many variables with arbitrary distributions is (almost) normally distributed.
• Convenient in structural reliability calculations
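A minimal Matlab sketch of the central limit theorem (sums of uniform variables are an arbitrary choice, purely for illustration):
S = sum(rand(12, 10000));   % each column: sum of 12 Unif(0,1) variables
hist(S, 30)                 % histogram is already close to a normal density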
Two stochastic variables
joint probability density function
[Figure: surface plot of the joint probability density fXY(x, y) over the (x, y) plane]
Contour map probability density
[Figure: contour map of the joint probability density in the (x, y) plane]
Two stochastic variables
• Relation to events:
fXY(x, y) dx dy = P(x < X ≤ x+dx and y < Y ≤ y+dy)
FXY(x, y) = P(X ≤ x and Y ≤ y)
[Figure: contour map with a small rectangle of sides dx and dy marking the event]
Example
• Health survey. Measurements of:
• Length
[Figure: probability density (1/m) of length (m), for lengths between roughly 1.2 and 2.6 m]
• Weight
[Figure: probability density (1/kg) of weight (kg), for weights between roughly 50 and 110 kg]
Logical contour map?
[Figure: contour map of weight (kg) against length (m)]
Dependency
[Figure: contour map of weight (kg) against length (m), illustrating dependence between the two variables]
Fast characteristics
• Location: μX, μY (means)
• Spread: σX, σY (standard deviations)
• Dependency: covXY (covariance);
ρXY = covXY / (σX σY)   correlation, between -1 and 1
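A minimal Matlab sketch of these characteristics for two dependent variables (the synthetic length/weight data below are assumed purely for illustration):
n = 1000;
x = 1.75 + 0.10 * randn(n, 1);        % 'length' (m)
y = 50 + 20 * x + 5 * randn(n, 1);    % 'weight' (kg), dependent on x
c = cov(x, y);                        % 2x2 covariance matrix
rho = c(1,2) / sqrt(c(1,1) * c(2,2))  % correlation, between -1 and 1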
Independent variables
FXY(x, y) = FX(x) FY(y)
fXY(x, y) = fX(x) fY(y)
covXY = 0
ρXY = covXY / (σX σY) = 0
Closure of the short Introduction to Stochastics
• What is a stochastic variable?
• Probability distributions
• Fast characteristics
• Distribution types
• Two stochastic variables
Parameter estimation methods
• Given a dataset x1, x2, …, xn
• Given a distribution type F(x|A,B,…)
• How to fit the unknown parameters A, B, … to the data?
List of estimation methods
• MoM (method of moments)
• ML (maximum likelihood)
• LS (least squares)
• Bayes
MoM
• Distribution moments = sample moments:
∫ x^k f(x) dx = (1/n) Σ xi^k
• Example: F(x) = 1 - exp[-(x - A)/B]
Since this shifted exponential has mean A + B and standard deviation B:
BMOM = std(x)
AMOM = mean(x) - std(x)
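A minimal Matlab sketch of this MoM fit (data are simulated from the shifted exponential itself, with A = 2 and B = 0.5 assumed purely for illustration):
A = 2; B = 0.5;
x = A - B * log(rand(1000, 1));   % inverse-transform samples from F
B_MOM = std(x)                    % should be close to B
A_MOM = mean(x) - std(x)          % should be close to A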
Binomial distribution
• X ~ Bin(N, p)
• The binomial distribution gives the discrete probability distribution of obtaining exactly n successes out of N Bernoulli trials (where the result of each Bernoulli trial is true with probability p and false with probability q = 1 - p). The binomial distribution is therefore given by:
• fX(n) = (N choose n) p^n q^(N-n)
• E(X) = Np; var(X) = Npq
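A minimal Matlab sketch evaluating this pmf and checking the moment formulas (N = 10 and p = 0.2, matching the simulation on the next slide):
N = 10; p = 0.2; q = 1 - p;
n = 0:N;
f = arrayfun(@(k) nchoosek(N, k), n) .* p.^n .* q.^(N - n);
EX   = sum(n .* f)              % equals Np  = 2
varX = sum((n - EX).^2 .* f)    % equals Npq = 1.6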
MoM-estimator of p
• pMOM = Σ xi / N
• Matlab simulation:
for j = 1:M
    x = zeros(1, N);               % preallocate the N Bernoulli trials
    for i = 1:N
        if rand(1) < p, x(i) = 1; end
    end
    y(j) = sum(x);                 % number of successes in sample j
end
for j = 1:M
    pMOM(j) = y(j) / N;            % MoM estimate from sample j
end
hist(pMOM)
Performance of p-estimation (N = 10; p = 0.2):
[Figure: histogram of the simulated pMOM values, frequency against p from 0 to 0.7]
Case Study
• Webtraffic statistics
– The number of pageviews on websites
Statistics on Usage of Screen sizes
• Is it necessary to record the screen size of every single user?
• Is it sufficient to inspect the screen size of just N users, and still obtain a reliable percentage of the used screen sizes?
Assume 41% of the complete population uses size 1024x768.
• Inspection sample size N = 100, 1000, …; simulate the results by generating the usage from a binomial distribution.
• Theoretical analysis: CoV = sqrt(1/p - 1) · N^(-1/2)
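A minimal Matlab sketch comparing the simulated coefficient of variation with this formula (p = 0.41 and N = 1000 assumed, as in the case study):
p = 0.41; N = 1000; M = 5000;
y = sum(rand(N, M) < p);             % M simulated binomial inspections
pHat = y / N;                        % MoM estimates of p
cov_sim    = std(pHat) / mean(pHat)  % simulated CoV
cov_theory = sqrt(1/p - 1) / sqrt(N) % theoretical CoV, approx. 0.038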
Coefficient of variation (as a function of p and N)

p        N = 100    N = 1 000   N = 10 000   N = 10^6
41.4%    11.75%     3.7%        1.2%         0.1%
39.8%    12.3%      3.9%        1.3%         0.1%
6.2%     38.9%      12.3%       3.9%         0.4%
5.4%     41.8%      13.2%       4.2%         0.4%
3.2%     55.0%      17.4%       5.5%         0.55%
Optimisation of the inspection sample size
• Assume the cost of getting screen-size information from a user is A
• Assume the cost of having a larger cov-value is B
• TC(N) = A·N + B·sqrt(1/p - 1)·N^(-1/2)
• The optimal sample size follows from TC'(N) = A - ½·B·sqrt(1/p - 1)·N^(-3/2) = 0,
giving N* = [B/(2A) · sqrt(1/p - 1)]^(2/3)
• For this choice of N, the cov = (2A/B · (1/p - 1))^(1/3)
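A minimal Matlab sketch of this optimisation (the cost values A and B are assumed purely for illustration):
A = 0.01; B = 100; p = 0.41;
N  = 1:20000;
TC = A * N + B * sqrt(1/p - 1) ./ sqrt(N);       % total cost curve
[~, Nstar_numeric] = min(TC)                     % numerical optimum
Nstar_formula = (B/(2*A) * sqrt(1/p - 1))^(2/3)  % closed-form optimum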
Case study: container inspection
• Acceptable 'slip-through probability' p = 1/1,000 containers
• The population consists of 100,000 containers
• The inspection consists of checking 1,000 containers
• Suppose that 1 container in this sample is rejected
• Then pMoM = 0.001, and std(pMoM) = 0.001
• If std(pMoM) < 0.0001 is required, the full population must be inspected (after all, std(pMoM) = sqrt(pq/N) ≈ sqrt(p/N))
Inspection of the full population (at small p-values)
• The inspection costs must be recovered from the fine revenues
• Inspection costs: 100,000 × K
• Revenue without inspection: p × 100,000 × NI (Negative Impact)
• Revenue with inspection: p × 100,000 × fine - 100,000 × K
• p × 100,000 × fine - 100,000 × K > p × 100,000 × NI
• fine > K/p + NI
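A minimal Matlab sketch of this break-even condition (the K and NI values are assumed purely for illustration):
p = 0.001; K = 10; NI = 5000; pop = 100000;
fine = K/p + NI                               % break-even fine: 15000
withInspection    = p * pop * fine - pop * K  % revenue with inspection
withoutInspection = p * pop * NI              % revenue without inspection
% at the break-even fine the two revenues are equal (5.0e5)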
Bayesian analysis of a one-parameter
model
I. The binomial distribution—uniform prior
II. Posterior Interpretation
III. Binomial distribution—beta prior
Conjugate priors and sufficient statistics
Review of the Bayesian Setup
From the Bayesian perspective, there are known and unknown quantities.
- The known quantity is the data, denoted D.
- The unknown quantities are the parameters (e.g. mean, variance, missing data), denoted θ.
To make inferences about the unknown quantities, we stipulate a joint probability function that describes how we believe these quantities behave in conjunction, p(θ and D).
Using Bayes' Rule, this joint probability function can be rearranged to make inference about θ:
p(θ | D) = p(θ) p(D | θ) / p(D)
Review of the Bayesian Set-Up cont.
p(θ | D) = p(θ) p(D | θ) / p(D) = p(θ) L(θ | D) / ∫ p(θ) p(D | θ) dθ
L(θ | D) is the likelihood function for θ.
∫ p(θ) p(D | θ) dθ is the normalizing constant or the prior predictive distribution.
It is the normalizing constant because it ensures that the posterior distribution of θ integrates to one.
It is the prior predictive distribution because it is not conditional on a previous observation of the data-generating process (prior) and because it is the distribution of an observable quantity (predictive).
Review of the Bayesian Set-Up cont.
p(θ | D) = p(θ) p(D | θ) / p(D) = p(θ) L(θ | D) / ∫ p(θ) p(D | θ) dθ
This is often rewritten in more compact notation:
p(θ | D) ∝ p(θ) L(θ | D)
i.e. posterior ∝ prior × likelihood
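A minimal Matlab sketch of "posterior ∝ prior × likelihood" on a grid (Bernoulli data with a uniform prior; the values Y = 17 successes out of n = 24 are taken from the taxi example later in this course):
theta = linspace(0, 1, 1001);
prior = ones(size(theta));                 % Unif(0,1) prior
Y = 17; n = 24;
like  = theta.^Y .* (1 - theta).^(n - Y);  % binomial likelihood kernel
post  = prior .* like;                     % unnormalized posterior
post  = post / trapz(theta, post);         % normalize: integrates to one
plot(theta, post)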
Example: The Binomial Distribution
Suppose X1, X2, …, Xn are independent random draws from the same Bernoulli distribution with parameter π.
Thus, Xi ~ Bernoulli(π) for i ∈ {1, …, n}
or equivalently, Y = Σ Xi ~ Binomial(π, n).
The joint distribution of Y and π is the product of the conditional distribution of Y and the prior distribution of π.
What distribution might be a reasonable choice for the prior distribution of π?
Binomial Distribution cont.
If Y ~ Bin(π, n), a reasonable prior distribution for π must be bounded between zero and one.
One option is the uniform distribution: π ~ Unif(0, 1).
p(π | Y) ∝ fUnif(0,1)(π) fBin(Y | π) = 1 × (n choose Y) π^Y (1 - π)^(n-Y)
p(π | Y) is the posterior distribution of π.
Binomial Distribution cont.
Let Y ~ Bin(π, n) and π ~ Unif(0, 1).
p(π | Y) ∝ fUnif(0,1)(π) fBin(Y | π)
= 1 × (n choose Y) π^Y (1 - π)^(n-Y)
∝ π^Y (1 - π)^(n-Y)
The pdf of the beta distribution, which is known to be proper, is:
Beta(x | α, β) = Γ(α + β) / (Γ(α) Γ(β)) · x^(α-1) (1 - x)^(β-1)   (0 ≤ x ≤ 1 and α, β > 0)
[Gamma function] Note: Γ(k) = (k - 1)! for integer k.
Let x = π, α = Y + 1, β = n - Y + 1.
Thus, p(π | Y, n) ~ Beta(Y + 1, n - Y + 1) = Γ(n + 2) / (Γ(Y + 1) Γ(n - Y + 1)) · π^((Y+1)-1) (1 - π)^((n-Y+1)-1)
p(π | Y, n) = Γ(n + 2) / (Γ(Y + 1) Γ(n - Y + 1)) · π^Y (1 - π)^(n-Y)
Γ(n + 2) / (Γ(Y + 1) Γ(n - Y + 1)) is the normalization constant that transforms π^Y (1 - π)^(n-Y) into a beta density.
Application - Taxi licenses
An inspector from IVW examined n = 24 taxi drivers at Den Haag HS as to whether or not they have a valid license; the number of valid licenses is denoted Y. In this case, 17 drivers showed a valid license.
Let Xi = 1 if driver i showed a valid license and Xi = 0 otherwise.
Let Σi Xi = Y ~ Bin(π, 24) and let π ~ Unif(0, 1).
Based on the previous slide: p(π | Y, n) ~ Beta(Y + 1, n - Y + 1).
Substituting n = 24 and Y = 17 into the posterior distribution:
p(π | Y, n) = Beta(18, 8)
The Posterior Distribution
The posterior distribution summarizes all that we know after analyzing the data.
How do we interpret the posterior distribution p(π | Y, n) = Beta(18, 8)?
One option is graphically…
[Figure: the Beta(18, 8) posterior density plotted against π on (0, 1)]
Posterior Summaries
The full posterior contains too much information, especially in multi-parameter models. So, we use summary statistics (e.g. mean, variance, HDR).
Two methods for generating summary stats:
1) Analytical solutions: use the well-known analytic solutions for the mean, variance, etc. of the various posterior distributions.
2) Numerical solutions: use a random number generator to draw a large number of values from the posterior distribution, then compute summary stats from those random draws.
Analytic Summaries of the Posterior
Continuing our example, p(π | Y, n) ~ Beta(18, 8).
If π ~ Beta(α, β), analytically:
E(π) = α / (α + β)
Var(π) = αβ / [(α + β)² (α + β + 1)]
Mode(π) = (α - 1) / (α + β - 2)
Here:
E(π) = 18 / (18 + 8) ≈ 0.69
Var(π) = 18(8) / [(18 + 8)² (18 + 8 + 1)] ≈ 0.01
Mode(π) = (18 - 1) / (18 + 8 - 2) ≈ 0.71
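A minimal Matlab sketch evaluating these analytic summaries:
a = 18; b = 8;
postMean = a / (a + b)                      % 0.69
postVar  = a*b / ((a + b)^2 * (a + b + 1))  % approx. 0.008
postMode = (a - 1) / (a + b - 2)            % approx. 0.71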
Numerical Summaries of the Posterior
To create numerical summaries from the posterior,
you need a random number generator.
To summarize p(|Y,n) ~ Beta(18,8)
• Draw a large number of random samples from a
Beta(18,8) distribution
• Calculate the sample statistics from that set of
random samples.
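A minimal Matlab sketch in base Matlab (betarnd from the Statistics Toolbox would work equally well; here the Beta(18, 8) draws are built from the gamma-ratio construction):
a = 18; b = 8; M = 10000;
g1 = randg(a, M, 1);             % Gamma(a,1) draws
g2 = randg(b, M, 1);             % Gamma(b,1) draws
s  = g1 ./ (g1 + g2);            % Beta(a,b) draws
[mean(s), median(s), var(s)]     % approx. [0.70, 0.70, 0.01]
hist(s, 30)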
Numerical Summaries of the Posterior
Output from Matlab:
Mean(π) = 0.70
Median(π) = 0.70
Var(π) = 0.01
[Figure: histograms of the random draws from Beta(18, 8)]
Highest [Posterior] Density Regions
(also known as Bayesian confidence or credible intervals)
Highest Density Regions (HDRs) are intervals containing a specified posterior probability. The figure below plots the 95% highest posterior density region.
[Figure: the Beta(18, 8) posterior density with the 95% HDR [0.51, 0.84] marked]
Identification of the HDR
It is easiest to find the Highest Density Region numerically.
In Matlab, to find the 95% HDR (see the sketch below):
# take a large number of draws from the posterior
# sort the draws, then identify the thresholds of the 95% credible interval
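A minimal Matlab sketch of this procedure (for a unimodal posterior the HDR is the shortest interval containing 95% of the sorted draws; s holds the posterior draws from the previous sketch):
d = sort(s);                 % sorted posterior draws
M = numel(d);
k = floor(0.95 * M);         % number of draws inside a 95% interval
w = d(1+k:M) - d(1:M-k);     % widths of all candidate 95% intervals
[~, i] = min(w);             % shortest interval = HDR
hdr = [d(i), d(i+k)]         % approx. [0.51, 0.84]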
Confidence Intervals vs.
Bayesian Credible Intervals
Differing interpretations…
The Bayesian credible interval gives the probability, given the data, that the true value of π lies in the interval.
Technically, P(π ∈ Interval | X) = ∫Interval p(π | X) dπ
The frequentist α-percent confidence interval is the region of the sampling distribution for π such that, given the observed data, one would expect (100 - α) percent of the future estimates of π to be outside that interval.
Technically, α = 1 - ∫ab g(u | π) du, where the limits a and b are functions of the data and u is a dummy variable of integration for the estimated value of π.
Confidence Intervals vs.
Bayesian Credible Intervals
But often the results appear similar…
If Bayesians use "non-informative priors" and there is a large number of observations (often several dozen will do), HDRs and frequentist confidence intervals will coincide numerically.
Returning to the Binomial Distribution
If Y ~ Bin(n, π), the uniform prior is just one of an infinite number of possible prior distributions.
What other distributions could we use?
A reasonable alternative to the Unif(0, 1) distribution is the beta distribution.
For random variable π: Beta(α, β) = Γ(α + β) / (Γ(α) Γ(β)) · π^(α-1) (1 - π)^(β-1)
Prior Consequences
[Figure: plots of four different beta distributions: Beta(3,10), Beta(5,5), Beta(10,3) and Beta(100,30)]
The Binomial Distribution with Beta Prior
If Y ~ Bin(n, π) and π ~ Beta(α, β), then:
f(π | Y) = [ (n choose Y) π^Y (1 - π)^(n-Y) × Γ(α + β)/(Γ(α)Γ(β)) π^(α-1) (1 - π)^(β-1) ]
/ [ ∫0^1 (n choose Y) π^Y (1 - π)^(n-Y) × Γ(α + β)/(Γ(α)Γ(β)) π^(α-1) (1 - π)^(β-1) dπ ]
Let's focus on f(y) (the denominator):
f(y) = ∫0^1 (n choose Y) Γ(α + β)/(Γ(α)Γ(β)) π^(Y+α-1) (1 - π)^(n-Y+β-1) dπ
The posterior predictive distribution
f(y) = ∫0^1 (n choose Y) Γ(α + β)/(Γ(α)Γ(β)) π^(Y+α-1) (1 - π)^(n+β-Y-1) dπ
Writing (n choose Y) = Γ(n + 1) / (Γ(Y + 1) Γ(n - Y + 1)) and pulling the constants out of the integral:
f(y) = Γ(n + 1) Γ(α + β) / [Γ(Y + 1) Γ(n - Y + 1) Γ(α) Γ(β)] · ∫0^1 π^(Y+α-1) (1 - π)^(n+β-Y-1) dπ
The integrand is the kernel of the beta distribution, so multiply and divide by its normalization constant:
f(y) = Γ(n + 1) Γ(α + β) / [Γ(Y + 1) Γ(n - Y + 1) Γ(α) Γ(β)] · Γ(Y + α) Γ(n + β - Y) / Γ(α + n + β)
· ∫0^1 Γ(α + n + β) / [Γ(Y + α) Γ(n + β - Y)] π^(Y+α-1) (1 - π)^(n+β-Y-1) dπ
Since ∫0^1 Γ(α + n + β) / [Γ(Y + α) Γ(n + β - Y)] π^(Y+α-1) (1 - π)^(n+β-Y-1) dπ is the integral of the beta pdf over the parameter space for π, this expression equals one.
Thus, f(y) = Γ(n + 1) Γ(α + β) Γ(Y + α) Γ(n + β - Y) / [Γ(Y + 1) Γ(n - Y + 1) Γ(α) Γ(β) Γ(α + n + β)]
This is called a beta-binomial distribution.
The posterior of the binomial model with beta priors
Since f(y) = Γ(n + 1) Γ(α + β) Γ(Y + α) Γ(n + β - Y) / [Γ(Y + 1) Γ(n - Y + 1) Γ(α) Γ(β) Γ(α + n + β)],
f(π | y) = [ Γ(n + 1)/(Γ(Y + 1) Γ(n - Y + 1)) · π^Y (1 - π)^(n-Y) × Γ(α + β)/(Γ(α)Γ(β)) · π^(α-1) (1 - π)^(β-1) ] / f(y)
Simplifying the above expression:
f(π | y) = Γ(α + n + β) / [Γ(Y + α) Γ(n + β - Y)] · π^(Y+α-1) (1 - π)^(n+β-Y-1)
This is a Beta(Y + α, n - Y + β) distribution.
Beautifully, it worked out that the posterior distribution is a form of the prior distribution updated by the new data. In general, when this occurs we say the prior is conjugate.
Continuing the earlier example: if 17 of 24 taxi drivers have a valid license (so Y = 17 and n = 24, where Y is binomial) and you use a Beta(5, 5) prior, the posterior distribution is Beta(17 + 5, 24 - 17 + 5) = Beta(22, 12).
Posterior Mean = 0.65
Posterior Variance = 0.01
[Figure: the Beta(5, 5) prior and the Beta(22, 12) posterior densities]
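A minimal Matlab sketch of this conjugate update for the taxi example:
a0 = 5; b0 = 5;          % prior Beta(a0, b0)
Y = 17; n = 24;          % observed data
a1 = a0 + Y              % posterior alpha = 22
b1 = b0 + n - Y          % posterior beta  = 12
postMean = a1 / (a1 + b1)                         % 0.65
postVar  = a1*b1 / ((a1 + b1)^2 * (a1 + b1 + 1))  % approx. 0.01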
Comparison of four different posterior distributions (in red) for the four different priors (black)
• Prior: Beta(3,10)   → Post: Beta(20,17)
• Prior: Beta(5,5)    → Post: Beta(22,12)
• Prior: Beta(10,3)   → Post: Beta(27,10)
• Prior: Beta(100,30) → Post: Beta(117,37)
[Figure: the four prior densities (black) together with the corresponding posterior densities (red)]
Summary Statistics of the Findings for different priors

Summary Table

Prior           Posterior      Prior Mean   Prior Var.   Posterior Mean   Posterior Var.
Beta(1,1)       Beta(18,8)     .5           .08          .692             .008
Beta(5,5)       Beta(22,12)    .5           .02          .647             .007
Beta(3,10)      Beta(20,17)    .23          .01          .541             .007
Beta(10,3)      Beta(27,10)    .77          .01          .730             .005
Beta(100,30)    Beta(117,37)   .77          .001         .760             .001
Summary
• The basics of probability theory as the foundation for the course;
• Estimation of distribution parameters;
• Sampling theory, working both with and without prior information (Bayesian versus classical sampling).