Auxiliary Material for Manuscript 2011WR011289:
Analysis of Regression Confidence Intervals and Bayesian Credible Intervals for Uncertainty Quantification [1]

Dan Lu [2], Ming Ye [2], Mary C. Hill [3]

[2] Department of Scientific Computing, Florida State University, Tallahassee, Florida, USA
[3] U.S. Geological Survey, Boulder, Colorado, USA

March 6, 2012

[1] Lu, D., M. Ye, and M. C. Hill (2012), Analysis of regression confidence intervals and Bayesian credible intervals for uncertainty quantification, Water Resour. Res., doi:10.1029/2011WR011289.

Appendix A: Derivation of Posterior Parameter Distribution for Noninformative Prior
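For a linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \mathbf{C})$, where $\mathbf{C} = \sigma^2 \boldsymbol{\omega}^{-1}$, assume that $\boldsymbol{\beta}$ and $\sigma$ have independent prior distributions and denote $\boldsymbol{\theta} = (\boldsymbol{\beta}, \sigma)$, so that $p(\boldsymbol{\theta}) = p(\boldsymbol{\beta})\,p(\sigma)$. With Jeffreys' noninformative priors, i.e., $p(\boldsymbol{\beta}) \propto \text{constant}$ and $p(\sigma) \propto 1/\sigma$, based on Bayes' theorem $p(\boldsymbol{\theta} \mid \mathbf{y}) = p(\mathbf{y} \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta}) / p(\mathbf{y})$ with Gaussian likelihood function,

$$
p(\mathbf{y} \mid \boldsymbol{\theta}) = \frac{\exp\left[-\dfrac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T \boldsymbol{\omega}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right]}{(2\pi\sigma^2)^{n/2}\,|\boldsymbol{\omega}|^{-1/2}},
\tag{A1}
$$

we have

$$
p(\boldsymbol{\beta}, \sigma \mid \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\theta}) \times \sigma^{-1}
\propto \sigma^{-(n+1)} \exp\left[-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T \boldsymbol{\omega}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right].
\tag{A2}
$$
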
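Since $\boldsymbol{\omega}$ is positive-definite, there exists an $n \times n$ nonsingular matrix $\mathbf{K}$ such that $\boldsymbol{\omega} = \mathbf{K}^T\mathbf{K}$; therefore,

$$
\begin{aligned}
(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T \boldsymbol{\omega}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})
&= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T \mathbf{K}^T\mathbf{K}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\
&= \|\mathbf{K}\mathbf{y} - \mathbf{K}\mathbf{X}\boldsymbol{\beta}\|^2 \\
&= \|\mathbf{K}\mathbf{y} - \mathbf{K}\mathbf{X}\hat{\mathbf{b}} + \mathbf{K}\mathbf{X}\hat{\mathbf{b}} - \mathbf{K}\mathbf{X}\boldsymbol{\beta}\|^2 \\
&= (\mathbf{y} - \mathbf{X}\hat{\mathbf{b}})^T \boldsymbol{\omega}\,(\mathbf{y} - \mathbf{X}\hat{\mathbf{b}}) + \|\mathbf{K}\mathbf{X}\hat{\mathbf{b}} - \mathbf{K}\mathbf{X}\boldsymbol{\beta}\|^2 \\
&= (n - p)s^2 + (\boldsymbol{\beta} - \hat{\mathbf{b}})^T \mathbf{X}^T\boldsymbol{\omega}\mathbf{X}\,(\boldsymbol{\beta} - \hat{\mathbf{b}}),
\end{aligned}
\tag{A3}
$$

where $\|\cdot\|$ defines the L2 norm of a vector, $\hat{\mathbf{b}}$ is the least-squares estimate, $\hat{\mathbf{b}} = (\mathbf{X}^T\boldsymbol{\omega}\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\omega}\mathbf{y}$, and $s^2$ is an unbiased estimate of $\sigma^2$, $s^2 = (\mathbf{y} - \mathbf{X}\hat{\mathbf{b}})^T \boldsymbol{\omega}\,(\mathbf{y} - \mathbf{X}\hat{\mathbf{b}}) / (n - p)$. Substituting (A3) into (A2) leads to

$$
p(\boldsymbol{\beta}, \sigma \mid \mathbf{y}) \propto \sigma^{-(n+1)} \exp\left\{-\frac{1}{2\sigma^2}\left[(n - p)s^2 + (\boldsymbol{\beta} - \hat{\mathbf{b}})^T \mathbf{X}^T\boldsymbol{\omega}\mathbf{X}\,(\boldsymbol{\beta} - \hat{\mathbf{b}})\right]\right\}.
\tag{A4}
$$
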
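The decomposition in (A3) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the problem sizes, the random data, and the way the positive-definite weight matrix is built are hypothetical and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 12, 3                       # hypothetical numbers of observations and parameters
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
M = rng.normal(size=(n, n))
omega = M.T @ M + n * np.eye(n)    # a symmetric positive-definite weight matrix

# Weighted least-squares estimate b_hat and variance estimate s^2, as defined below (A3)
XtwX = X.T @ omega @ X
b_hat = np.linalg.solve(XtwX, X.T @ omega @ y)
resid = y - X @ b_hat
s2 = (resid @ omega @ resid) / (n - p)

def quadratic_form(beta):
    """Left-hand side of (A3): (y - X beta)^T omega (y - X beta)."""
    r = y - X @ beta
    return r @ omega @ r

def decomposed_form(beta):
    """Last line of (A3): (n - p) s^2 + (beta - b_hat)^T X^T omega X (beta - b_hat)."""
    d = beta - b_hat
    return (n - p) * s2 + d @ XtwX @ d

beta_test = rng.normal(size=p)     # any parameter vector should satisfy the identity
print(quadratic_form(beta_test), decomposed_form(beta_test))
```

The two printed numbers should agree to rounding error, because the cross term vanishes at the least-squares estimate.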
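The distribution of $\boldsymbol{\beta}$ can be obtained by integrating (A2) with respect to $\sigma$. By analogy with the integral

$$
\int_0^\infty x^{-(n+1)} \exp(-a/x^2)\,dx = \frac{1}{2}\,a^{-n/2}\,\Gamma(n/2)
\tag{A5}
$$

of the gamma distribution, the posterior distribution of $\boldsymbol{\beta}$ is

$$
p(\boldsymbol{\beta} \mid \mathbf{y}) = \int_0^\infty p(\boldsymbol{\beta}, \sigma \mid \mathbf{y})\,d\sigma
\propto \left[(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T \boldsymbol{\omega}\,(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right]^{-n/2}.
\tag{A6}
$$

Substituting (A3) into (A6) leads to

$$
p(\boldsymbol{\beta} \mid \mathbf{y}) \propto \left[1 + \frac{(\boldsymbol{\beta} - \hat{\mathbf{b}})^T \mathbf{X}^T\boldsymbol{\omega}\mathbf{X}\,(\boldsymbol{\beta} - \hat{\mathbf{b}})}{(n - p)s^2}\right]^{-n/2}.
\tag{A7}
$$
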
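The integral identity (A5) used in this step can be verified numerically. The sketch below assumes NumPy; the values of n and a, the integration limits, and the grid size are hypothetical choices, and the trapezoidal rule is written out by hand rather than taken from a library.

```python
import math
import numpy as np

def integral_numeric(n, a, lower=1e-3, upper=50.0, num=500_001):
    """Trapezoidal approximation of the left-hand side of (A5).

    The integrand is evaluated through its logarithm to avoid overflow near zero.
    The contributions below `lower` and above `upper` are negligible for the
    values used here.
    """
    x = np.linspace(lower, upper, num)
    f = np.exp(-(n + 1) * np.log(x) - a / x**2)
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x)))

def integral_closed_form(n, a):
    """Right-hand side of (A5): (1/2) a^(-n/2) Gamma(n/2)."""
    return 0.5 * a ** (-n / 2) * math.gamma(n / 2)

n, a = 5, 2.0                      # hypothetical test values
numeric = integral_numeric(n, a)
closed = integral_closed_form(n, a)
print(numeric, closed)
```

In (A6) the identity is applied with $a$ equal to half of the quadratic form in (A2), which is why only the factor $a^{-n/2}$ survives the proportionality.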
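Equation (A7) is a special case of the $p$-dimensional multivariate t-distribution

$$
p(\mathbf{t}) = \frac{\Gamma\left(\tfrac{1}{2}(v + p)\right)}{(\pi v)^{p/2}\,\Gamma\left(\tfrac{1}{2}v\right)\,|\boldsymbol{\Sigma}|^{1/2}}
\left[1 + v^{-1}(\mathbf{t} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{t} - \boldsymbol{\mu})\right]^{-(v+p)/2}
\tag{A8}
$$

with $v = n - p$, $\boldsymbol{\Sigma} = s^2 (\mathbf{X}^T\boldsymbol{\omega}\mathbf{X})^{-1}$, and $\boldsymbol{\mu} = \hat{\mathbf{b}}$.

If $\sigma$ is known, it follows from (A4) that

$$
p(\boldsymbol{\beta} \mid \mathbf{y}) \propto \exp\left[-\frac{1}{2\sigma^2}(\boldsymbol{\beta} - \hat{\mathbf{b}})^T \mathbf{X}^T\boldsymbol{\omega}\mathbf{X}\,(\boldsymbol{\beta} - \hat{\mathbf{b}})\right]
= \exp\left[-\frac{1}{2}(\boldsymbol{\beta} - \hat{\mathbf{b}})^T \mathbf{X}^T\mathbf{C}^{-1}\mathbf{X}\,(\boldsymbol{\beta} - \hat{\mathbf{b}})\right],
\tag{A9}
$$

which is a multivariate Gaussian distribution.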
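As an illustration of (A4) and (A8), the joint posterior can be sampled by composition and compared with the covariance implied by the multivariate t-distribution. This is a sketch under stated assumptions rather than a procedure from the manuscript: it uses the standard result that integrating $\boldsymbol{\beta}$ out of (A4) gives $\sigma^2 \mid \mathbf{y} \sim v s^2 / \chi^2_v$, and the fact that a multivariate t-distribution with $v > 2$ degrees of freedom has covariance $v/(v-2)\,\boldsymbol{\Sigma}$. The data, sizes, and sample count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3                                       # hypothetical sizes
X = rng.normal(size=(n, p))
omega = np.diag(rng.uniform(0.5, 2.0, size=n))     # hypothetical diagonal weight matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n) / np.sqrt(np.diag(omega))

# Quantities appearing in (A3) and (A8)
XtwX = X.T @ omega @ X
XtwX_inv = np.linalg.inv(XtwX)
b_hat = XtwX_inv @ X.T @ omega @ y
resid = y - X @ b_hat
v = n - p
s2 = resid @ omega @ resid / v
Sigma = s2 * XtwX_inv

# Composition sampling from (A4):
#   sigma^2 | y        ~ v s^2 / chi^2_v
#   beta | sigma, y    ~ N(b_hat, sigma^2 (X^T omega X)^{-1})
m = 200_000
sigma2 = v * s2 / rng.chisquare(v, size=m)
L = np.linalg.cholesky(XtwX_inv)
z = rng.normal(size=(m, p))
beta_samples = b_hat + np.sqrt(sigma2)[:, None] * (z @ L.T)

empirical_cov = np.cov(beta_samples, rowvar=False)
theoretical_cov = v / (v - 2) * Sigma              # covariance of the t-distribution in (A8)
print(np.round(empirical_cov, 4))
print(np.round(theoretical_cov, 4))
```

The two matrices should agree up to Monte Carlo error, which is consistent with the marginal posterior of $\boldsymbol{\beta}$ being the t-distribution in (A8).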
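Appendix B: Equivalence of Credible and Confidence Intervals for Consistent Prior Information
In Bayesian analysis, for a linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \mathbf{C})$, where the variance-covariance matrix $\mathbf{C}$ is known, the conjugate prior distribution of parameters $\boldsymbol{\beta}$ is assumed as

$$
p(\boldsymbol{\beta}) \sim N_p(\boldsymbol{\beta}_p, \mathbf{C}_p),
\tag{B1}
$$

and the likelihood function as

$$
p(\mathbf{y} \mid \boldsymbol{\beta}) = \frac{1}{(2\pi)^{n/2}\,|\mathbf{C}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T \mathbf{C}^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right].
\tag{B2}
$$

Then, based on Bayes' theorem $p(\boldsymbol{\beta} \mid \mathbf{y}) = p(\mathbf{y} \mid \boldsymbol{\beta})\,p(\boldsymbol{\beta}) / p(\mathbf{y})$, the posterior distribution of parameters is

$$
\begin{aligned}
p(\boldsymbol{\beta} \mid \mathbf{y})
&\propto \exp\left\{-\frac{1}{2}\left[(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T \mathbf{C}^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + (\boldsymbol{\beta} - \boldsymbol{\beta}_p)^T \mathbf{C}_p^{-1} (\boldsymbol{\beta} - \boldsymbol{\beta}_p)\right]\right\} \\
&= \exp\left\{-\frac{1}{2}\left[\mathbf{y}^T\mathbf{C}^{-1}\mathbf{y} - \mathbf{y}^T\mathbf{C}^{-1}\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{C}^{-1}\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X}\boldsymbol{\beta} \right.\right. \\
&\qquad\qquad \left.\left. +\, \boldsymbol{\beta}^T\mathbf{C}_p^{-1}\boldsymbol{\beta} - \boldsymbol{\beta}^T\mathbf{C}_p^{-1}\boldsymbol{\beta}_p - \boldsymbol{\beta}_p^T\mathbf{C}_p^{-1}\boldsymbol{\beta} + \boldsymbol{\beta}_p^T\mathbf{C}_p^{-1}\boldsymbol{\beta}_p\right]\right\} \\
&= \exp\left\{-\frac{1}{2}\left[\boldsymbol{\beta}^T\left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X} + \mathbf{C}_p^{-1}\right)\boldsymbol{\beta} - \left(\mathbf{y}^T\mathbf{C}^{-1}\mathbf{X} + \boldsymbol{\beta}_p^T\mathbf{C}_p^{-1}\right)\boldsymbol{\beta} \right.\right. \\
&\qquad\qquad \left.\left. -\, \boldsymbol{\beta}^T\left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{y} + \mathbf{C}_p^{-1}\boldsymbol{\beta}_p\right) + \left(\mathbf{y}^T\mathbf{C}^{-1}\mathbf{y} + \boldsymbol{\beta}_p^T\mathbf{C}_p^{-1}\boldsymbol{\beta}_p\right)\right]\right\}.
\end{aligned}
\tag{B3}
$$
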
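Because the covariance matrices are symmetric, the terms in the square bracket in (B3) can be generalized as

$$
\boldsymbol{\beta}^T\mathbf{A}\boldsymbol{\beta} - \mathbf{B}^T\boldsymbol{\beta} - \boldsymbol{\beta}^T\mathbf{B} + G
= \left(\boldsymbol{\beta} - \mathbf{A}^{-1}\mathbf{B}\right)^T \mathbf{A} \left(\boldsymbol{\beta} - \mathbf{A}^{-1}\mathbf{B}\right) + \left(G - \mathbf{B}^T\mathbf{A}^{-1}\mathbf{B}\right),
\tag{B4}
$$

where $\mathbf{A} = \mathbf{X}^T\mathbf{C}^{-1}\mathbf{X} + \mathbf{C}_p^{-1}$, $\mathbf{B} = \mathbf{X}^T\mathbf{C}^{-1}\mathbf{y} + \mathbf{C}_p^{-1}\boldsymbol{\beta}_p$, and $G = \mathbf{y}^T\mathbf{C}^{-1}\mathbf{y} + \boldsymbol{\beta}_p^T\mathbf{C}_p^{-1}\boldsymbol{\beta}_p$. Because $G - \mathbf{B}^T\mathbf{A}^{-1}\mathbf{B}$ does not depend on $\boldsymbol{\beta}$ and can be treated as a constant, (B3) can be simplified as

$$
p(\boldsymbol{\beta} \mid \mathbf{y}) \propto \exp\left[-\frac{1}{2}\left(\boldsymbol{\beta} - \boldsymbol{\beta}'_p\right)^T \left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X} + \mathbf{C}_p^{-1}\right) \left(\boldsymbol{\beta} - \boldsymbol{\beta}'_p\right)\right],
\tag{B5}
$$

where $\boldsymbol{\beta}'_p = \left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X} + \mathbf{C}_p^{-1}\right)^{-1}\left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{y} + \mathbf{C}_p^{-1}\boldsymbol{\beta}_p\right)$. Thus the posterior distribution $p(\boldsymbol{\beta} \mid \mathbf{y})$ is

$$
p(\boldsymbol{\beta} \mid \mathbf{y}) \sim N_p\left(\boldsymbol{\beta}'_p, \mathbf{C}'_p\right),
\tag{B6}
$$

with

$$
\mathbf{C}'_p = \left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X} + \mathbf{C}_p^{-1}\right)^{-1}.
\tag{B7}
$$
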
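The posterior mean and covariance in (B5) to (B7), together with the completing-the-square step (B4), can be sketched in a few lines. The function name `gaussian_posterior` and all numerical inputs below are hypothetical; the example assumes NumPy and is meant only to illustrate the algebra.

```python
import numpy as np

def gaussian_posterior(X, y, C, beta_prior, C_prior):
    """Return the posterior mean (B5), covariance (B7), and matrix A of (B4)."""
    Cp_inv = np.linalg.inv(C_prior)
    A = X.T @ np.linalg.solve(C, X) + Cp_inv
    B = X.T @ np.linalg.solve(C, y) + Cp_inv @ beta_prior
    return np.linalg.solve(A, B), np.linalg.inv(A), A

rng = np.random.default_rng(5)
n, p = 8, 2                                        # hypothetical sizes
X = rng.normal(size=(n, p))
C = np.diag(rng.uniform(0.5, 2.0, size=n))         # known observation covariance
y = rng.normal(size=n)
beta_prior = np.array([0.3, -0.2])                 # prior mean beta_p
C_prior = np.array([[1.0, 0.2], [0.2, 0.5]])       # prior covariance C_p

post_mean, post_cov, A = gaussian_posterior(X, y, C, beta_prior, C_prior)

def bracket(beta):
    """The square-bracket term of (B3) evaluated at beta."""
    r = y - X @ beta
    d = beta - beta_prior
    return r @ np.linalg.solve(C, r) + d @ np.linalg.solve(C_prior, d)

# By (B4), bracket(beta) - bracket(post_mean) equals the quadratic form in (B5).
beta_test = rng.normal(size=p)
d = beta_test - post_mean
gap = bracket(beta_test) - bracket(post_mean)
print(gap, d @ A @ d)
```

The two printed values should coincide, which confirms that the bracket in (B3) differs from the quadratic form in (B5) only by a constant.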
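Correspondingly, the linear prediction function $g(\boldsymbol{\beta})$ (i.e., $g(\boldsymbol{\beta}) = \mathbf{Z}\boldsymbol{\beta}$) has

$$
g(\boldsymbol{\beta}) \sim N_p\left(g(\boldsymbol{\beta}'_p),\ \mathbf{Z}^T\left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X} + \mathbf{C}_p^{-1}\right)^{-1}\mathbf{Z}\right),
\tag{B8}
$$

and its $(1 - \alpha) \times 100\%$ credible interval is

$$
g(\boldsymbol{\beta}'_p) \pm z_{1-\alpha/2}\left[\mathbf{Z}^T\left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X} + \mathbf{C}_p^{-1}\right)^{-1}\mathbf{Z}\right]^{1/2}.
\tag{B9}
$$

A similar procedure can be applied to linearized nonlinear models, as shown in McLaughlin and Townley [1996].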
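For intuition, a scalar special case of the posterior and credible interval may help. The worked example below assumes a single parameter observed once directly ($p = n = 1$, $X = 1$, $Z = 1$), with $C = \sigma^2$ and $C_p = \sigma_p^2$; the numerical values are hypothetical and chosen only to keep the arithmetic simple.

```latex
\[
\begin{aligned}
\beta'_p &= \frac{y/\sigma^2 + \beta_p/\sigma_p^2}{1/\sigma^2 + 1/\sigma_p^2}, \qquad
C'_p = \left(\frac{1}{\sigma^2} + \frac{1}{\sigma_p^2}\right)^{-1}, \\
\text{e.g. } y = 10,\ \sigma^2 = 4,\ \beta_p = 6,\ \sigma_p^2 = 4
&\;\Rightarrow\; \beta'_p = \frac{10/4 + 6/4}{1/4 + 1/4} = 8, \qquad C'_p = 2, \\
\text{95\% credible interval:}\quad & 8 \pm 1.96\sqrt{2} \approx 8 \pm 2.77 .
\end{aligned}
\]
```

In this case the posterior mean is the precision-weighted average of the observation and the prior mean, and the posterior variance (2) is smaller than either of the two input variances (4).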
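In classical regression, if the observations $\mathbf{y}$ can be simulated by a linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \mathbf{C})$, and the prior information $\mathbf{y}_\beta$ on parameters $\boldsymbol{\beta}$ is available and represented as $\mathbf{y}_\beta = \boldsymbol{\beta} + \boldsymbol{\varepsilon}_\beta$ with errors $\boldsymbol{\varepsilon}_\beta \sim N_{n_{pri}}(\mathbf{0}, \mathbf{C}_\beta)$, then combining the two kinds of data information gives the augmented linear model [Schweppe, 1973, p.104; Cooley, 1983; Hill and Tiedeman, 2007],

$$
\tilde{\mathbf{y}} = \tilde{\mathbf{X}}\boldsymbol{\beta} + \tilde{\boldsymbol{\varepsilon}},
\tag{B10}
$$

where $\tilde{\mathbf{y}} = \begin{bmatrix} \mathbf{y} \\ \mathbf{y}_\beta \end{bmatrix}$ is a vector of $n$ observations $\mathbf{y}$ and $n_{pri}$ items of prior information $\mathbf{y}_\beta$; $\boldsymbol{\beta}$ is a vector of $p$ unknown true model parameters; $\tilde{\mathbf{X}} = \begin{bmatrix} \mathbf{X} \\ \mathbf{I} \end{bmatrix}$ is a $(n + n_{pri}) \times p$ coefficient matrix with $\mathbf{X}$ representing the sensitivity of observations to parameters and the identity matrix $\mathbf{I}$ representing the sensitivity of prior information to parameters; and $\tilde{\boldsymbol{\varepsilon}} = \begin{bmatrix} \boldsymbol{\varepsilon} \\ \boldsymbol{\varepsilon}_\beta \end{bmatrix}$ is a vector of $(n + n_{pri})$ errors with $\boldsymbol{\varepsilon}$ representing errors of observations and $\boldsymbol{\varepsilon}_\beta$ representing errors of prior information. Assuming that the errors in prior information on the parameters are uncorrelated with the errors in observations, we get

$$
\tilde{\boldsymbol{\varepsilon}} \sim N_{(n + n_{pri})}(\mathbf{0}, \tilde{\mathbf{C}}),
\quad \text{with } \tilde{\mathbf{C}} = \begin{bmatrix} \mathbf{C} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_\beta \end{bmatrix},
\tag{B11}
$$

where $E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T) = \mathbf{C}$, $E(\boldsymbol{\varepsilon}_\beta\boldsymbol{\varepsilon}_\beta^T) = \mathbf{C}_\beta$, and $E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}_\beta^T) = \mathbf{0}$.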
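For the linear model defined in equation (B10), linear regression parameter estimates $\hat{\mathbf{b}}$ are obtained by minimizing the generalized least-squares objective function $S(\mathbf{b}) = (\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{b})^T \tilde{\mathbf{C}}^{-1} (\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{b})$ with respect to $\mathbf{b}$, where $\mathbf{b}$ is a general vector of model parameters, i.e.,

$$
\hat{\mathbf{b}} = \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{C}}^{-1}\tilde{\mathbf{X}}\right)^{-1}\tilde{\mathbf{X}}^T\tilde{\mathbf{C}}^{-1}\tilde{\mathbf{y}}
= \left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X} + \mathbf{C}_\beta^{-1}\right)^{-1}\left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{y} + \mathbf{C}_\beta^{-1}\mathbf{y}_\beta\right)
$$

with specifying $\tilde{\mathbf{X}}$, $\tilde{\mathbf{C}}$, and $\tilde{\mathbf{y}}$. The estimates follow a multivariate Gaussian distribution [Toutenburg, 1982, p.52],

$$
\hat{\mathbf{b}} \sim N_p\left(\boldsymbol{\beta}, \left(\tilde{\mathbf{X}}^T\tilde{\mathbf{C}}^{-1}\tilde{\mathbf{X}}\right)^{-1}\right)
\propto \exp\left[-\frac{1}{2}\left(\hat{\mathbf{b}} - \boldsymbol{\beta}\right)^T \left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X} + \mathbf{C}_\beta^{-1}\right) \left(\hat{\mathbf{b}} - \boldsymbol{\beta}\right)\right],
\tag{B12}
$$
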
which is equivalent to the posterior distribution of $\boldsymbol{\beta}$ in (B5) when the prior information $\mathbf{y}_\beta$ is equal to the mean of the prior distribution $\boldsymbol{\beta}_p$, and the covariance of errors of the prior information $\mathbf{C}_\beta$ is equal to the covariance of the prior distribution $\mathbf{C}_p$. Under these conditions, the confidence intervals based on the distribution of $\hat{\mathbf{b}}$ (B12) are numerically identical to the credible intervals based on the posterior distribution of $\boldsymbol{\beta}$ (B5). As in (B9), the $(1 - \alpha) \times 100\%$ confidence interval for linear model prediction, $g(\boldsymbol{\beta}) = \mathbf{Z}\boldsymbol{\beta}$, is

$$
g(\hat{\mathbf{b}}) \pm z_{1-\alpha/2}\left[\mathbf{Z}^T\left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X} + \mathbf{C}_\beta^{-1}\right)^{-1}\mathbf{Z}\right]^{1/2}.
\tag{B13}
$$

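The equivalence stated above can be illustrated numerically: when the prior information equals the prior mean and its error covariance equals the prior covariance, generalized least squares on the augmented model reproduces the Bayesian posterior mean, covariance, and interval half-width. The sketch below assumes NumPy and uses hypothetical data, prior values, and prediction vector `Z`; the normal quantile comes from the Python standard library.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
n, p = 10, 3                                        # hypothetical sizes
X = rng.normal(size=(n, p))
C = np.diag(rng.uniform(0.5, 1.5, size=n))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.multivariate_normal(np.zeros(n), C)

beta_p = np.array([1.5, -0.5, 0.0])                 # prior mean, also used as y_beta
C_p = np.diag([0.4, 0.9, 0.25])                     # prior covariance, also used as C_beta

# Bayesian route: posterior mean and covariance for the conjugate Gaussian prior
A = X.T @ np.linalg.solve(C, X) + np.linalg.inv(C_p)
post_mean = np.linalg.solve(A, X.T @ np.linalg.solve(C, y) + np.linalg.solve(C_p, beta_p))
post_cov = np.linalg.inv(A)

# Regression route: generalized least squares on the augmented model (B10)-(B11)
X_aug = np.vstack([X, np.eye(p)])
y_aug = np.concatenate([y, beta_p])
C_aug = np.block([[C, np.zeros((n, p))], [np.zeros((p, n)), C_p]])
F = X_aug.T @ np.linalg.solve(C_aug, X_aug)
b_hat = np.linalg.solve(F, X_aug.T @ np.linalg.solve(C_aug, y_aug))
b_cov = np.linalg.inv(F)

# Interval half-widths for a linear prediction with a hypothetical vector Z
Z = np.array([1.0, 2.0, -1.0])
z_quantile = NormalDist().inv_cdf(0.975)            # z_{1 - alpha/2} for alpha = 0.05
half_width_credible = z_quantile * np.sqrt(Z @ post_cov @ Z)
half_width_confidence = z_quantile * np.sqrt(Z @ b_cov @ Z)
print(post_mean, b_hat)
print(half_width_credible, half_width_confidence)
```

Both routes should print the same estimates and half-widths up to rounding error.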
When prior information is included in evaluating parameter uncertainty, the posterior covariance matrix, $\left(\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X} + \mathbf{C}_\beta^{-1}\right)^{-1}$, is the inverse of the sum of two measures of information, one from the data and one from the prior (where the measure of information is the inverse of the covariance matrix, which is positive-semidefinite). This suggests that no matter how much information is brought by the data, the a posteriori covariance matrix would not be greater than the a priori one [Box and Tiao, 1992, p. 17]. A common concern when using prior information and
observations is that they may be conceptually different, for example, due to scale issues. This
draws into question the integrated use of observations and prior information in both regression
and Bayesian methods. During parameter estimation, Hill and Tiedeman [2007, p. 288-289]
suggest putting more emphasis on observation data for the following two reasons: (1) experience
has shown that in many systems observations often can be measured more accurately than prior
information, and (2) the relation between observed and simulated values is usually more direct
than is the relation between prior information and model parameter values. However,
propagation of uncertainty can have different goals than does parameter estimation, and the
meaning of the prior information needs to be considered carefully. Further examination of this
issue is beyond the scope of the present work, which focuses on the comparison of the two
methods, not this difficulty shared by both.
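The statement earlier in this paragraph that the a posteriori covariance is never greater than the a priori covariance can be checked by confirming that their difference is positive-semidefinite. The sketch below assumes NumPy; the prior covariance, the design matrices, and the numbers of observations are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 3
C_prior = np.diag([0.4, 0.9, 0.25])                 # hypothetical prior covariance

def posterior_cov(X, C, C_prior):
    """Inverse of the summed information from data and prior."""
    A = X.T @ np.linalg.solve(C, X) + np.linalg.inv(C_prior)
    return np.linalg.inv(A)

# Smallest eigenvalue of (prior covariance - posterior covariance) for
# increasing amounts of data; it should never be meaningfully negative.
min_eigenvalues = {}
traces = {}
for n in (1, 5, 50):
    X = rng.normal(size=(n, p))
    C = np.eye(n)
    C_post = posterior_cov(X, C, C_prior)
    diff = C_prior - C_post
    diff = 0.5 * (diff + diff.T)                    # symmetrize against rounding error
    min_eigenvalues[n] = float(np.linalg.eigvalsh(diff).min())
    traces[n] = float(np.trace(C_post))

print(min_eigenvalues)
print(traces)
```

With a single observation the smallest eigenvalue is essentially zero, because one observation carries no information in directions orthogonal to its sensitivity vector; the prior variance is retained there but never exceeded.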
Appendix C: Derivation of Linear Credible Intervals for Linear and Nonlinear Models
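According to Berger [1985], if Jeffreys' noninformative prior $p(\boldsymbol{\beta}) \propto \text{constant}$ is considered, the posterior density of parameter $\boldsymbol{\beta}$ is determined solely by data $\mathbf{y}$. Then Bayes' theorem $p(\boldsymbol{\beta} \mid \mathbf{y}) = p(\mathbf{y} \mid \boldsymbol{\beta})\,p(\boldsymbol{\beta}) / p(\mathbf{y})$ can be written as

$$
p(\boldsymbol{\beta} \mid \mathbf{y})
= \frac{c \cdot \exp[\log p(\mathbf{y} \mid \boldsymbol{\beta})]}{\int c \cdot \exp[\log p(\mathbf{y} \mid \boldsymbol{\beta})]\,d\boldsymbol{\beta}}
= \frac{\exp[\log p(\mathbf{y} \mid \boldsymbol{\beta})]}{\int \exp[\log p(\mathbf{y} \mid \boldsymbol{\beta})]\,d\boldsymbol{\beta}},
\tag{C1}
$$

where $c$ denotes the constant value of the prior density.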
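By considering a Taylor series expansion of $\log p(\mathbf{y} \mid \boldsymbol{\beta})$ about $\hat{\mathbf{b}}$ (which maximizes $\log p(\mathbf{y} \mid \boldsymbol{\beta})$) and retaining terms up to the second order, equation (C1) is approximated by

$$
\begin{aligned}
p(\boldsymbol{\beta} \mid \mathbf{y})
&\approx \frac{\exp\left[\log p(\mathbf{y} \mid \hat{\mathbf{b}}) - \frac{1}{2}(\boldsymbol{\beta} - \hat{\mathbf{b}})^T I(\hat{\mathbf{b}})\,(\boldsymbol{\beta} - \hat{\mathbf{b}})\right]}{\int \exp\left[\log p(\mathbf{y} \mid \hat{\mathbf{b}}) - \frac{1}{2}(\boldsymbol{\beta} - \hat{\mathbf{b}})^T I(\hat{\mathbf{b}})\,(\boldsymbol{\beta} - \hat{\mathbf{b}})\right] d\boldsymbol{\beta}} \\
&= \frac{\exp\left[-\frac{1}{2}(\boldsymbol{\beta} - \hat{\mathbf{b}})^T I(\hat{\mathbf{b}})\,(\boldsymbol{\beta} - \hat{\mathbf{b}})\right]}{(2\pi)^{p/2}\,|I(\hat{\mathbf{b}})|^{-1/2}},
\end{aligned}
\tag{C2}
$$

where $I(\hat{\mathbf{b}}) = -\left[\dfrac{\partial^2 \log p(\mathbf{y} \mid \boldsymbol{\beta})}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}^T}\right]_{\boldsymbol{\beta} = \hat{\mathbf{b}}}$, and $p$ is the number of parameters. It follows directly that the posterior density is multivariate normal, i.e., $p(\boldsymbol{\beta} \mid \mathbf{y}) \sim N_p\left(\hat{\mathbf{b}}, [I(\hat{\mathbf{b}})]^{-1}\right)$. For a linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \mathbf{C})$, this distribution is exact, because derivatives of $\log p(\mathbf{y} \mid \boldsymbol{\beta})$ of orders higher than two are zero. Given that $\hat{\mathbf{b}} = (\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{C}^{-1}\mathbf{y}$ and $[I(\hat{\mathbf{b}})]^{-1} = (\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X})^{-1}$, the parameter distribution $p(\boldsymbol{\beta} \mid \mathbf{y}) \sim N_p\left(\hat{\mathbf{b}}, (\mathbf{X}^T\mathbf{C}^{-1}\mathbf{X})^{-1}\right)$ is obtained directly; so is the linear credible interval of a linear prediction $g(\boldsymbol{\beta})$. For a nonlinear model $\mathbf{y} = f(\boldsymbol{\beta}) + \boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \mathbf{C})$, because the higher-order derivatives are not necessarily zero, equation (C2) is an approximation of the posterior density $p(\boldsymbol{\beta} \mid \mathbf{y})$. Following model linearization $f(\mathbf{b}) \approx f(\hat{\mathbf{b}}) + \mathbf{X}_{\hat{\mathbf{b}}}(\mathbf{b} - \hat{\mathbf{b}})$, the posterior density $p(\boldsymbol{\beta} \mid \mathbf{y})$ is approximated by

$$
\begin{aligned}
\boldsymbol{\beta} &\sim N_p\left(\hat{\mathbf{b}}, [I(\hat{\mathbf{b}})]^{-1}\right), \\
\hat{\mathbf{b}} &= \left(\mathbf{X}_{\hat{\mathbf{b}}}^T\mathbf{C}^{-1}\mathbf{X}_{\hat{\mathbf{b}}}\right)^{-1}\mathbf{X}_{\hat{\mathbf{b}}}^T\mathbf{C}^{-1}\mathbf{y}, \\
[I(\hat{\mathbf{b}})]^{-1} &= \left(\mathbf{X}_{\hat{\mathbf{b}}}^T\mathbf{C}^{-1}\mathbf{X}_{\hat{\mathbf{b}}}\right)^{-1}.
\end{aligned}
\tag{C3}
$$
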
If the nonlinear prediction function $g(\mathbf{b})$ also can be linearized by $g(\mathbf{b}) \approx g(\hat{\mathbf{b}}) + \mathbf{Z}_{\hat{\mathbf{b}}}^T(\mathbf{b} - \hat{\mathbf{b}})$, the linear credible interval is the same as the linear confidence interval for nonlinear models.
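The linearized interval for a nonlinear model can be sketched as follows. Everything specific here is an assumption made for illustration: the two-parameter exponential-decay model `f`, the synthetic data, the starting point, and the use of a Gauss-Newton iteration with step halving to locate the estimate. The manuscript does not prescribe this model or solver; the example assumes NumPy.

```python
import numpy as np
from statistics import NormalDist

def f(beta, t):
    """Hypothetical nonlinear model: exponential decay."""
    return beta[0] * np.exp(-beta[1] * t)

def jacobian(beta, t):
    """Sensitivity matrix X_b: derivatives of f with respect to beta."""
    e = np.exp(-beta[1] * t)
    return np.column_stack([e, -beta[0] * t * e])

rng = np.random.default_rng(4)
t = np.linspace(0.0, 4.0, 15)
sd = 0.1
C_inv = np.eye(t.size) / sd**2                      # inverse of the known error covariance
y = f(np.array([5.0, 0.7]), t) + rng.normal(scale=sd, size=t.size)

def objective(b):
    r = y - f(b, t)
    return r @ C_inv @ r

# Gauss-Newton iteration with simple step halving
b = np.array([4.0, 0.5])
for _ in range(50):
    Xb = jacobian(b, t)
    step = np.linalg.solve(Xb.T @ C_inv @ Xb, Xb.T @ C_inv @ (y - f(b, t)))
    lam = 1.0
    while objective(b + lam * step) > objective(b) and lam > 1e-8:
        lam *= 0.5
    b = b + lam * step
    if np.linalg.norm(lam * step) < 1e-10:
        break

b_hat = b
X_hat = jacobian(b_hat, t)
cov = np.linalg.inv(X_hat.T @ C_inv @ X_hat)        # approximate covariance of the estimate

# Linearized 95% interval for a prediction g(b) = f(b, t_new)
t_new = 5.0
g_hat = f(b_hat, t_new)
Z = jacobian(b_hat, np.array([t_new]))[0]           # gradient of g at b_hat
half_width = NormalDist().inv_cdf(0.975) * np.sqrt(Z @ cov @ Z)
interval = (g_hat - half_width, g_hat + half_width)
print(b_hat, interval)
```

Because the same mean and covariance enter both the regression and the Bayesian formulas, the interval printed here can be read either as a linear confidence interval or as a linear credible interval under the noninformative prior.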
Appendix D: Figure of Sensitivity and Residual Analysis for Model 3Z in Complex Groundwater Test Case
[Figure D1, panels (a) to (d). Panels (a) and (c): composite scaled sensitivity versus parameter name (K1, K2, K3, RCH, KRB, LAKERCH, KV, VANI). Panels (b) and (d): weighted residual versus weighted simulated value.]
Figure D1. Composite scaled sensitivity and weighted residuals versus weighted simulated
values for model 3Z calibrated using two ((a) - (b)) and eighteen ((c) - (d)) observations of
streamflow gain. Calibration data include hydraulic heads (o), flows (+), lake stage (*), and
measurements of net lake recharge (Δ). Residuals are mostly positive for weighted simulated values
between 26 and 32, as bounded by the two vertical lines.
Appendix E: Prior Information Used for the Groundwater Model Parameters
Assuming that measurements of net lake recharge (LAKERCH) are always available in practice, prior information on LAKERCH is used in this study for all three groundwater models. Prior information on sixteen hydraulic conductivities (locations are shown in Figure 3(d) of the article) is used only for model INT, because this information about the parameters is known before collecting the data. Because vertical anisotropy (VANI) is insensitive in the calibration of the three models, convergence of model calibration is difficult. To help convergence, prior information on VANI is used for regularization. This prior information is also used for calculating confidence intervals. To make the calculation of credible intervals in MCMC consistent with that of confidence intervals, a consistent prior distribution of VANI is assumed. The details of the prior information for all the parameters in the three models are listed in the following tables, where KRB represents hydraulic conductance of the riverbed; RCH represents recharge rate; KV represents leakance of the confining unit; and K represents hydraulic conductivity.
Table E1: Prior information of model parameters for model HO. U and N stand for uniform and normal distributions.

Parameter    Prior distribution
Noninformative parameters
KRB          $U(10^{2}, 10^{5})$
RCH          $U(10^{-4}, 10^{-2})$
KV           $U(10^{-4}, 1)$
K            $U(10^{-2}, 500)$
Informative parameters
LAKERCH      $N(0.000603, 0.0003)$
VANI         $N(2.5, 0.6)$

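For readers who want to reproduce the MCMC setup, the priors of Table E1 could be encoded as a log-prior function along the following lines. This is a sketch, not the code used in the study: the function name `log_prior` and the dictionary layout are hypothetical, and the second argument of N is assumed here to be a standard deviation, which the table does not state explicitly.

```python
import math

# Bounds of the uniform priors for model HO, as listed in Table E1
UNIFORM_BOUNDS = {
    "KRB": (1e2, 1e5),
    "RCH": (1e-4, 1e-2),
    "KV": (1e-4, 1.0),
    "K": (1e-2, 500.0),
}

# Normal priors as (mean, spread); the spread is ASSUMED to be a standard deviation
NORMAL_PRIORS = {
    "LAKERCH": (0.000603, 0.0003),
    "VANI": (2.5, 0.6),
}

def log_prior(params):
    """Log prior density of a dict of HO parameters; -inf outside the uniform bounds."""
    total = 0.0
    for name, (low, high) in UNIFORM_BOUNDS.items():
        value = params[name]
        if not (low <= value <= high):
            return -math.inf
        total -= math.log(high - low)
    for name, (mean, sd) in NORMAL_PRIORS.items():
        value = params[name]
        total += -0.5 * ((value - mean) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))
    return total

# Hypothetical parameter set placed at the normal prior means
example = {"KRB": 1e3, "RCH": 1e-3, "KV": 0.01, "K": 50.0, "LAKERCH": 0.000603, "VANI": 2.5}
print(log_prior(example))
```

Within the uniform bounds the log-prior changes only through LAKERCH and VANI, which mirrors the split between noninformative and informative parameters in the table.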
Table E2: Prior information of model parameters for model 3Z. U and N stand for uniform and normal distributions.

Parameter    Prior distribution
Noninformative parameters
KRB          $U(10^{2}, 10^{5})$
RCH          $U(10^{-4}, 10^{-2})$
KV           $U(10^{-4}, 1)$
K1           $U(10^{-2}, 500)$
K2           $U(10^{-2}, 10^{3})$
K3           $U(10^{-2}, 500)$
Informative parameters
LAKERCH      $N(0.000603, 0.0003)$
VANI         $N(2.5, 0.6)$

Table E3: Prior information of model parameters for model INT. U and N stand for uniform and normal distributions.

Parameter    Prior distribution
Noninformative parameters
KRB1         $U(10^{2}, 10^{5})$
KRB2         $U(10^{2}, 10^{6})$
KRB3         $U(10^{2}, 10^{5})$
RCH          $U(10^{-4}, 10^{-2})$
KV           $U(10^{-4}, 1)$
KA           $U(10^{-2}, 500)$
KB           $U(10^{-2}, 500)$
KC           $U(10^{-2}, 500)$
Informative parameters
K1           $N(132, 26.4)$
K2           $N(100, 20)$
K3           $N(53, 10.6)$
K4           $N(104, 20.8)$
K10          $N(90, 18.0)$
K14          $N(42, 8.4)$
K15          $N(60, 12.0)$
K18          $N(54, 10.8)$
K19          $N(114, 22.8)$
K20          $N(82, 16.4)$
K21          $N(27, 5.4)$
K22          $N(52, 10.4)$
K23          $N(55, 11.0)$
K24          $N(67, 13.4)$
K25          $N(64, 12.8)$
K26          $N(57, 11.4)$
LAKERCH      $N(0.000603, 0.0003)$
VANI         $N(2.5, 0.6)$

Take model HO as an example. Figure E1 indicates that the chosen uniform distributions for
the four model parameters are flat and have no appreciable influence on the posterior distributions,
suggesting that they act as noninformative priors.
[Figure E1, six panels of probability density (PDF) versus parameter value, each showing the prior and the posterior: LAKERCH (normal prior), KRB (uniform prior), RCH (uniform prior), KV (uniform prior), K (uniform prior), and VANI (normal prior).]
Figure E1: The prior and posterior probabilities for HO model parameters. The prior information
is plotted based on Table E1.
References
Berger, J. O. (1985), Statistical decision theory and Bayesian analysis, 2nd edition, Springer-Verlag, New York, 641pp.
Box, G. E. P., and G. C. Tiao (1992), Bayesian inference in statistical analysis, Wiley Classics Library edition, 588pp.
Cooley, R. L. (1983), Incorporation of prior information on parameters into nonlinear regression groundwater flow models: 2. Applications, Water Resour. Res., 19(3), 662-676.
Hill, M. C., and C. R. Tiedeman (2007), Effective calibration of ground water models, with analysis of data, sensitivities, predictions, and uncertainty, John Wiley & Sons, New York, 480pp.
McLaughlin, D., and L. R. Townley (1996), A reassessment of the groundwater inverse problem, Water Resour. Res., 32(5), 1131-1161.
Schweppe, F. C. (1973), Uncertain dynamic systems, Prentice-Hall, Englewood Cliffs, N.J.
Toutenburg, H. (1982), Prior information in linear models, Wiley series in probability and mathematical statistics, 215pp.