Module 2: Nonlinear Regression
CHEE824 Winter 2004 – J. McLellan

Outline – single response
• Notation
• Assumptions
• Least squares estimation – Gauss-Newton iteration, convergence criteria, numerical optimization
• Diagnostics
• Properties of estimators and inference
• Other estimation formulations – maximum likelihood and Bayesian estimators
• Dealing with differential equation models
• And then on to multi-response…
Notation

Model:

Y_i = f(x_i, θ) + ε_i

where ε_i is the random noise component, x_i contains the explanatory variables (the ith run conditions), and θ is a p-dimensional vector of parameters.

With n experimental runs, the model equations stack into the expectation surface

η(θ) = [f(x_1, θ), f(x_2, θ), …, f(x_n, θ)]^T

and the nonlinear regression model is

Y = η(θ) + ε

Model specification involves the form of the equation and the parameterization.
Example #1 (Bates and Watts, 1988)

Rumford data –
– Cooling experiment – grind a cannon barrel with a blunt bore, and then monitor the temperature while it cools
» Newton's law of cooling – differential equation with an exponential solution
» Independent variable is t (time)
» Ambient temperature was 60 °F
» Model equation:

f(t, θ) = 60 + 70 e^{-θ t}

» 1st-order dynamic decay
Rumford Example
• Consider two observations – a 2-dimensional observation space
» Observations at t = 4 and t = 41 min

Parameter Estimation – Linear Regression Case

[Figure: geometry of linear least squares – the observation vector y, the approximating observation vector ŷ = Xβ̂, and the residual vector joining them; ŷ lies on the expectation surface, the plane Xβ spanned by the columns of X.]
Parameter Estimation – Nonlinear Regression Case

[Figure: geometry of nonlinear least squares – the observation vector y, the approximating observation vector ŷ = η(θ̂), and the residual vector joining them; ŷ lies on the curved expectation surface η(θ).]
Parameter Estimation – Gauss-Newton Iteration

Least squares estimation – minimize

S(θ) = ||y - η(θ)||² = e^T e

Iterative procedure consisting of:
1. Linearization about the current estimate of the parameters
2. Solution of the linear(ized) regression problem to obtain the next parameter estimate
3. Iteration until a convergence criterion is satisfied
Linearization about a Nominal Parameter Vector

Linearize the expectation function η(θ) in terms of the parameter vector θ about a nominal vector θ_0:

η(θ) ≈ η(θ_0) + V_0 (θ - θ_0) = η(θ_0) + V_0 δ

where V_0 is the sensitivity matrix
– the Jacobian of the expectation function
– contains first-order sensitivity information

V_0 = ∂η(θ)/∂θ^T |_{θ_0} =
[ ∂f(x_1, θ)/∂θ_1  …  ∂f(x_1, θ)/∂θ_p ]
[        ⋮                   ⋮         ]
[ ∂f(x_n, θ)/∂θ_1  …  ∂f(x_n, θ)/∂θ_p ]
evaluated at θ_0.
Parameter Estimation – Gauss-Newton Iteration

Iterative procedure consisting of:
1. Linearization about the current estimate of the parameters:

y - η(θ^(i)) ≈ V^(i) δ^(i+1)

2. Solution of the linearized regression problem to obtain the next parameter update:

δ^(i+1) = (V^(i)T V^(i))^{-1} V^(i)T (y - η(θ^(i)))

3. Iteration until a convergence criterion is satisfied – for example,

||θ^(i+1) - θ^(i)|| ≤ tol
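The following is a minimal sketch of the Gauss-Newton iteration above, written for the Rumford cooling model f(t, θ) = 60 + 70 e^{-θ t}. The time and temperature arrays are illustrative placeholders (the actual Rumford data set is not reproduced in these notes), and the helper names are my own.

```python
import numpy as np

def gauss_newton(f, jac, theta0, t, y, tol=1e-8, max_iter=50):
    """Basic Gauss-Newton iteration: minimize S(theta) = ||y - f(t, theta)||^2."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        r = y - f(t, theta)                # residual vector y - eta(theta)
        V = jac(t, theta)                  # n x p sensitivity (Jacobian) matrix
        delta, *_ = np.linalg.lstsq(V, r, rcond=None)  # solves (V^T V) delta = V^T r
        theta = theta + delta
        if np.linalg.norm(delta) < tol * (1.0 + np.linalg.norm(theta)):
            break
    return theta

# Rumford cooling model f(t, theta) = 60 + 70*exp(-theta*t), one parameter
f = lambda t, th: 60.0 + 70.0 * np.exp(-th[0] * t)
jac = lambda t, th: (-70.0 * t * np.exp(-th[0] * t)).reshape(-1, 1)

# illustrative (made-up) times and temperatures, not the actual Rumford data
t = np.array([4.0, 7.0, 12.0, 24.0, 41.0])
y = np.array([126.0, 122.0, 114.0, 98.0, 81.0])
theta_hat = gauss_newton(f, jac, theta0=[0.1], t=t, y=y)
```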
Parameter Estimation – Nonlinear Regression Case

[Figure: the observation vector y and the tangent plane approximation – near the current iterate, the expectation surface is approximated by the tangent plane η(θ^(i)) + V^(i) δ^(i+1).]
Quality of the Linear Approximation

… depends on two components:
1. The degree to which the tangent plane provides a good approximation to the expectation surface
 – the planar assumption
 – related to intrinsic nonlinearity
2. The uniformity of the coordinates on the expectation surface – uniform coordinates
 – the linearization implies a uniform coordinate system on the tangent plane approximation – equal changes in a given parameter produce equal-sized increments on the tangent plane
 – equal-sized increments in a given parameter may map to unequal-sized increments on the expectation surface
Rumford Example
• Consider two observations – a 2-dimensional observation space
» Observations at t = 4 and t = 41 min

[Figure: expectation curve traced for θ from 0 to 10, with θ changed in increments of 0.025, together with the tangent plane approximation – note the non-uniformity of the θ coordinates on the expectation curve.]
Rumford Example
• Model function f(t, θ) = 60 + 70 e^{-θ t}
• Dataset consists of 13 observations
• Exercise – what is the sensitivity matrix?
» What are its dimensions? (see the sketch below)
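A possible answer, sketched in code: with 13 observations and a single parameter, the sensitivity matrix is 13 × 1, with entries ∂f/∂θ = -70 t_i e^{-θ t_i}. The observation times below are illustrative only.

```python
import numpy as np

def rumford_sensitivity(t, theta):
    """Sensitivity (Jacobian) matrix for f(t, theta) = 60 + 70*exp(-theta*t).
    With 13 observation times and one parameter, V has shape (13, 1)."""
    return (-70.0 * t * np.exp(-theta * t)).reshape(-1, 1)

t = np.linspace(4.0, 41.0, 13)      # 13 illustrative observation times (not the actual data)
V = rumford_sensitivity(t, theta=0.05)
print(V.shape)                       # (13, 1)
```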
Rumford Example – Tangent Approximation
• At θ = 0.05:

[Figure: expectation curve and tangent plane approximation at θ = 0.05 – note the non-uniformity of the coordinates on the expectation curve versus the uniformity of the coordinates on the tangent plane.]

• At θ = 0.7:

[Figure: expectation curve and tangent plane approximation at θ = 0.7.]
Parameter Estimation – Gauss-Newton Iteration

Parameter estimate after the jth iteration:

θ^(j) = θ^(j-1) + δ^(j)

Convergence
– can be declared by looking at:
» relative progress in the parameter estimates:

||δ^(i+1)|| / ||θ^(i)|| ≤ tol

» relative progress in reducing the sum of squares function:

|S(θ^(i+1)) - S(θ^(i))| / S(θ^(i)) ≤ tol

» a combination of both progress in sum of squares reduction and progress in the parameter estimates
Parameter Estimation – Gauss-Newton Iteration

Convergence
– the relative change criteria on the sum of squares or on the parameter estimates terminate on lack of progress, rather than on convergence (Bates and Watts, 1988)
– alternative – due to Bates and Watts, termed the relative offset criterion
» we will have converged to the true optimum (the least squares estimates) if the residual vector e = y - η(θ) is orthogonal to the nonlinear expectation surface, and in particular to its tangent plane approximation, at the parameter estimates
» if we haven't converged, the residual vector won't necessarily be orthogonal to the tangent plane at the current parameter iterate
» declare convergence by comparing the component of the residual vector lying on the tangent plane to the component orthogonal to the tangent plane – if the component on the tangent plane is small, then we are close to orthogonality → convergence
» note also that after each iteration, the residual vector is orthogonal to the tangent plane computed at the previous parameter iterate (where the linearization is conducted), and not necessarily to the tangent plane and expectation surface at the most recently computed parameter estimate
» with Q_1 spanning the tangent plane and Q_2 its orthogonal complement (from a QR decomposition of V), the relative offset compares

||Q_1^T (y - η(θ^(i)))|| / √p   versus   ||Q_2^T (y - η(θ^(i)))|| / √(n - p)
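A minimal sketch of this check, assuming the Q_1/Q_2 split is taken from a full QR decomposition of the sensitivity matrix; the function name and the suggested tolerance are illustrative.

```python
import numpy as np

def relative_offset(V, resid):
    """Split the residual into tangent-plane and orthogonal components using a full QR
    of the sensitivity matrix V (n x p), and compare their scaled lengths."""
    n, p = V.shape
    Q, _ = np.linalg.qr(V, mode='complete')    # first p columns of Q span the tangent plane
    tangential = Q[:, :p].T @ resid
    orthogonal = Q[:, p:].T @ resid
    return (np.linalg.norm(tangential) / np.sqrt(p)) / (np.linalg.norm(orthogonal) / np.sqrt(n - p))

# declare convergence when relative_offset(V, y - eta(theta)) drops below a small tolerance
# (a value on the order of 0.001 is commonly used)
```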
Computational Issues in Gauss-Newton Iteration

The Gauss-Newton iteration can be subject to poor numerical conditioning as the linearization is recomputed at new parameter iterates
» conditioning problems arise in the inversion of V^T V
» solution – use a decomposition technique
• QR decomposition
• Singular Value Decomposition (SVD)
» decomposition techniques will accommodate changes in the rank of the Jacobian (sensitivity) matrix V
QR Decomposition

An n × p matrix V takes vectors from a p-dimensional space into an n-dimensional space.

[Figure: the mapping V from a p-dimensional domain space M (e.g., p = 2) into an n-dimensional range space N (e.g., n = 3).]
QR Decomposition
• The columns of the matrix V (viewed as a linear mapping) are the images of the basis vectors for the domain space (M), expressed in the basis of the range space (N)
• If M is a p-dimensional space and N is an n-dimensional space (with p < n), then V defines a p-dimensional linear subspace in N as long as V is of full rank
– Think of our expectation plane in the observation space for the linear regression case – the observation space is n-dimensional, while the expectation plane is p-dimensional, where p is the number of parameters
• We can find a new basis for the range space (N) so that the first p basis vectors span the range of the mapping V, and the remaining n-p basis vectors are orthogonal to the range space of V
QR Decomposition
• In the new range space basis, the mapping will have zero elements in the last n-p entries of each image vector, since the last n-p basis vectors are orthogonal to the range of V
• By construction, this expresses V as an orthogonal matrix times an upper-triangular matrix
• This is a QR decomposition:

V = QR = [q_1 q_2 … q_n] [R_1]
                         [ 0 ]

where R_1 is p × p and upper-triangular.
QR Decomposition
• Example – linear regression with

X = [ 1  -1 ]
    [ 1   0 ]
    [ 1   1 ]

[Figure: the expectation plane spanned by the columns of X, (1, 1, 1)^T and (-1, 0, 1)^T, drawn in the 3-dimensional observation space with axes y_1, y_2, y_3 and parameter directions β_1, β_2.]

Perform the QR decomposition X = QR.
QR Decomposition
• In the new basis, the expectation plane becomes

X̃ = Q^T X = R = [ -1.7321    0     ]
                [    0     -1.4142 ]
                [    0        0    ]

so the two columns of X map to the coordinate vectors (-1.7321, 0, 0)^T and (0, -1.4142, 0)^T (the signs depend on the QR convention used).

[Figure: the expectation plane redrawn in the new coordinates z_1, z_2, z_3, with the parameter directions β_1, β_2 lying in the z_1–z_2 plane.]
QR Decomposition
• The new basis for the range space is given by the columns of Q, e.g.

Q = [ -0.5774   0.7071   0.4082 ]
    [ -0.5774     0     -0.8165 ]
    [ -0.5774  -0.7071   0.4082 ]

(up to the sign convention of the decomposition)

Visualize the new basis vectors for the observation space relative to the original basis:

[Figure: the original axes y_1, y_2, y_3 together with the new basis vectors q_1, q_2, q_3; z_1 is distance along q_1, z_2 is distance along q_2, z_3 is distance along q_3.]
QR Decomposition
• In the new coordinates,

[Figure: the image vectors (-1.7321, 0, 0)^T and (0, -1.4142, 0)^T plotted on the z_1, z_2, z_3 axes; z_1 is distance along q_1, z_2 is distance along q_2, z_3 is distance along q_3.]
QR Decomposition

There are various ways to compute a QR decomposition:
– Gram-Schmidt orthogonalization – sequential orthogonalization
– Householder transformations – a sequence of reflections
QR Decompositions and Parameter Estimation

How does QR decomposition aid parameter estimation?
– QR decomposition will identify the effective rank of the estimation problem through the process of computing the decomposition
» the number of vectors spanning the range space of V is the effective dimension of the estimation problem
» if the dimension changes with successive linearizations, the QR decomposition will track this change
» reformulating the estimation problem using a QR decomposition improves the numerical conditioning and ease of solution of the problem
» over-constrained problem: e.g., for the linear regression case, find β to come as close as possible to satisfying

Y = Xβ,   X = QR,   so that   Q^T Y = Rβ = [R_1] β
                                           [ 0 ]
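A minimal sketch of solving the over-constrained problem through the thin QR factors, using the 3 × 2 matrix X from the earlier example; the response values are illustrative.

```python
import numpy as np
from scipy.linalg import solve_triangular

# Solve Y ≈ X beta via QR: R1 beta = Q1^T Y, solved by back-substitution
X = np.array([[1.0, -1.0],
              [1.0,  0.0],
              [1.0,  1.0]])
Y = np.array([2.0, 3.0, 5.0])              # illustrative response values

Q1, R1 = np.linalg.qr(X, mode='reduced')   # thin QR: Q1 is 3x2, R1 is 2x2 upper-triangular
beta_hat = solve_triangular(R1, Q1.T @ Y)  # sequential (back-substitution) solution
```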
QR Decompositions and Parameter Estimation
• R_1 is upper-triangular, and so the parameter estimates can be obtained sequentially by back-substitution
• The Gauss-Newton iteration follows the same pattern
» perform a QR decomposition on each V^(i)
• QR decomposition also plays an important role in understanding nonlinearity
» look at the second-derivative vectors and partition them into components lying in the tangent plane (associated with tangential curvature) and components lying orthogonal to the tangent plane (associated with intrinsic curvature)
» QR decomposition can be used to construct this partitioning
• the first p basis vectors span the tangent plane, and the remaining vectors are orthogonal to it
Singular Value Decomposition
• Singular value decompositions (SVDs) are similar to eigenvector decompositions for matrices
• SVD:

X = U Σ V^T

where
» U is the "output rotation matrix"
» V is the "input rotation matrix" (please don't confuse it with the Jacobian!)
» Σ is a diagonal matrix of singular values
Singular Value Decomposition
• Singular values:

σ_i = √( λ_i(X^T X) )

i.e., the positive square roots of the eigenvalues of X^T X, which is square (p × p, where p is the number of parameters)
• Input singular vectors form the columns of V, and are the eigenvectors of X^T X
• Output singular vectors form the columns of U, and are the eigenvectors of X X^T
• One perspective – find new bases for the input space (parameter space) and output space (observation space) in which X becomes a diagonal matrix – it then only performs scaling, no rotation
• For parameter estimation problems, U will be n × n, V will be p × p, and Σ will be n × p
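A quick numerical check of the relationship σ_i = √(λ_i(X^T X)), using the same 3 × 2 matrix as in the QR example; the code is only a sketch of the idea.

```python
import numpy as np

X = np.array([[1.0, -1.0],
              [1.0,  0.0],
              [1.0,  1.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U is 3x3, Vt is 2x2, s holds the singular values
eigvals = np.linalg.eigvalsh(X.T @ X)             # eigenvalues of X^T X, in ascending order
print(s)                                          # [1.7321, 1.4142]
print(np.sqrt(eigvals[::-1]))                     # matches the singular values
```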
SVD and Parameter Estimation
• SVD will accommodate the effective rank of the estimation problem, and can track changes in the rank of the problem
» recent work tries to alter the dimension of the problem using SVD information
• SVD can improve the numerical conditioning and ease of solution of the problem
Other Numerical Estimation Methods
• Focus on minimizing the sum of squares function using optimization techniques
• Newton-Raphson solution
– solve for increments using a second-order approximation of the sum of squares function
• Levenberg-Marquardt compromise
– modification of the Gauss-Newton iteration, with the introduction of a factor to improve the conditioning of the linear regression step
• Nelder-Mead
– pattern search method – doesn't use derivative information
• Hybrid approaches
– use a combination of derivative-free and derivative-based methods

• In general, the least squares parameter estimation approach represents a minimization problem
• Use an optimization technique to find parameter estimates that minimize the sum of squares of the residuals
Newton-Raphson Approach
• Start with the residual sum of squares function S(θ) and form the 2nd-order Taylor series expansion:

S(θ) ≈ S(θ^(i)) + ∂S(θ)/∂θ^T |_{θ^(i)} (θ - θ^(i)) + (1/2)(θ - θ^(i))^T H (θ - θ^(i))

where H is the Hessian of S(θ):
» the Hessian is the multivariable second derivative for a function of a vector

H = ∂²S(θ)/∂θ ∂θ^T |_{θ^(i)}

• Now solve for the next move by applying the stationarity condition (take the 1st derivative, set it to zero):

(θ - θ^(i)) = -H^{-1} ∂S(θ)/∂θ |_{θ^(i)}
Hessian
• Is the matrix of second derivatives (consider using Maple to generate it!):

H = ∂²S(θ)/∂θ ∂θ^T |_{θ^(i)} =
[ ∂²S/∂θ_1²       ∂²S/∂θ_1∂θ_2   …   ∂²S/∂θ_1∂θ_p ]
[ ∂²S/∂θ_1∂θ_2    ∂²S/∂θ_2²      …   ∂²S/∂θ_2∂θ_p ]
[      ⋮               ⋮         ⋱        ⋮        ]
[ ∂²S/∂θ_1∂θ_p    ∂²S/∂θ_2∂θ_p   …   ∂²S/∂θ_p²    ]
evaluated at θ^(i).
Jacobian and Hessian of S(θ)
• Can be found by the chain rule:

∂S(θ)/∂θ = -2 [∂η(θ)/∂θ^T]^T (y - η(θ)) = -2 V^T (y - η(θ))

H = ∂²S(θ)/∂θ ∂θ^T = -2 [∂²η(θ)/∂θ ∂θ^T] (y - η(θ)) + 2 V^T V

where ∂η(θ)/∂θ^T is the sensitivity matrix V that we had before, and ∂²η(θ)/∂θ ∂θ^T is a 3-dimensional array (tensor) of second derivatives.

The term 2 V^T V, i.e. 2 [∂η(θ)/∂θ^T]^T [∂η(θ)/∂θ^T], is often used as an approximation of the Hessian – the "expected value of the Hessian".
Newton-Raphson Approach
• Using the approximate Hessian (which is always positive semi-definite), the change in the parameter estimate is:

θ - θ^(i) = -H^{-1} ∂S(θ)/∂θ |_{θ^(i)} = (V^T V)^{-1} V^T (y - η(θ^(i)))

where V, evaluated at θ^(i), is the sensitivity matrix.
• This is the Gauss-Newton iteration!
• Issues – computing and updating the exact Hessian matrix
» potentially better progress – information about curvature
» the Hessian can cease to be positive definite (positive definiteness is required for the stationary point to be a minimum)
Levenberg-Marquardt Approach
• Improve the conditioning of the inverse by adding a factor – a biased regression solution
• Levenberg modification:

δ^(i+1) = (V^(i)T V^(i) + λ I_p)^{-1} V^(i)T (y - η(θ^(i)))

where I_p is the p × p identity matrix
• Marquardt modification:

δ^(i+1) = (V^(i)T V^(i) + λ D)^{-1} V^(i)T (y - η(θ^(i)))

where D is a diagonal matrix containing the diagonal entries of V^T V
• As λ → 0, the step approaches the Gauss-Newton iteration
• As λ → ∞, the step approaches the direction of steepest descent – an optimization technique
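A minimal sketch of a single Levenberg-Marquardt update as defined above; the function name and the damping-update comment reflect one common strategy rather than a prescribed one.

```python
import numpy as np

def lm_step(V, resid, lam, marquardt=True):
    """One Levenberg-Marquardt update: delta = (V^T V + lam*D)^(-1) V^T resid,
    where D = diag(V^T V) (Marquardt) or the identity (Levenberg)."""
    A = V.T @ V
    D = np.diag(np.diag(A)) if marquardt else np.eye(V.shape[1])
    return np.linalg.solve(A + lam * D, V.T @ resid)

# a simple damping strategy: decrease lam after a step that reduces S(theta),
# increase lam (and retry the step) after one that increases it
```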
Inference – Joint Confidence Regions
• Approximate confidence regions for parameters and predictions can be obtained by using a linearization approach
• Approximate covariance matrix for the parameter estimates:

Σ_θ̂ ≈ (V̂^T V̂)^{-1} σ²

where V̂ denotes the Jacobian of the expectation mapping evaluated at the least squares parameter estimates
• This covariance matrix is asymptotically the true covariance matrix for the parameter estimates as the number of data points becomes infinite
• 100(1-α)% joint confidence region for the parameters:

(θ - θ̂)^T V̂^T V̂ (θ - θ̂) ≤ p s² F_{p, n-p, α}

» compare to the linear regression case
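A minimal sketch of the linearization-based quantities above, assuming s² is taken as the MSE; the function name is illustrative.

```python
import numpy as np
from scipy import stats

def approx_joint_region(V_hat, resid, alpha=0.05):
    """Returns the approximate parameter covariance (V^T V)^(-1) s^2 and the bound
    p*s^2*F(p, n-p, alpha) for the region (theta - theta_hat)^T V^T V (theta - theta_hat) <= bound."""
    n, p = V_hat.shape
    s2 = resid @ resid / (n - p)                        # MSE estimate of the noise variance
    cov = np.linalg.inv(V_hat.T @ V_hat) * s2           # approximate parameter covariance
    bound = p * s2 * stats.f.ppf(1 - alpha, p, n - p)   # right-hand side of the joint region
    return cov, bound
```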
Inference – Marginal Confidence Intervals
• Marginal confidence intervals
» confidence intervals on individual parameters:

θ̂_i ± t_{ν, α/2} s_{θ̂_i}

where s_{θ̂_i} is the approximate standard error of the parameter estimate – the square root of the i-th diagonal element of the approximate parameter estimate covariance matrix, with the noise variance estimated as in the linear case:

Σ_θ̂ ≈ (V̂^T V̂)^{-1} s²
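A minimal sketch of computing these marginal intervals, with ν = n - p degrees of freedom assumed for the noise variance estimate.

```python
import numpy as np
from scipy import stats

def marginal_confidence_intervals(theta_hat, V_hat, resid, alpha=0.05):
    """Approximate marginal intervals theta_hat_i +/- t_{n-p, alpha/2} * se_i,
    with se_i the square root of the i-th diagonal of (V^T V)^(-1) s^2."""
    n, p = V_hat.shape
    s2 = resid @ resid / (n - p)
    se = np.sqrt(np.diag(np.linalg.inv(V_hat.T @ V_hat)) * s2)
    t_val = stats.t.ppf(1 - alpha / 2, n - p)
    return np.column_stack([theta_hat - t_val * se, theta_hat + t_val * se])
```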
Inference – Predictions and Confidence Intervals
• Confidence intervals on predictions of existing points in the dataset
– reflect the propagation of variability from the parameter estimates to the predictions
– expressions for the nonlinear regression case are based on the linear approximation, and are a direct extension of the results for linear regression

First, let's review the linear regression case…

Precision of the Predicted Responses – Linear

From the linear regression module (Module 1) – the predicted response from an estimated model has uncertainty, because it is a function of the parameter estimates, which have uncertainty.

e.g., Solder Wave Defect Model – first response at the point (-1, -1, -1):

ŷ_1 = β̂_0 + β̂_1(-1) + β̂_2(-1) + β̂_3(-1)

If the parameter estimates were uncorrelated, the variance of the predicted response would be:

Var(ŷ_1) = Var(β̂_0) + Var(β̂_1) + Var(β̂_2) + Var(β̂_3)

(recall the results for the variance of a sum of random variables)
Precision of the Predicted Responses – Linear

In general, both the variances and covariances of the parameter estimates must be taken into account.

For prediction at the k-th data point:

Var(ŷ_k) = x_k^T (X^T X)^{-1} x_k σ²
         = [x_k1 x_k2 … x_kp] (X^T X)^{-1} [x_k1 x_k2 … x_kp]^T σ²

Note – Var(ŷ_k) = x_k^T (X^T X)^{-1} x_k σ² = x_k^T Σ_β̂ x_k
Precision of the Predicted Responses – Nonlinear

Linearize the prediction equation about the least squares estimate:

ŷ_k = f(x_k, θ) ≈ f(x_k, θ̂) + v̂_k^T (θ - θ̂),   where v̂_k^T = ∂f(x_k, θ)/∂θ^T |_{θ̂}

For prediction at the k-th data point:

Var(ŷ_k) ≈ v̂_k^T (V̂^T V̂)^{-1} v̂_k σ²
         = [v̂_k1 v̂_k2 … v̂_kp] (V̂^T V̂)^{-1} [v̂_k1 v̂_k2 … v̂_kp]^T σ²

Note – Var(ŷ_k) ≈ v̂_k^T (V̂^T V̂)^{-1} v̂_k σ² = v̂_k^T Σ_θ̂ v̂_k
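A minimal sketch of the nonlinear prediction variance formula, assuming v̂_k^T is the k-th row of the sensitivity matrix evaluated at the least squares estimates and s² replaces σ².

```python
import numpy as np

def prediction_std_errors(V_hat, resid):
    """Approximate standard errors of the fitted responses:
    s_yhat_k = sqrt( v_k^T (V^T V)^(-1) v_k * s^2 ), with v_k^T the k-th row of V_hat."""
    n, p = V_hat.shape
    s2 = resid @ resid / (n - p)
    cov = np.linalg.inv(V_hat.T @ V_hat) * s2
    return np.sqrt(np.einsum('ij,jk,ik->i', V_hat, cov, V_hat))   # v_k^T cov v_k for each row k
```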
Estimating Precision of Predicted Responses

Use an estimate of the inherent noise variance:

s²_ŷk = x_k^T (X^T X)^{-1} x_k s²     (linear)

s²_ŷk = v_k^T (V^T V)^{-1} v_k s²     (nonlinear)

The degrees of freedom for the estimated variance of the predicted response are those of the estimate of the noise variance:
» replicates
» external estimate
» MSE
Confidence Limits for Predicted Responses

Linear and nonlinear cases – follow an approach similar to that for the parameters. The 100(1-α)% confidence limits for the predicted response at the k-th run are:

ŷ_k ± t_{ν, α/2} s_ŷk

» the degrees of freedom are those of the inherent noise variance estimate

If the prediction is for a response at conditions OTHER than one of the experimental runs, the limits are:

ŷ_k ± t_{ν, α/2} √(s²_ŷk + s_e²)
Precision of "Future" Predictions – Explanation

Suppose we want to predict the response at conditions other than those of the experimental runs --> a future run.

The value we observe will consist of the deterministic component plus the noise component. In predicting this value, we must consider:
» uncertainty from our prediction of the deterministic component
» the noise component

The variance of this future prediction is Var(ŷ) + σ², where Var(ŷ) is computed using the same expression as for the variance of predicted responses at experimental run conditions.

– For the linear case, with x containing the specific run conditions,

Var(ŷ) = x^T (X^T X)^{-1} x σ² = x^T Σ_β̂ x
Properties of LS Parameter Estimates

Key point – parameter estimates are random variables
» because of how stochastic variation in the data propagates through the estimation calculations
» parameter estimates have a variability pattern – probability distribution and density functions

Unbiased

E{θ̂} = θ

» the "average" of repeated data collection / estimation sequences will be the true value of the parameter vector

Consistent
» behaviour as the number of data points tends to infinity
» with probability 1,

lim_{N→∞} θ̂ = θ

» the distribution narrows as N becomes large

Efficient
» the variance of the least squares estimates is less than that of other types of parameter estimates

Linear Regression Case
– Least squares estimates are:
» unbiased
» consistent
» efficient

Nonlinear Regression Case
– Least squares estimates are:
» asymptotically unbiased – as the number of data points becomes infinite
» consistent
» efficient
Maximum Likelihood Estimation

Concept –
• Start with a function which describes the likelihood of the data given the parameter values
» probability density function
• Now change perspective – assume that the data observed are the most likely, and find the parameter values that make the observed data the most likely
» likelihood of the parameters given the observed data
• The estimates are "maximum likelihood" estimates

• For Normally distributed data (random shocks)
• Recall that for a given run we have

Y_i = f(x_i, θ) + ε_i,   ε_i ~ N(0, σ²)

• Probability density function for Y_i:
» the mean is given by f(x_i, θ), and the variance is σ²

f_{Y_i}(y) = 1/(√(2π) σ) exp{ -(1/(2σ²)) (y - f(x_i, θ))² }
Maximum Likelihood Estimation
• With n observations, given that the responses are independent (since the random shocks are independent), the joint density function for the observations is simply the product of the individual density functions:

f_{Y_1…Y_n}(y_1, …, y_n) = ∏_{i=1}^n 1/(√(2π) σ) exp{ -(1/(2σ²)) (y_i - f(x_i, θ))² }
                         = 1/((2π)^{n/2} σ^n) exp{ -(1/(2σ²)) Σ_{i=1}^n (y_i - f(x_i, θ))² }
Maximum Likelihood Estimation
• In shorthand, using vector notation for the observations, and now explicitly acknowledging that we "know", or are given, the parameter values:

f_Y(y | θ, σ) = 1/((2π)^{n/2} σ^n) exp{ -(1/(2σ²)) Σ_{i=1}^n (y_i - f(x_i, θ))² }
             = 1/((2π)^{n/2} σ^n) exp{ -(1/(2σ²)) (y - η(θ))^T (y - η(θ)) }

Note that we have written the sum of squares in vector notation as well, using the expectation mapping.
• Note also that the random noise standard deviation σ is also a parameter.
Likelihood Function
• Now, we have a set of observations, which we will assume are the most likely, and we define the likelihood function:

l(θ, σ | y) = 1/((2π)^{n/2} σ^n) exp{ -(1/(2σ²)) Σ_{i=1}^n (y_i - f(x_i, θ))² }
            = 1/((2π)^{n/2} σ^n) exp{ -(1/(2σ²)) (y - η(θ))^T (y - η(θ)) }

Log-likelihood Function
• We can also work with the log-likelihood function, which extracts the important part of the expression from the exponential:

L(θ, σ | y) = -n ln(σ) - (1/(2σ²)) Σ_{i=1}^n (y_i - f(x_i, θ))²
            = -n ln(σ) - (1/(2σ²)) (y - η(θ))^T (y - η(θ))

(dropping the additive constant that does not depend on θ or σ)
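Because maximizing this likelihood over θ reduces to minimizing S(θ), a least squares solver returns the maximum likelihood estimates under the Normal-noise assumption. A minimal sketch, using the Rumford-style model and illustrative (made-up) data:

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(theta, t, y):
    return y - (60.0 + 70.0 * np.exp(-theta[0] * t))   # y - f(t, theta) for the Rumford model

t = np.array([4.0, 7.0, 12.0, 24.0, 41.0])             # illustrative data, not the actual Rumford set
y = np.array([126.0, 122.0, 114.0, 98.0, 81.0])
fit = least_squares(residuals, x0=[0.1], args=(t, y))
theta_mle = fit.x
sigma2_mle = np.sum(fit.fun**2) / len(y)               # ML estimate of the noise variance: S(theta_hat)/n
```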
Maximum Likelihood Parameter Estimates
• Formal statement as an optimization problem:

max_{θ, σ} l(θ, σ | y) = max_{θ, σ} 1/((2π)^{n/2} σ^n) exp{ -(1/(2σ²)) Σ_{i=1}^n (y_i - f(x_i, θ))² }
                       = max_{θ, σ} 1/((2π)^{n/2} σ^n) exp{ -(1/(2σ²)) (y - η(θ))^T (y - η(θ)) }
Maximum Likelihood Estimation
• Examine the likelihood function:

l(θ, σ | y) = 1/((2π)^{n/2} σ^n) exp{ -(1/(2σ²)) Σ_{i=1}^n (y_i - f(x_i, θ))² }
            = 1/((2π)^{n/2} σ^n) exp{ -(1/(2σ²)) (y - η(θ))^T (y - η(θ)) }

• Regardless of the noise standard deviation, the likelihood function will be maximized by those parameter values minimizing the sum of squares between the observed data and the model predictions
» these are the parameter values that make the observed data the "most likely"
Maximum Likelihood Estimation
• In terms of the residual sum of squares function, we have the likelihood function:

l(θ, σ | y) = 1/((2π)^{n/2} σ^n) exp{ -(1/(2σ²)) S(θ) }

and the log-likelihood function:

L(θ, σ | y) = -n ln(σ) - (1/(2σ²)) S(θ)
Maximum Likelihood Estimation
• We can obtain the optimal parameter estimates separately from the noise standard deviation, given the form of the likelihood function
» minimize the sum of squares of the residuals – not a function of the noise standard deviation
• For Normally distributed data, the maximum likelihood parameter estimates are the same as the least squares estimates for nonlinear regression
• The maximum likelihood estimate for the noise variance is the mean squared error (MSE) form

s² = S(θ̂) / n

» obtain it by taking the derivative with respect to the variance and then solving
Maximum Likelihood Estimation

Further comments:
• We could develop the likelihood function starting with the distribution of the random shocks, ε, producing the same expression
• If the random shocks were independent, but had a different distribution, then the observations would also have a different distribution
» the expectation function defines the means of this distribution

f_{Y_1…Y_n}(y_1, …, y_n | θ) = ∏_{i=1}^n g(y_i; x_i, θ)

where g is the individual density function
» could then develop a likelihood function from this density function
Inference Using Likelihood Functions
• Generate likelihood regions – contours of the likelihood function
» the choice of contour value comes from examining the distribution
• Unlike the least squares approximate inference regions, which were developed using linearizations, the likelihood regions need not be elliptical or ellipsoidal
» they can have banana shapes, or can be open contours
• Likelihood regions – first, examine the likelihood function:

l(θ, σ | y) = 1/((2π)^{n/2} σ^n) exp{ -(1/(2σ²)) S(θ) }

– the dependence of the likelihood function on the parameters θ is through the sum of squares function S(θ)
Likelihood Regions
• Focusing on S(θ), we have

( [S(θ) - S(θ̂)] / p ) / ( S(θ̂) / (n - p) )  ~  F_{p, n-p}

– note that the denominator is the MSE – the residual variance
• This is an asymptotic result in the nonlinear case, and an exact result for the linear regression case
• We can generate likelihood regions as the values of θ such that

S(θ) ≤ S(θ̂) [1 + (p/(n-p)) F_{p, n-p, α}]
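A minimal sketch of computing the sum-of-squares cutoff for this region; the function name is illustrative.

```python
from scipy import stats

def likelihood_region_cutoff(S_hat, n, p, alpha=0.05):
    """Cutoff for an approximate 100(1-alpha)% likelihood region:
    { theta : S(theta) <= S_hat * (1 + p/(n-p) * F_{p, n-p, alpha}) }."""
    return S_hat * (1.0 + p / (n - p) * stats.f.ppf(1 - alpha, p, n - p))

# e.g., trace the contour S(theta) = likelihood_region_cutoff(S_hat, n, p) over a grid of theta values
```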
Likelihood Regions – Further Comments
• The likelihood regions are essentially sum of squares contours
– specifically for the case where the data are Normally distributed
• In the nonlinear regression case,

S(θ) ≈ S(θ̂) + (θ - θ̂)^T V̂^T V̂ (θ - θ̂)

and so the likelihood contours are approximated by the linearization-based approximate joint confidence region from least squares theory:

(θ - θ̂)^T V̂^T V̂ (θ - θ̂) ≤ p s² F_{p, n-p, α}
Likelihood Regions – Further Comments
• Using

S(θ) ≤ S(θ̂) [1 + (p/(n-p)) F_{p, n-p, α}]

is an approximate approach that approximates the exact likelihood region
– the approximation is in the sampling distribution argument used to derive the expression in terms of the F distribution
– this is asymptotically (as the number of data points becomes infinite) an exact likelihood region
• In general, an exact likelihood region would be given by

S(θ) ≤ c S(θ̂)

for some appropriately chosen constant "c"
– note that in the approximation, c = 1 + (p/(n-p)) F_{p, n-p, α}
Likelihood Regions – Further Comments
• In general, the difficulty in using S(θ) ≤ c S(θ̂) lies in finding a value of "c" that gives the correct coverage probability
– the coverage probability is the probability that the region contains the true parameter values
– the approximate result using the F-distribution is an attempt to get such a coverage probability
– the likelihood contour is reported to give better coverage probabilities for smaller data sets, and is less affected by nonlinearity
» Donaldson and Schnabel (1987)
Likelihood Regions – Examples
• Puromycin – from Bates and Watts (untreated cases)
– red is the 95% likelihood region
– blue is the 95% confidence region (linear approximation)
– note some difference in shape, orientation and size, but not too pronounced
– the square indicates the least squares estimates
– Maple worksheet available on the course web

• BOD – from Bates and Watts
– red is the 95% likelihood region
– blue is the 95% confidence region (linear approximation)
– note the significant difference in shapes
– note that the confidence ellipse includes the value of 0 for θ_2
– the square indicates the least squares estimates
– Maple worksheet available on the course web
Bayesian Estimation

Premise –
– the distribution of the observations is characterized by parameters, which in turn have some distribution of their own
– concept of prior knowledge of the values that the parameters might assume
• Model

Y = η(θ) + ε

• Noise characteristics

ε ~ i.i.d. N(0, σ²)

• Approach – use Bayes' theorem
Conditional Expectation

Recall conditional probability:

P(X | Y) = P(X ∩ Y) / P(Y)

» the probability of X given Y, where X and Y are events

For continuous random variables, we have a conditional probability density function expressed in terms of the joint and marginal density functions:

f_{X|Y}(x | y) = f_{XY}(x, y) / f_Y(y)

Note – using this, we can also define the conditional expectation of X given Y:

E{X | Y} = ∫_{-∞}^{∞} x f_{X|Y}(x | y) dx
Bayes' Theorem
• useful for situations in which we have incomplete probability knowledge
• forms the basis for statistical estimation
• suppose we have two events, A and B
• from conditional probability:

P(A ∩ B) = P(A | B) P(B) = P(B ∩ A) = P(B | A) P(A)

so

P(A | B) = P(B | A) P(A) / P(B)    for P(B) > 0
Bayesian Estimation
• Premise – the parameters can have their own distribution – a prior distribution f(θ, σ)
• The posterior distribution of the parameters (the distribution of the parameters given the data) can be related to the prior distribution of the parameters and the likelihood function:

f(θ, σ | y) = f(θ, σ, y) / f(y)
            = f(y | θ, σ) f(θ, σ) / f(y)
            ∝ f(y | θ, σ) f(θ, σ)
Bayesian Estimation
• The noise standard deviation σ is a nuisance parameter, and we can focus instead on the model parameters:

f(θ | y) ∝ f(y | θ) f(θ)

• How are the posterior distributions with/without σ related?

f(θ | y) = ∫ f(θ, σ | y) dσ
Bayesian Estimation
• Bayes' theorem
• Posterior density function in terms of the prior density function
• Equivalence for Normal noise with a uniform prior – least squares / maximum likelihood estimates
• Inference – posterior density regions
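A minimal sketch of the uniform-prior case: with Normal noise, a known σ, and a flat prior, the unnormalized posterior is proportional to exp(-S(θ)/(2σ²)), so its mode coincides with the least squares / maximum likelihood estimate. The model, data, and σ value below are illustrative assumptions.

```python
import numpy as np

def unnormalized_posterior(theta_grid, t, y, sigma=2.0):
    """Grid evaluation of f(theta | y) ∝ exp(-S(theta)/(2*sigma^2)) for a flat prior,
    using the Rumford-style model f(t, theta) = 60 + 70*exp(-theta*t)."""
    S = np.array([np.sum((y - (60.0 + 70.0 * np.exp(-th * t)))**2) for th in theta_grid])
    post = np.exp(-S / (2.0 * sigma**2))
    return post / np.trapz(post, theta_grid)    # normalize numerically over the grid
```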
Diagnostics for Nonlinear Regression
• Similar to the linear case
• Qualitative – residual plots
– residuals vs.
» factors in the model
» sequence (observation) number
» factors not in the model (covariates)
» predicted responses
– things to look for:
» trend remaining
» non-constant variance
» meandering in sequence number – serial correlation
• Qualitative – plot of observed and predicted responses
– predicted vs. observed – slope of 1
– predicted and observed – as a function of the independent variable(s)
Diagnostics for Nonlinear Regression
• Quantitative diagnostics
– ratio tests:
» MSR/MSE – as in the linear case – a coarse measure of a significant trend being modeled
» lack of fit test – if replicates are present
• as in the linear case – compute the lack of fit sum of squares and the error sum of squares, and compare their ratio
» R-squared and adjusted R-squared
• coarse measures of significant trend
• squared correlation of observed and predicted values
Diagnostics for Nonlinear Regression
• Quantitative diagnostics
– parameter confidence intervals:
» examine marginal intervals for the parameters
• based on linear approximations
• can also use hypothesis tests
» consider dropping parameters that aren't statistically significant
» issue in this case – parameters are more likely to be involved in more complex expressions involving factors and parameters
• e.g., Arrhenius reaction rate expression
» if possible, examine joint confidence regions, likelihood regions, HPD regions
• can also test to see if a set of parameter values lies in a particular region
Diagnostics for Nonlinear Regression
• Quantitative diagnostics
– parameter estimate correlation matrix:
» examine the correlation matrix for the parameter estimates
• based on the linear approximation
• compute the covariance matrix, then normalize using the pairs of standard deviations
» note significant correlations and keep these in mind when retaining/deleting parameters using marginal significance tests
» significant correlation between some parameter estimates may indicate over-parameterization relative to the data collected
• consider dropping some of the parameters whose estimates are highly correlated
• further discussion – Chapter 3 of Bates and Watts (1988), Chapter 5 of Seber and Wild (1988)
Practical Considerations
• Convergence –
– "tuning" of the estimation algorithm – e.g., step size factors
– knowledge of the sum of squares (or likelihood or posterior density) surface – are there local minima?
» consider plotting the surface
– reparameterization
• Ensuring physically realistic parameter estimates
– common problem – parameters should be positive
– solutions (see the sketch after this list)
» constrained optimization approach to enforce non-negativity of the parameters
» reparameterization – for example

θ = exp(φ)           (θ positive)
θ = 10^φ             (θ positive)
θ = 1/(1 + e^{-φ})   (θ bounded between 0 and 1)
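A minimal sketch of the exponential reparameterization in use: the solver works in the unconstrained φ space while the model always sees a positive θ. The data and starting value are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals_phi(phi, t, y):
    theta = np.exp(phi[0])                           # reparameterization guarantees theta > 0
    return y - (60.0 + 70.0 * np.exp(-theta * t))

t = np.array([4.0, 7.0, 12.0, 24.0, 41.0])           # illustrative data
y = np.array([126.0, 122.0, 114.0, 98.0, 81.0])
fit = least_squares(residuals_phi, x0=[np.log(0.1)], args=(t, y))
theta_hat = np.exp(fit.x[0])                         # map back to the physical parameter
```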
Practical Considerations
• Correlation between parameter estimates
– reduce by reparameterization
– exponential example – centre the independent variable about a reference value x_0:

θ_1 exp(-θ_2 x) = θ_1 exp(-θ_2 (x - x_0 + x_0))
                = θ_1 exp(-θ_2 x_0) exp(-θ_2 (x - x_0))
                = θ̃_1 exp(-θ_2 (x - x_0)),   with θ̃_1 = θ_1 exp(-θ_2 x_0)
Practical Considerations
• Particular example – the Arrhenius rate expression:

k_0 exp(-E/(R T)) = k_0 exp( -(E/R)(1/T - 1/T_ref + 1/T_ref) )
                  = k_0 exp(-E/(R T_ref)) exp( -(E/R)(1/T - 1/T_ref) )
                  = k_ref exp( -(E/R)(1/T - 1/T_ref) )

– effectively the reaction rate relative to a reference temperature
– reduces the correlation between the parameter estimates and improves the conditioning of the estimation problem
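A minimal sketch of the two parameterizations side by side; the reference temperature value is an illustrative choice.

```python
import numpy as np

R = 8.314  # J/(mol K)

def rate_original(T, k0, E):
    """Original parameterization: k(T) = k0 * exp(-E/(R*T))."""
    return k0 * np.exp(-E / (R * T))

def rate_reparameterized(T, kref, E, Tref=350.0):
    """Reparameterized about Tref: k(T) = kref * exp(-(E/R)*(1/T - 1/Tref)),
    with kref = k0*exp(-E/(R*Tref)); estimating (kref, E) is typically better conditioned."""
    return kref * np.exp(-(E / R) * (1.0 / T - 1.0 / Tref))
```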
Practical Considerations
• Scaling – of parameters and responses
• Choices
– scale by nominal values
» nominal values – design centre point, typical value over the range, average value
– scale by standard errors
» parameters – estimate of the standard deviation of the parameter estimate
» responses – by the standard deviation of the observations – the noise standard deviation
– combinations – by nominal value / standard error
• Scaling can improve the conditioning of the estimation problem (e.g., scale the sensitivity matrix V), and can facilitate comparison of terms on similar (dimensionless) bases
Practical Considerations
• Initial guesses
– from prior knowledge
– from prior results
– by simplifying the model equations
– by exploiting conditionally linear parameters – fix these, estimate the remaining parameters
Dealing with Heteroscedasticity
• Problem it poses – precision of the parameter estimates
• Weighted least squares estimation
• Variance stabilizing transformations – e.g., Box-Cox transformations
Estimating Parameters in Differential Equation Models
• The model is now described by a differential equation:

dy/dt = f(y, u, t; θ),   y(t_0) = y_0

• Referred to as "compartment models" in the biosciences.
• Issues –
– estimation – what is the effective expectation function here?
» the integral curve or flow (solution to the differential equation)
– initial conditions – known? unknown and estimated? fixed (conditional estimation)?
– performing the Gauss-Newton iteration
» or another numerical approach
– solving the differential equation
Estimating Parameters in Differential Equation Models

What is the effective expectation function here?
– differential equation model:

dy/dt = f(y, u, t; θ),   y(t_0) = y_0

– y – response, u – independent variables (factors), t – becomes a factor as well
– the expectation function is the solution of the differential equation, evaluated at the different times at which observations are taken:

η_i(θ) = y(t_i, u_i; θ, y_0)

– note the implicit dependence on the initial conditions, which may be assumed or estimated
– often this is a conceptual model and not an analytical solution – the "solution" is often the numerical solution at specific times (a subroutine)
Estimating Parameters in Differential Equation Models
• Expectation mapping:

η(θ) = [η_1(θ), η_2(θ), …, η_n(θ)]^T = [y(t_1, u_1; θ, y_0), y(t_2, u_2; θ, y_0), …, y(t_n, u_n; θ, y_0)]^T

• Random noise – is assumed to be additive on the observations:

Y_i = η_i(θ) + ε_i,  i = 1, …, n,   i.e.,   Y = η(θ) + ε
Estimating Parameters in Differential Equation Models

Estimation approaches
– least squares (Gauss-Newton / Newton-Raphson iteration), maximum likelihood, Bayesian
– will require sensitivity information – the sensitivity matrix V:

V(θ) = [ ∂y(t_1, u_1; θ, y_0)/∂θ^T ]
       [ ∂y(t_2, u_2; θ, y_0)/∂θ^T ]
       [            ⋮              ]
       [ ∂y(t_n, u_n; θ, y_0)/∂θ^T ]

How can we get sensitivity information without having an explicit solution to the differential equation model?
Estimating Parameters in Differential Equation Models

Sensitivity equations
– we can interchange the order of differentiation in order to obtain differential equations for the sensitivities – referred to as the sensitivity equations:

∂/∂θ (dy/dt) = d/dt (∂y/∂θ) = [∂f(y, u, t; θ)/∂y] (∂y/∂θ) + ∂f(y, u, t; θ)/∂θ,   with initial condition ∂y(t_0)/∂θ = ∂y_0/∂θ

– note that the initial condition for the response may also be a function of the parameters – e.g., if we assume that the process is initially at steady state → parametric dependence through the steady-state form of the model
– these differential equations are solved to obtain the parameter sensitivities at the necessary time points t_1, …, t_n
Estimating Parameters in Differential Equation Models

Sensitivity equations
– the sensitivity equations are coupled with the original model differential equations – for the single differential equation (and response) case, we will have p+1 simultaneous differential equations, where p is the number of parameters:

dy/dt = f(y, u, t; θ)
d/dt (∂y/∂θ_1) = [∂f(y, u, t; θ)/∂y] (∂y/∂θ_1) + ∂f(y, u, t; θ)/∂θ_1
d/dt (∂y/∂θ_2) = [∂f(y, u, t; θ)/∂y] (∂y/∂θ_2) + ∂f(y, u, t; θ)/∂θ_2
⋮
d/dt (∂y/∂θ_p) = [∂f(y, u, t; θ)/∂y] (∂y/∂θ_p) + ∂f(y, u, t; θ)/∂θ_p
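A minimal sketch of integrating the model and its sensitivity equations together, using the first-order model with gain θ_1 and time constant θ_2 that appears in the step-response example later in these notes; the parameter values, step input, and time grid are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Model: dy/dt = (-y + theta1*u)/theta2, augmented with its two sensitivity equations.
# State vector z = [y, dy/dtheta1, dy/dtheta2].
def rhs(t, z, theta1, theta2, u=1.0):
    y, s1, s2 = z
    dydt = (-y + theta1 * u) / theta2           # model equation
    dfdy = -1.0 / theta2                        # partial of f with respect to y
    dfdth1 = u / theta2                         # partial of f with respect to theta1
    dfdth2 = (y - theta1 * u) / theta2**2       # partial of f with respect to theta2
    ds1dt = dfdy * s1 + dfdth1                  # sensitivity equation for dy/dtheta1
    ds2dt = dfdy * s2 + dfdth2                  # sensitivity equation for dy/dtheta2
    return [dydt, ds1dt, ds2dt]

t_obs = np.linspace(0.0, 10.0, 11)
sol = solve_ivp(rhs, (0.0, 10.0), y0=[0.0, 0.0, 0.0], t_eval=t_obs, args=(2.0, 3.0))
V = sol.y[1:].T     # n x p sensitivity matrix evaluated at the observation times
```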
Estimating Parameters in Differential Equation Models

Variations on single response differential equation models
– single response differential equation models need not be restricted to a single differential equation
– we really have a single measured output variable, and multiple factors
» control terminology – a multi-input single-output (MISO) system

Differential equation model:

dx/dt = f(x, u, t; θ),   x(t_0) = x_0
y = h(x, u, t; θ)

Sensitivity equations:

d/dt (∂x/∂θ_i) = [∂f(x, u, t; θ)/∂x^T] (∂x/∂θ_i) + ∂f(x, u, t; θ)/∂θ_i,   ∂x(t_0)/∂θ_i = ∂x_0/∂θ_i,   i = 1, …, p

∂y/∂θ_i = [∂h(x, u, t; θ)/∂x^T] (∂x/∂θ_i) + ∂h(x, u, t; θ)/∂θ_i
Estimating Parameters in Differential Equation Models

Options for solving the sensitivity equations –
– solve the model differential equations and sensitivity equations simultaneously
» potentially large number of simultaneous differential equations
• n_s(1+p) differential equations
» numerical conditioning
» the "direct" approach
– solve the model differential equations and the sensitivity equations sequentially
» integrate the model equations forward to the next time step
» integrate the sensitivity equations forward, using the updated values of the states
» the "decoupled direct" approach
Interpreting Sensitivity Responses

Example – first-order linear differential equation with a step input:

dy/dt = -(1/θ_2) y + (θ_1/θ_2) u

[Figure: step response of the first-order model, and the corresponding parameter sensitivities as functions of time.]
Estimating Parameters in Differential Equation Models
• When there are multiple responses being measured (e.g., temperature, concentrations of different species), the resulting estimation problem is a multi-response estimation problem
• Other issues
– identifiability of the parameters
– how "time" is treated – as an independent variable (as in the earlier presentation), or treating responses at different times as different responses
– obtaining initial parameter estimates
» see, for example, the discussion in Bates and Watts, and in Seber and Wild
– serial correlation in the random noise
» particularly if the random shocks enter in the differential equation, rather than being additive to the measured responses
Multi-response Estimation

Multi-response estimation refers to the case in which observations are taken on more than one response variable.

Examples
– measuring several different variables – concentration, temperature, yield
– measuring a functional quantity at a number of different index values – examples:
» molecular weight distribution – measuring the differential weight fraction at a number of different chain lengths
» particle size distribution – measuring the differential weight fraction at a number of different particle size bins
» time response – treating the response at different times as individual responses
» spatial temperature distribution – treating the temperature at different spatial locations as individual responses
Multi-response Estimation

Problem formulation
– responses
» n runs
» m responses

Y = [Y_1 Y_2 … Y_m] = [ y_11 y_12 … y_1m ]
                      [ y_21 y_22 … y_2m ]
                      [  ⋮    ⋮        ⋮  ]
                      [ y_n1 y_n2 … y_nm ]

– model equations
» m model equations – one for each response – evaluated at the n run conditions
» model for the jth response evaluated at the ith run conditions:

H = [h_ij] = [f_j(x_i, θ)]
Multi-response Estimation
• Random noise
– we have a random noise term for each observation of each response – denote the random noise in the jth response observed at the ith run conditions as Z_ij
– we have a matrix of random noise elements:

Z = [Z_ij] = [ Z_11 Z_12 … Z_1m ]
             [ Z_21 Z_22 … Z_2m ]
             [  ⋮    ⋮        ⋮  ]
             [ Z_n1 Z_n2 … Z_nm ]

– issue – what is the correlation structure of the random noise? (between-run correlation? within-run correlation?)
Multi-response Estimation

Covariance structure of the random noise – possible structures:
– no covariance between the random noise components – all random noise components are independent and identically distributed
» can use the least squares solution in this instance
– within-run covariance – between responses – that is the same for each run condition
» responses have a certain inherent covariance structure
» covariance matrix
» determinant criterion for estimation
» alternative – generalized least squares – stack the observations
– between-run covariance
– complete covariance – between runs, across responses
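A minimal sketch of the determinant criterion named above (often attributed to Box and Draper) for the within-run covariance case: minimize det(Z^T Z), where Z = Y - H(θ) is the n × m residual matrix. The model function and starting values are assumed to be supplied by the user.

```python
import numpy as np
from scipy.optimize import minimize

def determinant_criterion(theta, Y, model):
    """Determinant criterion for multi-response estimation: objective is det(Z^T Z)
    with Z = Y - H(theta), where Y and H(theta) are both n x m."""
    Z = Y - model(theta)                      # n x m matrix of residuals
    _, logdet = np.linalg.slogdet(Z.T @ Z)
    return logdet                             # minimizing log det is equivalent and better scaled

# illustrative usage (model and theta0 assumed defined):
# result = minimize(determinant_criterion, theta0, args=(Y, model), method='Nelder-Mead')
```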