Module 2: Nonlinear Regression
CHEE824 Winter 2004, J. McLellan

Outline (single response)
• Notation
• Assumptions
• Least squares estimation – Gauss-Newton iteration, convergence criteria, numerical optimization
• Diagnostics
• Properties of estimators and inference
• Other estimation formulations – maximum likelihood and Bayesian estimators
• Dealing with differential equation models
• And then on to multi-response…

Notation
• Model: Yᵢ = f(xᵢ, θ) + εᵢ, where εᵢ is the random noise component, xᵢ contains the explanatory variables (i-th run conditions), and θ is a p-dimensional vector of parameters.
• With n experimental runs, the nonlinear regression model is Y = η(θ) + ε, where
  η(θ) = [f(x₁, θ), f(x₂, θ), …, f(xₙ, θ)]ᵀ
  η(θ) defines the expectation surface.
• Model specification involves the form of the equation and its parameterization.

Example #1 (Bates and Watts, 1988) – Rumford data
• Cooling experiment – grind a cannon barrel with a blunt bore, then monitor the temperature while it cools
  » Newton's law of cooling – a differential equation with an exponential solution
  » Independent variable is t (time); ambient temperature was 60 °F
  » Model equation: f(t, θ) = 60 + 70 e^(−θt)
  » 1st-order dynamic decay

Rumford Example
• Consider two observations – a 2-dimensional observation space
  » at t = 4 and t = 41 min

Parameter Estimation – Linear Regression Case
• The observation vector y is approximated by ŷ = Xβ̂; the residual vector is y − ŷ. The expectation surface is the plane traced out by Xβ.

Parameter Estimation – Nonlinear Regression Case
• The observation vector y is approximated by ŷ = η(θ̂); the residual vector is y − ŷ. The expectation surface is η(θ).

Parameter Estimation – Gauss-Newton Iteration
• Least squares estimation – minimize S(θ) = ‖y − η(θ)‖² = eᵀe
• Iterative procedure consisting of:
  1. Linearization about the current estimate of the parameters
  2. Solution of the linearized regression problem to obtain the next parameter estimate
  3. Iteration until a convergence criterion is satisfied

Linearization about a Nominal Parameter Vector
• Linearize the expectation function η(θ) in the parameter vector θ about a nominal vector θ₀:
  η(θ) ≈ η(θ₀) + V₀(θ − θ₀)
• V₀ = ∂η/∂θᵀ evaluated at θ₀ is the sensitivity matrix – the Jacobian of the expectation function – containing first-order sensitivity information. Its (i, j) entry is ∂f(xᵢ, θ)/∂θⱼ evaluated at θ₀.

Parameter Estimation – Gauss-Newton Iteration
• Iterative procedure:
  1. Linearize about the current estimate: y − η(θ⁽ⁱ⁾) ≈ V⁽ⁱ⁾ δ⁽ⁱ⁺¹⁾
  2. Solve the linearized regression problem for the update:
     δ⁽ⁱ⁺¹⁾ = (V⁽ⁱ⁾ᵀV⁽ⁱ⁾)⁻¹ V⁽ⁱ⁾ᵀ (y − η(θ⁽ⁱ⁾))
  3. Iterate until a convergence criterion is satisfied – for example, ‖θ⁽ⁱ⁺¹⁾ − θ⁽ⁱ⁾‖ < tol

Parameter Estimation – Nonlinear Regression Case
• The tangent plane approximation replaces the expectation surface by the plane η(θ⁽ⁱ⁾) + V⁽ⁱ⁾(θ − θ⁽ⁱ⁾).

Quality of the Linear Approximation
… depends on two components:
  1. The degree to which the tangent plane provides a good approximation to the expectation surface – the planar assumption – related to intrinsic nonlinearity.
  2. Uniformity of the coordinates on the expectation surface – the linearization implies a uniform coordinate system on the tangent plane, so equal changes in a given parameter produce equal-sized increments on the tangent plane; however, equal-sized increments in a given parameter may map to unequal-sized increments on the expectation surface.
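As a quick numerical illustration of steps 1–3, here is a minimal Gauss-Newton loop for a Rumford-style cooling model f(t, θ) = 60 + 70 e^(−θt). The data below are synthetic (generated with θ = 0.05), not the actual Rumford measurements, and the function names are mine:

```python
import numpy as np

def f(t, theta):
    # Rumford-style cooling model: ambient 60 F, initial excess 70 F
    return 60.0 + 70.0 * np.exp(-theta * t)

def jacobian(t, theta):
    # Sensitivity matrix V (n x 1 here, since p = 1)
    return (-70.0 * t * np.exp(-theta * t)).reshape(-1, 1)

def gauss_newton(t, y, theta0, tol=1e-8, max_iter=50):
    theta = np.array([theta0], dtype=float)
    for _ in range(max_iter):
        resid = y - f(t, theta)                  # y - eta(theta_i)
        V = jacobian(t, theta)
        # Step 2: solve the linearized least-squares problem for the increment
        delta = np.linalg.lstsq(V, resid, rcond=None)[0]
        theta = theta + delta
        # Step 3: relative-change convergence criterion
        if np.linalg.norm(delta) < tol * (np.linalg.norm(theta) + tol):
            break
    return theta

# Synthetic data from theta = 0.05 (not the actual Rumford measurements)
rng = np.random.default_rng(0)
t = np.linspace(4.0, 41.0, 13)
y = f(t, 0.05) + rng.normal(0.0, 0.2, t.size)
theta_hat = gauss_newton(t, y, theta0=0.1)
print(theta_hat)   # close to 0.05
```

With only one parameter, the normal-equations solve collapses to a scalar division, but the same loop works for p > 1 once `jacobian` returns an n × p matrix.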
Rumford Example
• Consider two observations – a 2-dimensional observation space, at t = 4 and t = 41 min.
• [Figure: expectation curve in the (y(4), y(41)) observation space, traced as θ is changed in increments of 0.025 between the labelled endpoints θ = 0 and θ = 10, showing non-uniformity in coordinates relative to the tangent plane approximation.]

Rumford Example
• Model function: f(t, θ) = 60 + 70 e^(−θt)
• The dataset consists of 13 observations.
• Exercise – what is the sensitivity matrix? What are its dimensions?

Rumford Example – Tangent Approximation
• At θ = 0.05: [Figure: note the non-uniformity of coordinates on the expectation curve versus the uniform coordinates on the tangent plane approximation.]
• At θ = 0.7: [Figure.]

Parameter Estimation – Gauss-Newton Iteration
• Parameter estimate after the j-th iteration: θ⁽ʲ⁾ = θ⁽ʲ⁻¹⁾ + δ⁽ʲ⁾
• Convergence can be declared by looking at:
  » relative progress in the parameter estimate: ‖δ⁽ⁱ⁾‖ / ‖θ⁽ⁱ⁾‖ < tol
  » relative progress in reducing the sum of squares function: (S(θ⁽ⁱ⁾) − S(θ⁽ⁱ⁺¹⁾)) / S(θ⁽ⁱ⁾) < tol
  » a combination of progress in the sum of squares reduction and progress in the parameter estimates

Parameter Estimation – Gauss-Newton Iteration
• Convergence:
  – The relative-change criteria in the sum of squares or parameter estimates terminate on lack of progress, rather than on convergence (Bates and Watts, 1988).
  – Alternative – the relative offset criterion, due to Bates and Watts:
    » we will have converged to the true optimum (the least squares estimates) if the residual vector e = y − η(θ) is orthogonal to the nonlinear expectation surface, and in particular to its tangent plane approximation at the true parameter values
    » if we haven't converged, the residual vector won't necessarily be orthogonal to the tangent plane at the current parameter iterate
Parameter Estimation – Gauss-Newton Iteration
• Convergence (continued):
  » Declare convergence by comparing the component of the residual vector lying on the tangent plane to the component orthogonal to the tangent plane – if the component on the tangent plane is small, then we are close to orthogonality and can declare convergence:
    ‖Q₁ᵀ(y − η(θ⁽ⁱ⁾))‖ / √p   versus   ‖Q₂ᵀ(y − η(θ⁽ⁱ⁾))‖ / √(n − p)
  » Note also that after each iteration, the residual vector is orthogonal to the tangent plane computed at the previous parameter iterate (where the linearization was conducted), and not necessarily to the tangent plane and expectation surface at the most recently computed parameter estimate.

Computational Issues in the Gauss-Newton Iteration
• The Gauss-Newton iteration can be subject to poor numerical conditioning, as the linearization is recomputed at new parameter iterates.
  » Conditioning problems arise in the inversion of VᵀV.
  » Solution – use a decomposition technique: the QR decomposition, or the singular value decomposition (SVD).
  » Decomposition techniques will accommodate changes in the rank of the Jacobian (sensitivity) matrix V.

QR Decomposition
• An n × p matrix V takes vectors from a p-dimensional space M (e.g., p = 2) into an n-dimensional space N (e.g., n = 3).
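The relative offset comparison above can be sketched numerically: a full QR decomposition of V gives the first p columns of Q spanning the tangent plane and the remaining n − p columns orthogonal to it. A minimal NumPy sketch (function name and test vectors are mine):

```python
import numpy as np

def relative_offset(V, resid):
    # Full QR: the first p columns of Q span the tangent plane,
    # the remaining n - p columns are orthogonal to it
    n, p = V.shape
    Q, _ = np.linalg.qr(V, mode='complete')
    tangent = Q[:, :p].T @ resid    # component on the tangent plane
    orthog = Q[:, p:].T @ resid     # component orthogonal to the plane
    return (np.linalg.norm(tangent) / np.sqrt(p)) / \
           (np.linalg.norm(orthog) / np.sqrt(n - p))

# Toy check: a residual orthogonal to both columns of V gives offset ~ 0
V = np.array([[1., 1.],
              [1., 0.],
              [1., -1.]])
e_perp = np.array([1., -2., 1.])
print(relative_offset(V, e_perp))            # essentially zero: converged
print(relative_offset(V, e_perp + V[:, 0]))  # nonzero: not converged
```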
QR Decomposition
• The columns of the matrix V (viewed as a linear mapping) are the images of the basis vectors of the domain space M, expressed in the basis of the range space N.
• If M is a p-dimensional space and N is an n-dimensional space (with p < n), then V defines a p-dimensional linear subspace in N, as long as V is of full rank.
  – Think of the expectation plane in the observation space for the linear regression case: the observation space is n-dimensional, while the expectation plane is p-dimensional, where p is the number of parameters.
• We can find a new basis for the range space N so that the first p basis vectors span the range of the mapping V, and the remaining n − p basis vectors are orthogonal to the range of V.
• In the new range-space basis, the mapping has zero elements in the last n − p positions, since the last n − p basis vectors are orthogonal to the range of V. By construction, we can express V as an orthogonal matrix times an upper-triangular matrix. This is the QR decomposition:
  V = QR = [q₁ q₂ … qₙ] [R₁; 0]

QR Decomposition – Example
• Linear regression with
  X = [1  1; 1  0; 1  −1]
  Perform the QR decomposition X = QR.
• In the new basis, the expectation plane becomes
  X̃ = QᵀX = R = [1.7321  0; 0  1.4142; 0  0]
• The new basis for the range space is given by the columns of Q:
  Q = [0.5774  0.7071  0.4082; 0.5774  0  −0.8165; 0.5774  −0.7071  0.4082]
• Visualize the new basis vectors for the observation space relative to the original basis: in the new coordinates, z₁ is the distance along q₁, z₂ the distance along q₂, and z₃ the distance along q₃.
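The 3 × 2 example above can be checked directly with NumPy. Note that `numpy.linalg.qr` may flip the signs of some columns of Q and the corresponding rows of R; the factorization is the same up to those signs:

```python
import numpy as np

# The example design matrix: columns [1, 1, 1] and [1, 0, -1]
X = np.array([[1., 1.],
              [1., 0.],
              [1., -1.]])
Q, R = np.linalg.qr(X, mode='complete')
# Diagonal magnitudes of R1 are the column norms sqrt(3) and sqrt(2);
# the last row of R is zero, and Q is orthogonal
print(np.round(R, 4))
print(np.round(Q, 4))
```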
QR Decomposition
• There are various ways to compute a QR decomposition:
  – Gram-Schmidt orthogonalization – sequential orthogonalization
  – Householder transformations – a sequence of reflections

QR Decomposition and Parameter Estimation
• How does the QR decomposition aid parameter estimation?
  » It identifies the effective rank of the estimation problem in the course of computing the decomposition: the number of vectors spanning the range space of V is the effective dimension of the estimation problem. If this dimension changes with successive linearizations, the QR decomposition will track the change.
  » Reformulating the estimation problem using a QR decomposition improves the numerical conditioning and ease of solution.
  » Over-constrained problem: e.g., for the linear regression case, find β to come as close as possible to satisfying Y = Xβ. With X = QR,
    QᵀY ≈ [R₁; 0] β
• R₁ is upper-triangular, and so the parameter estimates can be obtained sequentially by back-substitution.
• The Gauss-Newton iteration follows the same pattern – perform a QR decomposition of each V⁽ⁱ⁾.
• The QR decomposition also plays an important role in understanding nonlinearity:
  » Look at the second-derivative vectors and partition them into components lying in the tangent plane (associated with tangential curvature) and components lying orthogonal to the tangent plane (associated with intrinsic curvature).
  » The QR decomposition can be used to construct this partitioning: the first p basis vectors span the tangent plane, and the remaining vectors are orthogonal to it.

Singular Value Decomposition
• Singular value decompositions (SVDs) are similar to eigenvector decompositions for matrices.
• SVD: X = UΣVᵀ, where
  » U is the "output rotation matrix"
  » V is the "input rotation matrix" (please don't confuse it with the Jacobian!)
  » Σ is a diagonal matrix of singular values
Singular Value Decomposition
• Singular values: σᵢ = √λᵢ(XᵀX), i.e., the positive square roots of the eigenvalues of XᵀX, which is square (p × p, where p is the number of parameters).
• The input singular vectors form the columns of V, and are the eigenvectors of XᵀX.
• The output singular vectors form the columns of U, and are the eigenvectors of XXᵀ.
• One perspective – find new bases for the input space (parameter space) and the output space (observation space) in which X becomes a diagonal matrix – performing only scaling, no rotation.
• For parameter estimation problems, U will be n × n, V will be p × p, and Σ will be n × p.

SVD and Parameter Estimation
• The SVD accommodates the effective rank of the estimation problem, and can track changes in the rank of the problem.
  » Recent work tries to alter the dimension of the problem using SVD information.
• The SVD can improve the numerical conditioning and ease of solution of the problem.

Other Numerical Estimation Methods
• Focus on minimizing the sum of squares function using optimization techniques:
  – Newton-Raphson – solve for increments using a second-order approximation of the sum of squares function.
  – Levenberg-Marquardt compromise – a modification of the Gauss-Newton iteration, introducing a factor to improve the conditioning of the linear regression step.
  – Nelder-Mead – a pattern search method that doesn't use derivative information.
  – Hybrid approaches – combinations of derivative-free and derivative-based methods.
• In general, the least squares parameter estimation approach represents a minimization problem: use an optimization technique to find the parameter estimates that minimize the sum of squares of the residuals.
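A small NumPy check of the stated relationship between the singular values and the eigenvalues of XᵀX, reusing the 3 × 2 example matrix from the QR slides:

```python
import numpy as np

X = np.array([[1., 1.],
              [1., 0.],
              [1., -1.]])
U, s, Vt = np.linalg.svd(X)              # X = U @ Sigma @ Vt
eigvals = np.linalg.eigvalsh(X.T @ X)    # eigenvalues of X^T X, ascending
print(s)                                 # ~ [1.7321, 1.4142]
print(np.sqrt(eigvals[::-1]))            # the same values
```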
Newton-Raphson Approach
• Start with the residual sum of squares function S(θ) and form the 2nd-order Taylor series expansion about the current iterate θ⁽ⁱ⁾:
  S(θ) ≈ S(θ⁽ⁱ⁾) + ∇S(θ⁽ⁱ⁾)ᵀ(θ − θ⁽ⁱ⁾) + ½ (θ − θ⁽ⁱ⁾)ᵀ H (θ − θ⁽ⁱ⁾)
  where H is the Hessian of S(θ), evaluated at θ⁽ⁱ⁾ – the multivariable second derivative of a function of a vector.
• Now solve for the next move by applying the stationarity condition (take the first derivative and set it to zero):
  θ − θ⁽ⁱ⁾ = −H⁻¹ ∇S(θ⁽ⁱ⁾)

Hessian
• The matrix of second derivatives (consider using Maple to generate it!): H has (j, k) entry ∂²S/∂θⱼ∂θₖ, j, k = 1, …, p, evaluated at θ⁽ⁱ⁾.

Jacobian and Hessian of S(θ)
• Found by the chain rule:
  ∇S(θ) = −2 Vᵀ(y − η(θ))
  H = 2 VᵀV − 2 [∂²η/∂θ∂θᵀ]·(y − η(θ))
  where V = ∂η/∂θᵀ is the sensitivity matrix we had before, and ∂²η/∂θ∂θᵀ is a 3-dimensional array (tensor) of second derivatives.
• 2VᵀV is often used as an approximation of the Hessian – the "expected value of the Hessian."

Newton-Raphson Approach
• Using the approximate Hessian (which is always positive semidefinite), the change in the parameter estimate is:
  θ − θ⁽ⁱ⁾ = −H⁻¹∇S(θ⁽ⁱ⁾) ≈ (VᵀV)⁻¹Vᵀ(y − η(θ)), where V is evaluated at θ⁽ⁱ⁾.
• This is the Gauss-Newton iteration!
• Issues – computing and updating the Hessian matrix:
  » potentially better progress – information about curvature
  » the Hessian can cease to be positive definite (required in order for a stationary point to be a minimum)
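The chain-rule expressions above can be verified numerically for a one-parameter model. The sketch below (illustrative data and names of my choosing) builds the gradient −2Vᵀ(y − η), the full Hessian 2VᵀV − 2[second-derivative term], and the Gauss-Newton approximation 2VᵀV, then checks the gradient against a finite difference:

```python
import numpy as np

# One-parameter model f(t, theta) = 60 + 70*exp(-theta*t); data are illustrative
t = np.array([4., 12., 25., 41.])
y = np.array([110., 95., 80., 68.])
theta = 0.05

eta = 60 + 70*np.exp(-theta*t)
r = y - eta                                           # residuals
V = (-70*t*np.exp(-theta*t)).reshape(-1, 1)           # d eta / d theta
d2eta = (70*t**2*np.exp(-theta*t)).reshape(-1, 1)     # d^2 eta / d theta^2

grad = -2 * V.T @ r                          # gradient of S(theta)
H_exact = 2 * V.T @ V - 2 * (d2eta.T @ r)    # full Hessian
H_gn = 2 * V.T @ V                           # Gauss-Newton ("expected") Hessian

# Finite-difference check of the gradient
S = lambda th: np.sum((y - (60 + 70*np.exp(-th*t)))**2)
h = 1e-6
fd_grad = (S(theta + h) - S(theta - h)) / (2*h)
print(grad.item(), fd_grad)                  # the two should agree closely
```

Since the residuals here are all of one sign, the curvature term shifts the Hessian noticeably away from 2VᵀV; near a good fit, with small residuals, the two are nearly equal.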
Levenberg-Marquardt Approach
• Improve the conditioning of the inverse by adding a factor – a biased regression solution.
• Levenberg modification:
  δ⁽ⁱ⁺¹⁾ = (V⁽ⁱ⁾ᵀV⁽ⁱ⁾ + λ Iₚ)⁻¹ V⁽ⁱ⁾ᵀ(y − η(θ⁽ⁱ⁾))
  where Iₚ is the p × p identity matrix.
• Marquardt modification:
  δ⁽ⁱ⁺¹⁾ = (V⁽ⁱ⁾ᵀV⁽ⁱ⁾ + λ D)⁻¹ V⁽ⁱ⁾ᵀ(y − η(θ⁽ⁱ⁾))
  where D is a matrix containing the diagonal entries of VᵀV.
• As λ → 0, the step approaches the Gauss-Newton iteration.
• As λ → ∞, the step approaches the direction of steepest descent – the gradient-based optimization technique.

Inference – Joint Confidence Regions
• Approximate confidence regions for parameters and predictions can be obtained using a linearization approach.
• Approximate covariance matrix for the parameter estimates:
  Σ̂_θ̂ = (V̂ᵀV̂)⁻¹ σ²
  where V̂ denotes the Jacobian of the expectation mapping evaluated at the least squares parameter estimates.
• This covariance matrix is asymptotically the true covariance matrix of the parameter estimates as the number of data points becomes infinite.
• 100(1 − α)% joint confidence region for the parameters:
  (θ − θ̂)ᵀ V̂ᵀV̂ (θ − θ̂) ≤ p s² F_{p, n−p, α}
  » compare to the linear regression case.

Inference – Marginal Confidence Intervals
• Confidence intervals on individual parameters:
  θ̂ᵢ ± t_{ν, α/2} s_{θ̂ᵢ}
  where s_{θ̂ᵢ} is the approximate standard error of the parameter estimate – the square root of the i-th diagonal element of the approximate covariance matrix (V̂ᵀV̂)⁻¹ s², with the noise variance estimated as in the linear case.

Inference – Predictions and Confidence Intervals
• Confidence intervals on predictions of existing points in the dataset:
  – reflect propagation of variability from the parameter estimates to the predictions
  – expressions for the nonlinear regression case are based on the linear approximation and a direct extension of the results for linear regression
• First, let's review the linear regression case…
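A minimal sketch of the Marquardt step (function name and test values are mine; λ and D as defined above). At λ = 0 it reproduces the Gauss-Newton increment, and large λ shrinks the step:

```python
import numpy as np

def lm_step(V, resid, lam):
    # Marquardt modification: damp (V^T V) with lam times its own diagonal D
    A = V.T @ V
    D = np.diag(np.diag(A))
    return np.linalg.solve(A + lam * D, V.T @ resid)

# Illustrative 3x2 sensitivity matrix and residual vector
V = np.array([[1., 1.],
              [1., 0.],
              [1., -1.]])
resid = np.array([1., 2., 0.5])

gn = np.linalg.lstsq(V, resid, rcond=None)[0]    # Gauss-Newton increment
print(lm_step(V, resid, 0.0), gn)                # identical at lam = 0
print(np.linalg.norm(lm_step(V, resid, 100.0)))  # large lam shrinks the step
```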
Precision of the Predicted Responses – Linear
• From the linear regression module (Module 1): the predicted response from an estimated model has uncertainty, because it is a function of the parameter estimates, which themselves have uncertainty.
• e.g., solder wave defect model – first response at the point (−1, −1, −1):
  ŷ₁ = β̂₀ + β̂₁(−1) + β̂₂(−1) + β̂₃(−1)
• If the parameter estimates were uncorrelated, the variance of the predicted response would be:
  Var(ŷ₁) = Var(β̂₀) + Var(β̂₁) + Var(β̂₂) + Var(β̂₃)
  (recall the results for the variance of a sum of random variables)
• In general, both the variances and covariances of the parameter estimates must be taken into account. For prediction at the k-th data point, with xₖ = [x_{k1}, x_{k2}, …, x_{kp}]ᵀ:
  Var(ŷₖ) = xₖᵀ(XᵀX)⁻¹xₖ σ² = xₖᵀ Σ̂_β̂ xₖ

Precision of the Predicted Responses – Nonlinear
• Linearize the prediction equation about the least squares estimate:
  ŷₖ = f(xₖ, θ) ≈ f(xₖ, θ̂) + v̂ₖᵀ(θ − θ̂)
  where v̂ₖᵀ = [v̂_{k1}, v̂_{k2}, …, v̂_{kp}] is the k-th row of V̂.
• For prediction at the k-th data point:
  Var(ŷₖ) = v̂ₖᵀ(V̂ᵀV̂)⁻¹v̂ₖ σ² = v̂ₖᵀ Σ̂_θ̂ v̂ₖ

Estimating the Precision of Predicted Responses
• Use an estimate of the inherent noise variance:
  s²_ŷₖ = xₖᵀ(XᵀX)⁻¹xₖ s²   (linear)
  s²_ŷₖ = vₖᵀ(VᵀV)⁻¹vₖ s²   (nonlinear)
• The degrees of freedom for the estimated variance of the predicted response are those of the estimate of the noise variance:
  » replicates
  » external estimate
  » MSE
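The prediction variance is the same quadratic form in both the linear and nonlinear cases. A small sketch (function name is mine), checked on a simple straight-line design where the variance at the mean of the x's should be σ²/n:

```python
import numpy as np

def prediction_variance(V, v_k, sigma2):
    # Var(yhat_k) = v_k^T (V^T V)^{-1} v_k * sigma^2
    return v_k @ np.linalg.solve(V.T @ V, v_k) * sigma2

# Linear check: straight-line model y = b0 + b1*x at x = 0, 1, 2;
# predicting at the mean x = 1 should give sigma^2 / n = 1/3
X = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
print(prediction_variance(X, np.array([1., 1.]), 1.0))   # 1/3
```

For the nonlinear case, pass the sensitivity matrix V̂ and the row v̂ₖ in place of X and xₖ.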
Confidence Limits for Predicted Responses
• Linear and nonlinear cases: follow an approach similar to that for the parameters. The 100(1 − α)% confidence limits for the predicted response at the k-th run are:
  ŷₖ ± t_{ν, α/2} s_ŷₖ
  » the degrees of freedom ν are those of the inherent noise variance estimate.
• If the prediction is for a response at conditions OTHER than one of the experimental runs, the limits are:
  ŷₖ ± t_{ν, α/2} √(s²_ŷₖ + s_e²)

Precision of "Future" Predictions – Explanation
• Suppose we want to predict the response at conditions other than those of the experimental runs – a future run. The value we observe will consist of the deterministic component plus the noise component. In predicting this value, we must consider:
  » uncertainty from our prediction of the deterministic component
  » the noise component
• The variance of this future prediction is Var(ŷ) + σ², where Var(ŷ) is computed using the same expression as for the variance of predicted responses at experimental run conditions. For the linear case, with x containing the specific run conditions:
  Var(ŷ) = xᵀ(XᵀX)⁻¹x σ² = xᵀ Σ̂_β̂ x

Properties of LS Parameter Estimates
• Key point – parameter estimates are random variables:
  » stochastic variation in the data propagates through the estimation calculations
  » parameter estimates therefore have a variability pattern – probability distribution and density functions
• Unbiased: E{β̂} = β
  » the "average" of repeated data collection / estimation sequences will be the true value of the parameter vector

Properties of Parameter Estimates
• Consistent:
  » behaviour as the number of data points tends to infinity
  » with probability 1, β̂ → β as N → ∞
  » the distribution narrows as N becomes large
• Efficient:
  » the variance of the least squares estimates is less than that of other types of parameter estimates
Properties of Parameter Estimates
• Linear regression case – least squares estimates are:
  » unbiased
  » consistent
  » efficient
• Nonlinear regression case – least squares estimates are:
  » asymptotically unbiased – as the number of data points becomes infinite
  » consistent
  » efficient

Maximum Likelihood Estimation
• Concept:
  » Start with the function that describes the likelihood of the data given the parameter values – the probability density function.
  » Now change perspective – assume that the data observed are the most likely, and find the parameter values that make the data the most likely: the likelihood of the parameters given the observed data.
  » The estimates are "maximum likelihood" estimates.
• For Normally distributed data (random shocks), recall that for a given run we have
  Yᵢ = f(xᵢ, θ) + εᵢ,  εᵢ ~ N(0, σ²)
• Probability density function for Yᵢ – the mean is given by f(xᵢ, θ) and the variance is σ²:
  f_{Yᵢ}(y) = (1/√(2πσ²)) exp(−(y − f(xᵢ, θ))² / (2σ²))
• With n observations, given that the responses are independent (since the random shocks are independent), the joint density function for the observations is simply the product of the individual density functions:
  f_{Y₁…Yₙ}(y₁, …, yₙ) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp(−(yᵢ − f(xᵢ, θ))² / (2σ²))
   = (2π)^(−n/2) σ^(−n) exp(−(1/(2σ²)) ∑ᵢ₌₁ⁿ (yᵢ − f(xᵢ, θ))²)
• In shorthand, using vector notation for the observations, and now explicitly acknowledging that we "know", or are given, the parameter values:
  f_Y(y | θ, σ) = (2π)^(−n/2) σ^(−n) exp(−(1/(2σ²)) (y − η(θ))ᵀ(y − η(θ)))
  Note that we have written the sum of squares in vector notation as well, using the expectation mapping. Note also that the random noise standard deviation σ is itself a parameter.
Likelihood Function
• Now, we have a set of observations, which we will assume are the most likely, and we define the likelihood function:
  l(θ, σ | y) = (2π)^(−n/2) σ^(−n) exp(−(1/(2σ²)) (y − η(θ))ᵀ(y − η(θ)))
• We can also work with the log-likelihood function, which extracts the important part of the expression from the exponential:
  L(θ, σ | y) = −(n/2) ln(2π) − n ln(σ) − (1/(2σ²)) ∑ᵢ₌₁ⁿ (yᵢ − f(xᵢ, θ))²

Maximum Likelihood Parameter Estimates
• Formal statement as an optimization problem:
  max_{θ,σ} l(θ, σ | y) = max_{θ,σ} (2π)^(−n/2) σ^(−n) exp(−(1/(2σ²)) (y − η(θ))ᵀ(y − η(θ)))

Maximum Likelihood Estimation
• Examine the likelihood function: regardless of the noise standard deviation, the likelihood function will be maximized by those parameter values minimizing the sum of squares between the observed data and the model predictions.
  » These are the parameter values that make the observed data the "most likely."
• In terms of the residual sum of squares function, we have the likelihood function:
  l(θ, σ | y) = (2π)^(−n/2) σ^(−n) exp(−S(θ)/(2σ²))
  and the log-likelihood function:
  L(θ, σ | y) = −(n/2) ln(2π) − n ln(σ) − S(θ)/(2σ²)
Maximum Likelihood Estimation
• We can obtain the optimal parameter estimates separately from the noise standard deviation, given the form of the likelihood function:
  » minimizing the sum of squares of the residuals does not involve the noise standard deviation.
• For Normally distributed data, the maximum likelihood parameter estimates are the same as the least squares estimates for nonlinear regression.
• The maximum likelihood estimate for the noise variance is the mean squared error
  s² = S(θ̂)/n
  » obtained by taking the derivative of the log-likelihood with respect to the variance, and then solving.

Maximum Likelihood Estimation – Further Comments
• We could develop the likelihood function starting with the distribution of the random shocks ε, producing the same expression.
• If the random shocks were independent but had a different distribution, then the observations would also have a different distribution, with the expectation function defining the means of this distribution:
  f_{Y₁…Yₙ}(y₁, …, yₙ | θ) = ∏ᵢ₌₁ⁿ g(yᵢ; xᵢ, θ)
  where g is the individual density function. We could then develop a likelihood function from this density function.

Inference Using Likelihood Functions
• Generate likelihood regions – contours of the likelihood function.
  » The choice of contour value comes from examining the distribution.
• Unlike the least squares approximate inference regions, which were developed using linearizations, the likelihood regions need not be elliptical or ellipsoidal:
  » they can have banana shapes, or can be open contours.
• Likelihood regions – first, examine the likelihood function:
  l(θ, σ | y) = (2π)^(−n/2) σ^(−n) exp(−S(θ)/(2σ²))
  – The dependence of the likelihood function on the parameters θ is through the sum of squares function S(θ).
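The claim that the likelihood is maximized at σ² = S(θ̂)/n can be checked by profiling the log-likelihood over σ with S held at its minimized value (the numbers below are illustrative, of my choosing):

```python
import numpy as np

def log_likelihood(sigma, S, n):
    # Normal log-likelihood with S(theta) held at its minimized value
    return -0.5*n*np.log(2*np.pi) - n*np.log(sigma) - S/(2*sigma**2)

S_hat, n = 12.5, 25                     # illustrative values: S/n = 0.5
sigmas = np.linspace(0.1, 3.0, 2000)
best = sigmas[np.argmax(log_likelihood(sigmas, S_hat, n))]
print(best**2, S_hat / n)               # both ~ 0.5
```

Setting dL/dσ = −n/σ + S/σ³ = 0 gives the same answer analytically: σ² = S/n.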
Likelihood Regions
• Focusing on S(θ), we have
  [(S(θ) − S(θ̂))/p] / [S(θ̂)/(n − p)] ~ F_{p, n−p}
  – Note that the denominator is the MSE – the residual variance.
• This is an asymptotic result in the nonlinear case, and an exact result for the linear regression case.
• We can generate likelihood regions as the values of θ such that
  S(θ) ≤ S(θ̂)[1 + (p/(n − p)) F_{p, n−p, α}]

Likelihood Regions – Further Comments
• The likelihood regions are essentially sum of squares contours – specifically for the case where the data are Normally distributed.
• In the nonlinear regression case,
  S(θ) ≈ S(θ̂) + (θ − θ̂)ᵀ V̂ᵀV̂ (θ − θ̂)
  and so the likelihood contours are approximated by the linearization-based approximate joint confidence region from least squares theory:
  (θ − θ̂)ᵀ V̂ᵀV̂ (θ − θ̂) ≤ p s² F_{p, n−p, α}
• Using S(θ) ≤ S(θ̂)[1 + (p/(n − p)) F_{p, n−p, α}] is an approximate approach to the exact likelihood region:
  – The approximation is in the sampling distribution argument used to derive the expression in terms of the F distribution.
  – This is asymptotically (as the number of data points becomes infinite) an exact likelihood region.
• In general, an exact likelihood region would be given by
  S(θ) ≤ c S(θ̂)
  for some appropriately chosen constant c. Note that in the approximation,
  c = 1 + (p/(n − p)) F_{p, n−p, α}
• In general, the difficulty in using S(θ) ≤ c S(θ̂) lies in finding a value of c that gives the correct coverage probability:
  – The coverage probability is the probability that the region contains the true parameter values.
  – The approximate result using the F distribution is an attempt to obtain such a coverage probability.
  – The likelihood contour is reported to give better coverage probabilities for smaller data sets, and to be less affected by nonlinearity (Donaldson and Schnabel, 1987).
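The contour level can be computed directly; a small sketch using `scipy.stats.f` for the F quantile (function name and example numbers are mine):

```python
from scipy.stats import f

def likelihood_region_threshold(S_hat, p, n, alpha=0.05):
    # theta lies inside the approximate 100(1-alpha)% likelihood region when
    #   S(theta) <= S_hat * (1 + p/(n-p) * F_{p, n-p, alpha})
    c = 1.0 + (p / (n - p)) * f.ppf(1.0 - alpha, p, n - p)
    return c * S_hat

# e.g. p = 2 parameters, n = 12 runs, S(theta_hat) = 10 (illustrative numbers)
print(likelihood_region_threshold(10.0, 2, 12))
```

In practice one evaluates S(θ) on a grid (or along a contour-tracing path) and marks the set of θ with S(θ) below this threshold.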
Likelihood Regions – Examples
• Puromycin – from Bates and Watts (untreated cases)
  – Red is the 95% likelihood region; blue is the 95% confidence region (linear approximation).
  – Note some difference in shape, orientation and size, but not too pronounced.
  – The square indicates the least squares estimates.
  – Maple worksheet available on the course web.
• BOD – from Bates and Watts
  – Red is the 95% likelihood region; blue is the 95% confidence region (linear approximation).
  – Note the significant difference in shapes, and that the confidence ellipse includes the value of 0 for θ₂.
  – The square indicates the least squares estimates.
  – Maple worksheet available on the course web.

Bayesian Estimation
• Premise:
  – The distribution of the observations is characterized by parameters, which in turn have some distribution of their own.
  – Concept of prior knowledge of the values that the parameters might assume.
• Model: Y = η(θ) + ε
• Noise characteristics: ε ~ i.i.d. N(0, σ²)
• Approach – use Bayes' theorem.

Conditional Expectation
• Recall conditional probability:
  P(X ∩ Y) = P(X | Y) P(Y)
  » the probability of X given Y, where X and Y are events.
• For continuous random variables, we have a conditional probability density function expressed in terms of the joint and marginal density functions:
  f_{X|Y}(x | y) = f_{XY}(x, y) / f_Y(y)
• Using this, we can also define the conditional expectation of X given Y:
  E{X | Y} = ∫ x f_{X|Y}(x | y) dx

Bayes' Theorem
• Useful for situations in which we have incomplete probability knowledge; forms the basis for statistical estimation.
• Suppose we have two events, A and B. From conditional probability:
  P(A ∩ B) = P(A | B) P(B)
  P(B ∩ A) = P(B | A) P(A)
  so
  P(A | B) = P(B | A) P(A) / P(B),  for P(B) > 0
Bayesian Estimation
• Premise – the parameters can have their own distribution – the prior distribution f(θ, σ).
• The posterior distribution of the parameters can be related to the prior distribution of the parameters and the likelihood function:
  f(θ, σ | y) = f(θ, σ, y)/f(y) = f(y | θ, σ) f(θ, σ) / f(y) ∝ f(y | θ, σ) f(θ, σ)
  » f(θ, σ | y) is the posterior distribution – of the parameters given the data.
• The noise standard deviation σ is a nuisance parameter, and we can focus instead on the model parameters:
  f(θ | y) ∝ f(y | θ) f(θ)
• How are the posterior distributions with and without σ related? By marginalization:
  f(θ | y) = ∫ f(θ, σ | y) dσ
• Summary:
  – Bayes' theorem
  – posterior density function in terms of the prior density function
  – equivalence for Normal data with a uniform prior – least squares / maximum likelihood estimates
  – inference – posterior density regions

Diagnostics for Nonlinear Regression
• Similar to the linear case.
• Qualitative – residual plots:
  – Residuals vs.
    » factors in the model
    » sequence (observation) number
    » factors not in the model (covariates)
    » predicted responses
  – Things to look for:
    » trend remaining
    » non-constant variance
    » meandering in sequence number – serial correlation
• Qualitative – plots of observed and predicted responses:
  – predicted vs. observed – slope of 1
  – predicted and observed – as a function of the independent variable(s)
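A minimal grid-based sketch of the posterior f(θ | y) ∝ f(y | θ) f(θ) for the one-parameter cooling model, with σ assumed known and a flat prior, in which case the posterior mode coincides with the least squares estimate (synthetic data; all names are mine):

```python
import numpy as np

# Grid approximation of the posterior f(theta | y) for the cooling model,
# sigma assumed known and prior flat; all data below are synthetic.
t = np.array([4., 12., 25., 41.])
theta_true, sigma = 0.05, 0.5
rng = np.random.default_rng(1)
y = 60 + 70*np.exp(-theta_true*t) + rng.normal(0.0, sigma, t.size)

thetas = np.linspace(0.01, 0.12, 1101)
S = np.array([np.sum((y - (60 + 70*np.exp(-th*t)))**2) for th in thetas])
log_post = -S / (2*sigma**2)              # log-likelihood + constant (flat prior)
post = np.exp(log_post - log_post.max())  # subtract max for numerical stability
post /= post.sum() * (thetas[1] - thetas[0])   # normalize to a density
mode = thetas[np.argmax(post)]
print(mode)    # posterior mode, close to the least squares estimate
```

With an informative (non-flat) prior, one would add its log density to `log_post`; the grid approach itself is unchanged.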
Diagnostics for Nonlinear Regression
• Quantitative diagnostics – ratio tests:
  » MSR/MSE – as in the linear case – a coarse measure of significant trend being modeled.
  » Lack of fit test – if replicates are present: as in the linear case, compute the lack of fit sum of squares and the error sum of squares, and compare their ratio.
  » R-squared – a coarse measure of significant trend: the squared correlation of the observed and predicted values. The adjusted R-squared additionally accounts for the number of parameters used.
• Quantitative diagnostics – parameter confidence intervals:
  » Examine the marginal intervals for the parameters (based on the linear approximations); hypothesis tests can also be used.
  » Consider dropping parameters that aren't statistically significant.
  » Issue in this case – parameters are more likely to be involved in complex expressions with factors and other parameters (e.g., the Arrhenius reaction rate expression).
  » If possible, examine joint confidence regions, likelihood regions, or HPD regions. One can also test whether a set of parameter values lies in a particular region.
• Quantitative diagnostics – parameter estimate correlation matrix:
  » Examine the correlation matrix of the parameter estimates (based on the linear approximation): compute the covariance matrix, then normalize using the pairs of standard deviations.
  » Note significant correlations and keep these in mind when retaining/deleting parameters using marginal significance tests.
  » Significant correlation between some parameter estimates may indicate over-parameterization relative to the data collected – consider dropping some of the parameters whose estimates are highly correlated.
• Further discussion – Chapter 3 of Bates and Watts (1988); Chapter 5 of Seber and Wild (1988).
Practical Considerations
• Convergence
  – "tuning" of the estimation algorithm – e.g., step-size factors
  – knowledge of the sum-of-squares (or likelihood, or posterior density) surface – are there local minima?
    » Consider plotting the surface
  – Reparameterization
• Ensuring physically realistic parameter estimates
  – Common problem – parameters should be positive
  – Solutions
    » Constrained optimization approach to enforce non-negativity of the parameters
    » Reparameterization – for example
      θ_positive = exp(φ)  or  θ_positive = 10^φ
      θ = 1/(1 + e^(−φ))  – bounded between 0 and 1

Practical Considerations
• Correlation between parameter estimates
  – Reduce by reparameterization
  – Exponential example:
    θ1 exp(−θ2 x) = θ1 exp(−θ2 (x − x0 + x0))
                  = θ1 exp(−θ2 x0) exp(−θ2 (x − x0))
                  = φ1 exp(−θ2 (x − x0)),  with φ1 = θ1 exp(−θ2 x0)
    centring x about x0 reduces the correlation between the estimates

Practical Considerations
• Particular example – the Arrhenius rate expression:
    k = k0 exp(−E/(R T))
      = k0 exp(−E/(R Tref)) exp(−(E/R)(1/T − 1/Tref))
      = kref exp(−(E/R)(1/T − 1/Tref)),  with kref = k0 exp(−E/(R Tref))
  – kref is effectively the reaction rate constant at a reference temperature
  – Reduces correlation between the parameter estimates and improves the conditioning of the estimation problem

Practical Considerations
• Scaling – of parameters and responses
• Choices
  – Scale by nominal values
    » Nominal values – design centre point, typical value over the range, or average value
  – Scale by standard errors
    » Parameters – estimate of the standard deviation of the parameter estimate
    » Responses – standard deviation of the observations – the noise standard deviation
  – Combinations – by nominal value / standard error
• Scaling can improve the conditioning of the estimation problem (e.g., scale the sensitivity matrix V), and facilitates comparison of terms on similar (dimensionless) bases
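The conditioning benefit of the Arrhenius reparameterization can be seen by comparing how collinear the two sensitivity columns are in each parameterization. A minimal sketch, with illustrative values E = 50 kJ/mol and k0 = kref = 1 (not from the notes), measuring collinearity by the cosine of the angle between the columns:

```python
import math

R = 8.314       # gas constant, J/(mol K)
E = 50000.0     # activation energy, J/mol (illustrative value)
Tref = 330.0    # reference temperature, K
temps = [300.0, 315.0, 330.0, 345.0, 360.0]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

# Original parameterization k = k0*exp(-E/(R*T)), with k0 = 1:
dk_dk0 = [math.exp(-E / (R * T)) for T in temps]
dk_dE  = [-(1.0 / (R * T)) * math.exp(-E / (R * T)) for T in temps]

# Reparameterized form k = kref*exp(-(E/R)*(1/T - 1/Tref)), with kref = 1:
dk_dkref = [math.exp(-(E / R) * (1 / T - 1 / Tref)) for T in temps]
dk_dE2   = [-(1.0 / R) * (1 / T - 1 / Tref)
            * math.exp(-(E / R) * (1 / T - 1 / Tref)) for T in temps]

c_orig = cosine(dk_dk0, dk_dE)     # near -1: columns almost collinear
c_re   = cosine(dk_dkref, dk_dE2)  # smaller in magnitude after reparameterization
```

The factor (1/T − 1/Tref) changes sign across the reference temperature, which is what breaks the near-collinearity of the two columns.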
Practical Considerations
• Initial guesses
  – From prior knowledge
  – From prior results
  – By simplifying the model equations
  – By exploiting conditionally linear parameters – fix these, estimate the remaining parameters

Dealing with Heteroscedasticity
• Problem it poses – degrades the precision of the parameter estimates
• Weighted least squares estimation
• Variance-stabilizing transformations – e.g., Box-Cox transformations

Estimating Parameters in Differential Equation Models
• The model is now described by a differential equation:
    dy/dt = f(y, u, t; θ),  y(t0) = y0
• Referred to as "compartment models" in the biosciences
• Issues
  – Estimation – what is the effective expectation function here?
    » The integral curve or flow (solution of the differential equation)
  – Initial conditions – known? unknown and estimated? fixed (conditional estimation)?
  – Performing the Gauss-Newton iteration
    » or another numerical approach
  – Solving the differential equation

Estimating Parameters in Differential Equation Models
What is the effective expectation function here?
  – Differential equation model: dy/dt = f(y, u, t; θ), y(t0) = y0
  – y – response; u – independent variables (factors); t becomes a factor as well
  – The expectation function is the solution of the differential equation, evaluated at the times at which observations are taken:
    η_i(θ) = y(t_i, u_i; θ, y0)
  – Note the implicit dependence on the initial conditions, which may be assumed or estimated
  – Often this is a conceptual model rather than an analytical solution – in practice the expectation function is the numerical solution at specific times, computed by a subroutine
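The "expectation function as a subroutine" idea can be sketched concretely. A minimal version, assuming a Newton's-law cooling model matching the Rumford example (dy/dt = −θ(y − 60), y(0) = 130, with analytic solution 60 + 70·e^(−θt)) and a fixed-step Runge-Kutta integrator in place of a production ODE solver:

```python
import math

def rhs(y, theta):
    # Newton's law of cooling toward an ambient temperature of 60 F
    return -theta * (y - 60.0)

def eta(theta, t_obs, y0=130.0, h=0.01):
    """Expectation function: numerical solution at the observation times (RK4)."""
    t, y, out = 0.0, y0, []
    for t_target in t_obs:            # t_obs assumed increasing
        while t < t_target - 1e-12:
            step = min(h, t_target - t)
            k1 = rhs(y, theta)
            k2 = rhs(y + 0.5 * step * k1, theta)
            k3 = rhs(y + 0.5 * step * k2, theta)
            k4 = rhs(y + step * k3, theta)
            y += (step / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
            t += step
        out.append(y)
    return out

t_obs = [4.0, 12.0, 41.0]
numeric = eta(0.05, t_obs)
analytic = [60.0 + 70.0 * math.exp(-0.05 * t) for t in t_obs]
```

For this model the analytic solution is available, so the numerical expectation function can be checked against it; for most compartment models only the numerical route exists.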
Estimating Parameters in Differential Equation Models
• Expectation mapping:
    η(θ) = [η1(θ), η2(θ), …, ηn(θ)]ᵀ = [y(t1, u1; θ, y0), y(t2, u2; θ, y0), …, y(tn, un; θ, y0)]ᵀ
• Random noise – assumed to be additive on the observations:
    Y_i = η_i(θ) + ε_i,  i = 1, …, n,  i.e.,  Y = η(θ) + ε

Estimating Parameters in Differential Equation Models
Estimation approaches
  – Least squares (Gauss-Newton / Newton-Raphson iteration), maximum likelihood, Bayesian
  – All require sensitivity information – the sensitivity matrix V, whose ith row is the gradient of the response at the ith observation with respect to the parameters:
    V(θ) = [∂y(t1, u1; θ, y0)/∂θᵀ; ∂y(t2, u2; θ, y0)/∂θᵀ; …; ∂y(tn, un; θ, y0)/∂θᵀ]
How can we get sensitivity information without having an explicit solution of the differential equation model?

Estimating Parameters in Differential Equation Models
Sensitivity equations
  – Interchanging the order of differentiation yields differential equations for the sensitivities – the sensitivity equations:
    d/dt (∂y/∂θ) = (∂f/∂y)(∂y/∂θ) + ∂f/∂θ
  – Note that the initial condition for the response may also be a function of the parameters – e.g., if we assume the process is initially at steady state, there is parametric dependence through the steady-state form of the model
  – These differential equations are solved to obtain the parameter sensitivities at the necessary time points t1, …, tn

Estimating Parameters in Differential Equation Models
Sensitivity equations
  – The sensitivity equations are coupled with the original model differential equations – for the single-differential-equation (single-response) case, we have p + 1 simultaneous differential equations, where p is the number of parameters:
    dy/dt = f(y, u, t; θ)
    d/dt (∂y/∂θ_j) = (∂f/∂y)(∂y/∂θ_j) + ∂f/∂θ_j,  j = 1, …, p
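The coupled system above can be sketched for the simplest possible case. Assuming the model dy/dt = −θy with y(0) = y0 (one parameter, so p + 1 = 2 equations), the sensitivity s = ∂y/∂θ obeys ds/dt = (∂f/∂y)s + ∂f/∂θ = −θs − y with s(0) = 0, and the result can be checked against the analytic sensitivity −t·y0·e^(−θt):

```python
import math

def aug_rhs(state, theta):
    # Augmented system: model equation plus its sensitivity equation
    y, s = state
    return (-theta * y, -theta * s - y)

def integrate(theta, y0, t_end, h=0.001):
    """RK4 integration of the augmented (model + sensitivity) system."""
    y, s, t = y0, 0.0, 0.0        # s(0) = 0: y0 assumed independent of theta
    while t < t_end - 1e-12:
        step = min(h, t_end - t)
        k1 = aug_rhs((y, s), theta)
        k2 = aug_rhs((y + 0.5 * step * k1[0], s + 0.5 * step * k1[1]), theta)
        k3 = aug_rhs((y + 0.5 * step * k2[0], s + 0.5 * step * k2[1]), theta)
        k4 = aug_rhs((y + step * k3[0], s + step * k3[1]), theta)
        y += (step / 6.0) * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        s += (step / 6.0) * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
        t += step
    return y, s

theta, y0, t_end = 0.5, 2.0, 3.0
y_num, s_num = integrate(theta, y0, t_end)
y_exact = y0 * math.exp(-theta * t_end)            # analytic solution
s_exact = -t_end * y0 * math.exp(-theta * t_end)   # analytic sensitivity dy/dtheta
```

Note the sensitivity equation is driven by the state y, which is why the two equations must be integrated together (or sequentially, as discussed next).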
Estimating Parameters in Differential Equation Models
Variations on single-response differential equation models
  – Single-response models need not be restricted to a single differential equation
  – We really have a single measured output variable and multiple factors
    » Control terminology – a multi-input single-output (MISO) system
  – Differential equation (state-space) model:
    dx/dt = f(x, u, t; θ),  x(t0) = x0
    y = h(x, u, t; θ)
  – Sensitivity equations:
    d/dt (∂x/∂θ_i) = (∂f/∂x)(∂x/∂θ_i) + ∂f/∂θ_i,  ∂x(t0)/∂θ_i = ∂x0/∂θ_i,  i = 1, …, p
    ∂y/∂θ_i = (∂h/∂x)(∂x/∂θ_i) + ∂h/∂θ_i

Estimating Parameters in Differential Equation Models
Options for solving the sensitivity equations
  – Solve the model differential equations and the sensitivity equations simultaneously
    » Potentially large number of simultaneous differential equations – ns(1 + p) equations for ns states and p parameters
    » Numerical conditioning issues
    » "Direct" method
  – Solve the model equations and sensitivity equations sequentially
    » Integrate the model equations forward to the next time step
    » Integrate the sensitivity equations forward, using the updated values of the states
    » "Decoupled direct" method

Interpreting Sensitivity Responses
Example – first-order linear differential equation with a step input:
    θ2 dy/dt + y = θ1 u
[Figure: step response and the corresponding parameter sensitivities]
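For this first-order example the sensitivities are available in closed form, which is what the figure displays. A sketch, assuming a unit step u = 1 and y(0) = 0 so that y(t) = θ1(1 − e^(−t/θ2)), with the analytic sensitivities cross-checked against central finite differences:

```python
import math

def y_step(t, th1, th2):
    # Unit-step response of theta2*dy/dt + y = theta1*u with y(0) = 0
    return th1 * (1.0 - math.exp(-t / th2))

def sens_gain(t, th1, th2):
    # dy/dtheta1: same shape as the step response itself, persists at steady state
    return 1.0 - math.exp(-t / th2)

def sens_tau(t, th1, th2):
    # dy/dtheta2: a transient that dies out as the response reaches steady state
    return -th1 * t / th2**2 * math.exp(-t / th2)

# Cross-check the analytic sensitivities with central finite differences
th1, th2, t = 2.0, 5.0, 3.0
eps = 1e-6
fd_gain = (y_step(t, th1 + eps, th2) - y_step(t, th1 - eps, th2)) / (2 * eps)
fd_tau  = (y_step(t, th1, th2 + eps) - y_step(t, th1, th2 - eps)) / (2 * eps)
```

The contrast between the two is the point of the figure: data near steady state inform the gain θ1, while only the transient portion of the response informs the time constant θ2.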
Estimating Parameters in Differential Equation Models
• When multiple responses are measured (e.g., temperature and the concentrations of different species), the resulting estimation problem is a multi-response estimation problem
• Other issues
  – Identifiability of the parameters
  – How "time" is treated – as an independent variable (as in the presentation above), or treating the responses at different times as different responses
  – Obtaining initial parameter estimates
    » See, for example, the discussion in Bates and Watts (1988) and Seber and Wild (1989)
  – Serial correlation in the random noise
    » Particularly if the random shocks enter the differential equation itself, rather than being additive to the measured responses

Multi-Response Estimation
Multi-response estimation refers to the case in which observations are taken on more than one response variable
Examples
  – Measuring several different variables – concentration, temperature, yield
  – Measuring a functional quantity at a number of different index values – examples:
    » molecular weight distribution – differential weight fraction measured at a number of different chain lengths
    » particle size distribution – differential weight fraction measured at a number of different particle size bins
    » time response – treating the response at different times as individual responses
    » spatial temperature distribution – treating the temperature at different spatial locations as individual responses

Multi-Response Estimation
Problem formulation
  – Responses: n runs, m responses
    Y = [Y1 Y2 … Ym] = [y_ij] – an n × m matrix, with y_ij the observation of the jth response at the ith run conditions
  – Model equations
    » m model equations – one for each response – evaluated at the n run conditions
    » Model for the jth response evaluated at the ith run conditions:
      H = [h_ij],  h_ij = f_j(x_i, θ)
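The Y and H matrices can be made concrete with a small sketch. The two-response model below (f1 = θx, f2 = e^(−θx)) and its run conditions are hypothetical, chosen only to show the n × m layout and the residual matrix Z = Y − H:

```python
import math

def model_row(x, theta):
    # Hypothetical two-response model: f1(x,theta) = theta*x, f2(x,theta) = exp(-theta*x)
    return [theta * x, math.exp(-theta * x)]

x_runs = [0.5, 1.0, 1.5, 2.0]   # n = 4 run conditions
theta_true = 0.8

# n x m "observations" (noise-free here, purely for illustration)
Y = [model_row(x, theta_true) for x in x_runs]

# H evaluated at a trial parameter value, and the residual matrix Z = Y - H
theta_trial = 0.7
H = [model_row(x, theta_trial) for x in x_runs]
Z = [[Y[i][j] - H[i][j] for j in range(2)] for i in range(4)]
```

Each column of Z collects the residuals for one response across all runs, which is the object whose correlation structure is discussed next.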
Multi-Response Estimation
• Random noise
  – There is a random noise term for each observation of each response – denote the random noise in the jth response observed at the ith run conditions by Z_ij
  – This gives a matrix of random noise elements: Z = [Z_ij], an n × m matrix
  – Issue – what is the correlation structure of the random noise? Between-run correlation? Within-run correlation?

Multi-Response Estimation
Covariance structure of the random noise – possible structures
  – No covariance between the random noise components – all components independent and identically distributed
    » Can use the least squares solution in this instance
  – Within-run covariance – between responses – that is the same at each run condition
    » The responses have a certain inherent covariance structure, described by a covariance matrix
    » Determinant criterion for estimation – minimize det(Z(θ)ᵀZ(θ)), where Z(θ) = Y − H(θ) is the residual matrix
    » Alternative – generalized least squares – stack the observations
  – Between-run covariance
  – Complete covariance – between runs and across responses
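The determinant criterion can be sketched for a small case. Reusing a hypothetical two-response model (f1 = θx, f2 = e^(−θx)) with noise-free "observations", the criterion det(ZᵀZ) is exactly zero at the true parameter value and positive elsewhere:

```python
import math

def det_criterion(theta, theta_true=0.8, x_runs=(0.5, 1.0, 1.5, 2.0)):
    # Residual matrix Z(theta) = Y - H(theta) for the hypothetical model
    Z = [[theta_true * x - theta * x,
          math.exp(-theta_true * x) - math.exp(-theta * x)] for x in x_runs]
    # m x m cross-product matrix Z^T Z (m = 2), then its determinant
    ztz = [[sum(row[i] * row[j] for row in Z) for j in range(2)]
           for i in range(2)]
    return ztz[0][0] * ztz[1][1] - ztz[0][1] * ztz[1][0]

d_true = det_criterion(0.8)   # residuals vanish, so the criterion is zero
d_off  = det_criterion(0.6)   # positive away from the true value
```

In practice the criterion is minimized numerically over θ; unlike ordinary least squares, it needs no prior knowledge of the within-run covariance matrix.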