Chapter 10: Estimation

10.1 Introduction

Estimation, as considered here, involves a probabilistic experiment with two random vectors (rv's) $X$ and $Y$. The experiment is performed, but only the resulting sample value $y$ of the random vector $Y$ is observed. The observer then "estimates" the sample value $x$ of $X$ from the observation $y$. For simplicity, we assume that the random vectors have continuous distribution functions with finite probability densities and conditional densities. We denote the estimate of the sample value $x$, as a function of the observed sample value $y$, by $\hat{x}(y)$. For a given observation $y$, one could, for example, choose $\hat{x}(y)$ to be the conditional mean, the conditional median, or the conditional mode of the rv $X$ conditional on $Y = y$.

Estimation problems occur in an amazing variety of situations, often referred to as measurement problems or recovery problems. For example, in communication systems, the timing and the phase of the transmitted signals must be recovered at the receiver. Often it is necessary to measure the channel, and, for analog data, the receiver must estimate the transmitted waveform at finely spaced sample times. In control systems, the state of the system must be estimated in order to generate appropriate control signals. In statistics, one tries to estimate parameters of some probabilistic system model from trial sample values. In any experimental science, one is always concerned with measuring quantities in the presence of noise and experimental error.

The problem of estimation is very similar to that of detection. With detection, we must decide between a discrete set of alternatives on the basis of some observation, whereas here we make a selection from a vector continuum of choices. Although this does not appear to be a very fundamental difference, it leads to a surprising set of differences in approach. In many typical detection problems, the cost of different kinds of errors is secondary and we are concerned primarily with maximizing the probability of a correct choice. In typical estimation problems, with a continuum of alternatives, the probability of selecting the exact correct value is zero, and we are concerned with making the error small according to some given criterion.

A fundamental approach to estimation is to use a cost function $C(x, \hat{x})$ to quantify the cost associated with an estimate $\hat{x}$ when the true sample value is $x$. This cost function is analogous to the cost $C_{\ell k}$, defined in Chapter 8, of making decision $\ell$ when $k$ is the correct hypothesis. The minimum-cost criterion, or Bayes criterion, for estimation is to choose $\hat{x}(y)$, for the observed $y$, to minimize the expected cost conditional on $Y = y$. Specifically, given the observation $y$, choose $\hat{x}(y)$ to minimize
$$\int C[x, \hat{x}(y)]\, f_{X|Y}(x \mid y)\, dx = \mathrm{E}\big[C[X, \hat{x}(y)] \mid Y = y\big].$$
That is, for each $y$,
$$\hat{x}(y) = \arg\min_{\hat{x}} \int C[x, \hat{x}]\, f_{X|Y}(x \mid y)\, dx = \arg\min_{\hat{x}} \mathrm{E}\big[C[X, \hat{x}] \mid Y = y\big]. \tag{10.1}$$
The notation $\arg\min_u f(u)$ means the value of $u$ that minimizes $f(u)$ (if the minimum is not unique, any minimizing $u$ can be chosen). This choice of estimate minimizes the expected cost for each observation $y$. Thus it also minimizes the expected cost averaged over $Y$, i.e., it minimizes
$$\mathrm{E}\big[C[X, \hat{X}(Y)]\big] = \int \mathrm{E}\big[C[X, \hat{X}(Y)] \mid Y = y\big]\, f_Y(y)\, dy$$
over $\hat{X}(Y)$. That is, for an estimation rule that maps each sample value $y$ of $Y$ into an estimate $\hat{x}(y)$, $\hat{X}(Y)$ maps each sample point $\omega \in \Omega$ into $\hat{x}(y)$ where $y = Y(\omega)$.
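As a concrete illustration of (10.1), the following Python sketch (not part of the text; the conditional density, grid, and sample costs are invented for illustration) discretizes an assumed conditional density $f_{X|Y}(x \mid y)$ for one fixed observation and minimizes the conditional expected cost by brute force. For the squared cost it recovers the conditional mean, and for the absolute-value cost it recovers the conditional median, anticipating the discussion of cost functions in Sections 10.1.1 and 10.1.2.

```python
# Minimal numerical sketch of the Bayes criterion in (10.1): for a fixed
# observation y, discretize an assumed conditional density f_{X|Y}(x|y) and
# search for the estimate x-hat that minimizes the conditional expected cost.
# The mixture density chosen here is purely illustrative.
import numpy as np

x = np.linspace(-5.0, 10.0, 4001)                     # grid of candidate x values
dx = x[1] - x[0]
f = 0.7 * np.exp(-0.5 * (x - 0.0)**2) + 0.3 * np.exp(-0.5 * ((x - 4.0) / 2.0)**2)
f /= np.sum(f) * dx                                   # normalize to a density on the grid

def bayes_estimate(cost):
    """Grid search for arg min over x-hat of E[C(X, x-hat) | Y=y], cf. (10.1)."""
    expected_cost = [np.sum(cost(x, xh) * f) * dx for xh in x]
    return x[int(np.argmin(expected_cost))]

xhat_sq  = bayes_estimate(lambda xv, xh: (xv - xh)**2)     # squared cost
xhat_abs = bayes_estimate(lambda xv, xh: np.abs(xv - xh))  # absolute-value cost

cond_mean   = np.sum(x * f) * dx
cond_median = x[np.searchsorted(np.cumsum(f) * dx, 0.5)]
print(f"squared cost  : {xhat_sq:.3f}  vs conditional mean   {cond_mean:.3f}")
print(f"absolute cost : {xhat_abs:.3f}  vs conditional median {cond_median:.3f}")
```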
10.1.1 The squared cost function

In practice, the most important cost function is the squared cost function,
$$C[x, \hat{x}] \triangleq \sum_i (x_i - \hat{x}_i)^2. \tag{10.2}$$
The estimate that minimizes $\mathrm{E}\big[C[X, \hat{x}(Y)]\big]$ for the squared cost function is called the minimum mean square error (MMSE) estimate. It is also often called the Bayes least squares estimate. In order to minimize $\mathrm{E}\big[\sum_i (\hat{x}_i(y) - X_i)^2 \mid Y = y\big]$ over $\hat{x}(y)$, it is sufficient to choose $\hat{x}_i(y)$, for each $i$, to minimize $\mathrm{E}\big[(\hat{x}_i(y) - X_i)^2 \mid Y = y\big]$. This is simply the second moment of $X_i$ around $\hat{x}_i(y)$, conditioned on $Y = y$, and it is minimized by choosing $\hat{x}_i(y)$ to be the mean of $X_i$ conditional on $Y = y$. Simple though this result is, it is a central result of estimation theory, and we state it as a theorem.

Theorem 10.1.1. The MMSE estimate $\hat{x}(y)$, as a function of the observation $Y = y$, is given by
$$\hat{x}(y) = \mathrm{E}[X \mid Y = y] = \int x\, f_{X|Y}(x \mid y)\, dx. \tag{10.3}$$

We can also view the MMSE estimate as a rv $\hat{X}(Y)$ as described above. Define the estimation error $\xi$ as $\hat{X}(Y) - X$. Since $\hat{x}(y)$ is the conditional mean of $X$ conditional on $Y = y$, we see that $\mathrm{E}[\xi \mid Y = y] = 0$ for all $y$, and thus$^2$ $\mathrm{E}[\xi] = 0$. The covariance matrix of $\xi$ is then given by
$$K_\xi = \mathrm{E}[\xi \xi^T] = \mathrm{E}\Big[\big(\hat{X}(Y) - X\big)\big(\hat{X}(Y) - X\big)^T\Big]. \tag{10.4}$$
This can be simplified by recalling that $\mathrm{E}[\xi \mid Y = y] = 0$ for all $y$. Thus if $g(y)$ is any vector-valued function of the same dimension as $x$ and $\xi$, then $\mathrm{E}[\xi \mid Y = y]\, g^T(y) = \mathrm{E}[\xi\, g^T(Y) \mid Y = y] = 0$. Averaging over $y$, we get the important relation that the MMSE estimation error and any such $g(Y)$ satisfy
$$\mathrm{E}[\xi\, g^T(Y)] = 0. \tag{10.5}$$
In Section 10.5, we will interpret this equation as an orthogonality principle. For now, since $g(y)$ is an arbitrary function (of the right dimension), we can replace it with $\hat{x}(y)$, getting $\mathrm{E}\big[\xi\, \hat{X}^T(Y)\big] = 0$. Substituting this into (10.4), we finally get a useful simplified expression for the covariance of the estimation error for MMSE estimation,
$$K_\xi = -\mathrm{E}[\xi X^T]. \tag{10.6}$$

10.1.2 Other cost functions

Subsequent sections of this chapter focus primarily on the squared cost function. First, however, we briefly discuss several other cost functions. One is the absolute-value cost function, where $C(x, \hat{x}) = \sum_i |x_i - \hat{x}_i|$; the expected cost is minimized, for each $y$, by choosing $\hat{x}_i(y)$ to be the conditional median of $f_{X_i|Y}(x_i \mid y)$ (see Exercise 10.1). The absolute-value cost function weighs large estimation errors more lightly than the squared cost function. The reason for the greater importance of the squared cost function, however, has less to do with the relative importance of large errors than with the conceptual richness and computational ease of working with means and variances rather than medians and absolute errors.

Another cost function is the maximum-error cost function. Here, for some given number $\varepsilon$, the cost is 1 if the magnitude of any component of the error exceeds $\varepsilon$ and is 0 otherwise. That is, $C(x, \hat{x}) = 1$ if $|x_i - \hat{x}_i| > \varepsilon$ for any $i$ and $C(x, \hat{x}) = 0$ otherwise. For any observation $y$, the minimum expected cost estimate is the value $\hat{x}(y)$ that maximizes the conditional probability that $X$ lies in a cube with sides of length $2\varepsilon$ centered on $\hat{x}(y)$. For $\varepsilon$ very small, this is approximated by choosing $\hat{x}(y)$ to be the $x$ that maximizes $f_{X|Y}(x \mid y)$ (i.e., for given $y$, one chooses the mode of $f_{X|Y}(x \mid y)$).

---
$^2$ As we will explain in section 8, $\mathrm{E}[\xi]$ is the expected value of the estimate bias.
The bias of an estimate $\hat{x}(Y)$ is defined there as $\mathrm{E}[\hat{x}(Y) - x \mid X = x]$. This is a function of the sample value $x$. Bias and expected bias are often confused in the literature.

This estimation rule is called the maximum a posteriori probability (MAP) estimate. One awkward property of MAP estimation is that in many cases the MAP estimate of $x_i$ is different from the $i$th component of the MAP estimate of $x = (x_1, \ldots, x_m)$ (see Exercise 10.1). Fortunately, in many of the most typical types of applications, the mean, the median, and the mode of the conditional distribution are all the same, so one often chooses between these estimates on the basis of conceptual or computational convenience.

In studying detection, we focused on the MAP rule, whereas here we are focusing on MMSE. The reason for this is that in most detection problems, the probability of error is quite small, and, as we saw, the effect of different costs for different kinds of errors was simply to change the thresholds. Here, since we are assuming that $f_{X|Y}(x \mid y)$ is finite, the probability of choosing the actual sample value $x$ exactly is equal to zero. Thus we must be satisfied with an approximate choice, and the question of cost cannot be avoided. Thus, in principle, attempting to maximize the probability of a correct choice is foolish (since that probability will be zero anyway). The reason the MAP rule is often used is, first, analytical convenience, and second, some confidence, for the particular problem at hand, that the conditional mode is a reasonable choice.

A more fundamental question in estimation theory is whether it is reasonable to assume a complete probabilistic model. A popular kind of model is one in which $f_{Y|X}(y \mid x)$ is specified, but one is unwilling or unable to specify $f_X(x)$. For example, $X$ might model a set of parameters about which almost nothing is known, whereas $Y$ might be the sum of $X$ plus statistically well-characterized noise. In these situations, we shall still regard $X$ as a random vector, but look at estimates that do not depend on $f_X(x)$. This will allow us both to compare the two approaches easily and to employ the methodologies of probabilistic models. The most common estimate that does not depend on $f_X(x)$ is called the maximum likelihood (ML) estimate; the maximum likelihood estimate $\hat{x}_{ML}(y)$ is defined as an $x$ that maximizes $f_{Y|X}(y \mid x)$ for the given $y$. We can view the ML estimate as the limit of MAP estimates as the a priori density on $X$ becomes closer and closer to uniform. For example, we could model $X \sim \mathcal{N}(0, \sigma^2 I)$ in the limit as $\sigma^2 \to \infty$. This limiting density does not exist, since the density approaches 0 everywhere, but the ML estimate typically will approach a limit. All of the previous comments about MAP estimates clearly carry over to ML.

10.2 MMSE Estimation for Gaussian Random Vectors

Many estimation problems involve zero-mean jointly Gaussian random vectors $X$ and $Y$. We say that $X$ and $Y$ are jointly non-singular if the covariance matrix $K_Z$ of $Z^T = (X^T, Y^T)$ is non-singular. Throughout this chapter, except where explicitly stated otherwise, we assume that the rv $X$ whose sample value is to be estimated and the observation rv $Y$ are jointly non-singular. For $X$ and $Y$ jointly Gaussian and zero mean, we saw in (3.42) that $X$ can always be represented as $X = GY + V$ where $V$ is a zero-mean Gaussian rv that is independent of $Y$. This means that $X$, conditional on $Y = y$, is $\mathcal{N}(Gy, K_V)$.
Thus for any sample value $y$ of $Y$, the MMSE estimate is $\hat{x}(y) = \mathrm{E}[X \mid Y = y] = Gy$. The error in the estimate, $\xi = \hat{x}(y) - X$, is $-V$. From (3.44) and (3.45), $G = K_{XY} K_Y^{-1}$ and $K_V = K_X - K_{XY} K_Y^{-1} K_{XY}^T$. Thus, $X$, conditional on $Y = y$, is $\mathcal{N}(K_{XY} K_Y^{-1} y,\; K_X - K_{XY} K_Y^{-1} K_{XY}^T)$ and
$$\hat{x}(y) = Gy = K_{XY} K_Y^{-1}\, y. \tag{10.7}$$
Since the estimation error $\xi = -V$, the covariance of the estimation error is $K_\xi = K_V$, so
$$K_\xi = K_X - K_{XY} K_Y^{-1} K_{XY}^T. \tag{10.8}$$
The estimate in (10.7) is linear in the observation sample value $y$. Also, since $Y$ and $V$ are independent rv's, the covariance of the estimation error does not depend on the observation $y$. That is,
$$K_\xi = \mathrm{E}\big[(\hat{x}(y) - X)(\hat{x}(y) - X)^T \mid Y = y\big] \quad \text{for each } y. \tag{10.9}$$
These are major simplifications that arise because the variables are jointly Gaussian.

More generally, let $X$ and $Y$ be jointly Gaussian random vectors with arbitrary means and define the fluctuations $U = X - \mathrm{E}[X]$ and $Z = Y - \mathrm{E}[Y]$. The observation $y$ of $Y$ is equivalent to an observation $z = y - \mathrm{E}[Y]$ of $Z$. Since $U$ and $Z$ are zero mean, the MMSE estimate $\hat{u}$, given $Z = z = y - \mathrm{E}[Y]$, can be found from (10.7), i.e.,
$$\hat{u} = K_{UZ} K_Z^{-1}\,(y - \mathrm{E}[Y]).$$
Since $K_{XY} = \mathrm{E}\big[(X - \mathrm{E}[X])(Y - \mathrm{E}[Y])^T\big] = K_{UZ}$ and, similarly, $K_Y = K_Z$, this can be rewritten as
$$\hat{u} = K_{XY} K_Y^{-1}\,(y - \mathrm{E}[Y]).$$
Finally, since $X = U + \mathrm{E}[X]$, the corresponding estimate of $X$, given $Y = y$, is
$$\hat{x}(y) = \mathrm{E}[X] + K_{XY} K_Y^{-1}\,(y - \mathrm{E}[Y]). \tag{10.10}$$
This can be rewritten as
$$\hat{x}(y) = b + K_{XY} K_Y^{-1}\, y \qquad \text{where } b = \mathrm{E}[X] - K_{XY} K_Y^{-1}\, \mathrm{E}[Y]. \tag{10.11}$$
The error covariance in estimating $X$ from $Y$ is clearly the same as that in estimating $U$ from $Z$, so there is no loss of generality in restricting our attention to zero mean. Note, however, that the estimate in (10.10) is a constant plus a linear function of $y$. This is properly called an affine transformation. These results are summarized in the following theorem:

Theorem 10.2.1. If $X$ and $Y$ are jointly non-singular Gaussian random vectors, the MMSE estimate of $X$ from the observation $Y = y$ is given by (10.10), which, for the zero-mean case, is (10.7). The covariance of the estimation error is given by (10.8), which gives both the covariance conditional on $Y = y$ and the unconditional covariance.

In what follows, we look at a number of examples of MMSE estimation for jointly Gaussian variables. We start from the simplest possible scalar cases and work our way up to recursive and Kalman estimation with vectors.

Example 10.2.1 (Scalar signal plus noise). The simplest estimation problem one can imagine is to estimate $X$ from the observation of $Y = X + Z$ where $X$ and $Z$ are zero-mean independent Gaussian rv's with variances $\sigma_X^2$ and $\sigma_Z^2$ (i.e., $X \sim \mathcal{N}(0, \sigma_X^2)$ and $Z \sim \mathcal{N}(0, \sigma_Z^2)$). Since $X$ and $Y$ are zero mean and jointly Gaussian, $X$, conditional on an observation $Y = y$, is given by (10.7) and (10.8) as $\mathcal{N}(K_{XY} K_Y^{-1} y,\; K_X - K_{XY} K_Y^{-1} K_{XY}^T)$. Since $Y = X + Z$, we have $K_{XY} = \sigma_X^2$ and $K_Y = \sigma_X^2 + \sigma_Z^2$. From (10.7), the MMSE estimate is
$$\hat{x}(y) = \frac{K_{XY}}{\sigma_Y^2}\, y = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_Z^2}\, y. \tag{10.12}$$
From the symmetry between $X$ and $Z$,
$$\hat{z}(y) = \frac{\sigma_Z^2}{\sigma_X^2 + \sigma_Z^2}\, y. \tag{10.13}$$
Note that $\hat{x}(y) + \hat{z}(y) = y$, so the estimation simply splits the observation between signal and noise according to their variances. From an intuitive standpoint, recall that both $X$ and $Z$ are zero mean, so if one has a larger variance than the other, it is reasonable to attribute the major part of $Y$ to the variable with the larger variance, as is done in (10.12) and (10.13).
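As a quick check on (10.12), here is a small simulation sketch (not from the text; the variances and sample size are invented for illustration). It compares the simulated mean square error of the estimate in (10.12) with the closed-form error variance $(1/\sigma_X^2 + 1/\sigma_Z^2)^{-1}$ derived in (10.14) just below, and with the error of using $y$ itself as the estimate.

```python
# Minimal simulation sketch of Example 10.2.1 (scalar signal plus noise).
# Draw many (X, Z) pairs, form Y = X + Z, and compare the mean square error
# of the MMSE estimate in (10.12) with that of the raw observation itself.
import numpy as np

rng = np.random.default_rng(0)
var_x, var_z = 4.0, 1.0                       # sigma_X^2 and sigma_Z^2 (assumed values)
n = 200_000

x = rng.normal(0.0, np.sqrt(var_x), n)
z = rng.normal(0.0, np.sqrt(var_z), n)
y = x + z

x_hat = (var_x / (var_x + var_z)) * y         # MMSE estimate, eq. (10.12)

mse_mmse = np.mean((x_hat - x)**2)
mse_raw  = np.mean((y - x)**2)                # using y itself as the estimate
print(f"simulated MSE of MMSE estimate  : {mse_mmse:.4f}")
print(f"theoretical 1/(1/var_x + 1/var_z): {1.0/(1.0/var_x + 1.0/var_z):.4f}")  # eq. (10.14)
print(f"MSE of using y directly          : {mse_raw:.4f}")
```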
The estimation error, $\xi = \hat{x}(Y) - X = Z - \hat{z}(Y)$, conditional on $Y = y$, is $\mathcal{N}(0,\; K_X - K_{XY} K_Y^{-1} K_{XY}^T)$. This does not depend on $y$, so $\xi$ is statistically independent of $Y$. Because of this, the variance of the estimation error conditional on $Y = y$, i.e., $\mathrm{E}[\xi^2 \mid Y = y]$, does not depend on $y$ and is the same as the unconditional variance of the error, which we call $\sigma_\xi^2$. Thus, for all $y$,
$$\sigma_\xi^2 = \mathrm{E}\big[(\hat{X}(y) - X)^2 \mid Y = y\big] = \sigma_X^2 - \frac{K_{XY}^2}{\sigma_Y^2} = \sigma_X^2 - \frac{\sigma_X^4}{\sigma_X^2 + \sigma_Z^2} = \frac{\sigma_X^2 \sigma_Z^2}{\sigma_X^2 + \sigma_Z^2}.$$
The most convenient way to write this is
$$\frac{1}{\sigma_\xi^2} = \frac{1}{\sigma_X^2} + \frac{1}{\sigma_Z^2}. \tag{10.14}$$
This form makes it clear that the variance of the estimation error is smaller than the original variance of either $X$ or $Z$, and that this variance increases as either $\sigma_X^2$ or $\sigma_Z^2$ increases.

Example 10.2.2 (Attenuated signal plus noise). Now generalize the above problem to $Y = hX + Z$ where $h$ is a scale factor. $X$ and $Z$ are still zero mean, independent, and Gaussian. The conditional probability $f_{X|Y}(x \mid y)$ for given $y$ is again $\mathcal{N}(K_{XY} K_Y^{-1} y,\; K_X - K_{XY} K_Y^{-1} K_{XY}^T)$. Now $K_{XY} = h\sigma_X^2$ and $\sigma_Y^2 = h^2\sigma_X^2 + \sigma_Z^2$, so (10.7) and (10.8) become
$$\hat{x}(y) = \frac{h\sigma_X^2\, y}{h^2\sigma_X^2 + \sigma_Z^2}\,; \qquad \sigma_\xi^2 = \sigma_X^2 - \frac{h^2\sigma_X^4}{h^2\sigma_X^2 + \sigma_Z^2} = \frac{\sigma_X^2\sigma_Z^2}{h^2\sigma_X^2 + \sigma_Z^2}. \tag{10.15}$$
This variance is best expressed as
$$\frac{1}{\sigma_\xi^2} = \frac{1}{\sigma_X^2} + \frac{h^2}{\sigma_Z^2}. \tag{10.16}$$
This follows directly from (10.15), and is also given by (39) in the Gaussian rv notes. It is insightful to also view the observation as $Y/h = X + Z/h$. The variance of this scaled noise is $\sigma_Z^2/h^2$, which shows that (10.15) and (10.16) follow directly from (10.12) and (10.14).

As a final extension, suppose $Y = hX + Z$ and $X$ has a mean $\overline{X}$, i.e., $X \sim \mathcal{N}(\overline{X}, \sigma_X^2)$. $Z$ is still zero mean. Then $\mathrm{E}[Y] = h\overline{X}$. Using (10.15) for the fluctuations in $X$ and $Y$, and using (10.10),
$$\hat{x}(y) = \overline{X} + \frac{h\sigma_X^2\,[y - h\overline{X}]}{h^2\sigma_X^2 + \sigma_Z^2}\,; \tag{10.17}$$
$$\frac{1}{\sigma_\xi^2} = \frac{1}{\sigma_X^2} + \frac{h^2}{\sigma_Z^2}. \tag{10.18}$$

Example 10.2.3 (Scalar recursive estimation). Suppose that we make multiple noisy observations, $y_1, y_2, \ldots$, of a single random variable $X \sim \mathcal{N}(\overline{X}, \sigma_X^2)$. After the $i$th observation, for each $i$, we want to make the best estimate of $X$ based on $y_1, \ldots, y_i$. We shall see that this can be done recursively, using the estimate based on $y_1, \ldots, y_{i-1}$ to help form the estimate based on $y_1, \ldots, y_i$. The observation random variables are related to $X$ by
$$Y_i = h_i X + Z_i,$$
where $\{Z_i\}$ and $X$ are independent Gaussian rv's, $Z_i \sim \mathcal{N}(0, \sigma_{Z_i}^2)$. Let $Y_{1;i}$ denote the first $i$ observation random variables, $Y_1, \ldots, Y_i$, and let $y_{1;i}$ denote the corresponding sample values $y_1, \ldots, y_i$. Let $\hat{x}(y_{1;i})$ be the MMSE estimate of $X$ based on the observation of $Y_{1;i} = y_{1;i}$ and let $\sigma_{\xi_i}^2$ be the variance of the estimation error $\xi_i = \hat{x}(Y_{1;i}) - X$. For $i = 1$, we solved this problem in (10.17) and (10.18); given $Y_1 = y_1$,
$$\hat{x}(y_1) = \overline{X} + \frac{h_1\sigma_X^2\,[y_1 - h_1\overline{X}]}{h_1^2\sigma_X^2 + \sigma_{Z_1}^2}\,; \tag{10.19}$$
$$\frac{1}{\sigma_{\xi_1}^2} = \frac{1}{\sigma_X^2} + \frac{h_1^2}{\sigma_{Z_1}^2}. \tag{10.20}$$
Conditional on $Y_1 = y_1$, $X$ is Gaussian with mean $\hat{x}(y_1)$ and variance $\sigma_{\xi_1}^2$; it is also independent of $Z_2$, which, conditional on $Y_1 = y_1$, is still $\mathcal{N}(0, \sigma_{Z_2}^2)$. Thus, in the portion of the sample space restricted to $Y_1 = y_1$, and with probabilities conditional on $Y_1 = y_1$, we want to estimate $X$ from $Y_2 = h_2 X + Z_2$; this is an instance of (10.17)
(here using conditional probabilities in the space $Y_1 = y_1$), with $\hat{x}(y_1)$ in place of $\overline{X}$, $\sigma_{\xi_1}^2$ in place of $\sigma_X^2$, and $h_2$ in place of $h$, so
$$\hat{x}(y_{1;2}) = \hat{x}(y_1) + \frac{h_2\sigma_{\xi_1}^2\,[y_2 - h_2\hat{x}(y_1)]}{h_2^2\sigma_{\xi_1}^2 + \sigma_{Z_2}^2}\,; \tag{10.21}$$
$$\sigma_{\xi_2}^2 = \sigma_{\xi_1}^2 - \frac{h_2^2\sigma_{\xi_1}^4}{h_2^2\sigma_{\xi_1}^2 + \sigma_{Z_2}^2}\,; \tag{10.22}$$
$$\frac{1}{\sigma_{\xi_2}^2} = \frac{1}{\sigma_{\xi_1}^2} + \frac{h_2^2}{\sigma_{Z_2}^2}. \tag{10.23}$$
We see that, conditional on $Y_1 = y_1$ and $Y_2 = y_2$, $X$ is Gaussian with mean $\hat{x}(y_{1;2})$ and variance $\sigma_{\xi_2}^2$.

What we are doing here is first conditioning on $Y_1 = y_1$, and then, in that conditional space, conditioning on $Y_2 = y_2$. This is the same as directly conditioning on $Y_1 = y_1,\ Y_2 = y_2$. To see this by an equation, consider the following identity:
$$f_{X|Y_1 Y_2}(x \mid y_1 y_2) = \frac{f_{Y_2|X Y_1}(y_2 \mid x\, y_1)}{f_{Y_2|Y_1}(y_2 \mid y_1)}\, f_{X|Y_1}(x \mid y_1).$$
The expression on the right is the Bayes law form of the density of $X$ given $Y_2 = y_2$ in the space restricted to $Y_1 = y_1$.

Iterating the argument in (10.21)-(10.23) to arbitrary $i > 1$, we have
$$\hat{x}(y_{1;i}) = \hat{x}(y_{1;i-1}) + \frac{h_i\sigma_{\xi_{i-1}}^2\,[y_i - h_i\hat{x}(y_{1;i-1})]}{h_i^2\sigma_{\xi_{i-1}}^2 + \sigma_{Z_i}^2}\,; \tag{10.24}$$
$$\sigma_{\xi_i}^2 = \sigma_{\xi_{i-1}}^2 - \frac{h_i^2\sigma_{\xi_{i-1}}^4}{h_i^2\sigma_{\xi_{i-1}}^2 + \sigma_{Z_i}^2}\,; \tag{10.25}$$
$$\frac{1}{\sigma_{\xi_i}^2} = \frac{1}{\sigma_{\xi_{i-1}}^2} + \frac{h_i^2}{\sigma_{Z_i}^2}. \tag{10.26}$$
As shown in Exercise 10.2, $\hat{x}(y_{1;i})$ and $\sigma_{\xi_i}^2$ can also be found directly from (10.7) and (10.8), and these yield the alternate forms
$$\hat{x}(y_{1;i}) = \sum_{j=1}^i a_j y_j\,; \qquad a_j = \frac{h_j/\sigma_{Z_j}^2}{(1/\sigma_X^2) + \sum_{n=1}^i h_n^2/\sigma_{Z_n}^2}\,; \tag{10.27}$$
$$\frac{1}{\sigma_{\xi_i}^2} = \frac{1}{\sigma_X^2} + \sum_{n=1}^i \frac{h_n^2}{\sigma_{Z_n}^2}. \tag{10.28}$$

Example 10.2.4 (Scalar Kalman filter). We now extend the recursive estimation above to the case where $X$ evolves with time. In particular, assume that $X_1 \sim \mathcal{N}(\overline{X}_1, \sigma_{X_1}^2)$. For each $i \ge 1$, $X_{i+1}$ evolves from $X_i$ as
$$X_{i+1} = \alpha_i X_i + W_i\,; \qquad W_i \sim \mathcal{N}(0, \sigma_{W_i}^2). \tag{10.29}$$
For each $i \ge 1$, $\alpha_i$ and $\sigma_{W_i}^2$ are known scalars. Noisy observations $Y_i$ are made satisfying
$$Y_i = h_i X_i + Z_i\,; \qquad Z_i \sim \mathcal{N}(0, \sigma_{Z_i}^2)\,; \qquad i \ge 1, \tag{10.30}$$
where, for each $i \ge 1$, $h_i$ is a known scalar. Finally, assume that $\{Z_i;\, i \ge 1\}$, $\{W_i;\, i \ge 1\}$ and $X_1$ are all independent.

For each $i \ge 1$, we want to find the MMSE estimate of both $X_i$ and $X_{i+1}$ conditional on the first $i$ observations, $Y_{1;i} = y_{1;i} = y_1, \ldots, y_i$. We denote those estimates as $\hat{x}_i(y_{1;i})$ and $\hat{x}_{i+1}(y_{1;i})$. We denote the errors in these estimates as $\xi_i = \hat{x}_i(Y_{1;i}) - X_i$ and $\zeta_{i+1} = \hat{x}_{i+1}(Y_{1;i}) - X_{i+1}$.

For $i = 1$, conditional on $Y_1 = y_1$, the solution for $\hat{x}_1(y_1)$ and the variance $\sigma_{\xi_1}^2$ of the estimation error is given by (10.19) and (10.20),
$$\hat{x}_1(y_1) = \overline{X}_1 + \frac{h_1\sigma_{X_1}^2\,[y_1 - h_1\overline{X}_1]}{h_1^2\sigma_{X_1}^2 + \sigma_{Z_1}^2}\,; \tag{10.31}$$
$$\frac{1}{\sigma_{\xi_1}^2} = \frac{1}{\sigma_{X_1}^2} + \frac{h_1^2}{\sigma_{Z_1}^2}. \tag{10.32}$$
This means that, conditional on $Y_1 = y_1$, $X_1$ is Gaussian with mean $\hat{x}_1(y_1)$ and variance $\sigma_{\xi_1}^2$. Thus, conditional on $Y_1 = y_1$, $X_2 = \alpha_1 X_1 + W_1$ is Gaussian with mean $\alpha_1\hat{x}_1(y_1)$ and variance $\alpha_1^2\sigma_{\xi_1}^2 + \sigma_{W_1}^2$. This means that the MMSE estimate $\hat{x}_2(y_1)$ and the error variance $\sigma_{\zeta_2}^2$ for $X_2$ conditional on $Y_1 = y_1$ are given by
$$\hat{x}_2(y_1) = \alpha_1\hat{x}_1(y_1)\,; \qquad \sigma_{\zeta_2}^2 = \alpha_1^2\sigma_{\xi_1}^2 + \sigma_{W_1}^2. \tag{10.33}$$
In the restricted space $Y_1 = y_1$, we now want to estimate $X_2$ from the additional observation $Y_2 = y_2$. We can use (10.17) again, with $\hat{x}_2(y_1)$ in place of $\overline{X}$ and $\sigma_{\zeta_2}^2$ in place of $\sigma_X^2$:
$$\hat{x}_2(y_{1;2}) = \hat{x}_2(y_1) + \frac{h_2\sigma_{\zeta_2}^2\,[y_2 - h_2\hat{x}_2(y_1)]}{h_2^2\sigma_{\zeta_2}^2 + \sigma_{Z_2}^2}\,; \tag{10.34}$$
$$\sigma_{\xi_2}^2 = \sigma_{\zeta_2}^2 - \frac{h_2^2\sigma_{\zeta_2}^4}{h_2^2\sigma_{\zeta_2}^2 + \sigma_{Z_2}^2}\,; \qquad \frac{1}{\sigma_{\xi_2}^2} = \frac{1}{\sigma_{\zeta_2}^2} + \frac{h_2^2}{\sigma_{Z_2}^2}. \tag{10.35}$$
In general, we can repeat the arguments leading to (10.33)-(10.35) for each $i$, leading to
$$\hat{x}_i(y_{1;i-1}) = \alpha_{i-1}\hat{x}_{i-1}(y_{1;i-1})\,; \tag{10.36}$$
$$\sigma_{\zeta_i}^2 = \alpha_{i-1}^2\sigma_{\xi_{i-1}}^2 + \sigma_{W_{i-1}}^2\,; \tag{10.37}$$
$$\hat{x}_i(y_{1;i}) = \hat{x}_i(y_{1;i-1}) + \frac{h_i\sigma_{\zeta_i}^2\,[y_i - h_i\hat{x}_i(y_{1;i-1})]}{h_i^2\sigma_{\zeta_i}^2 + \sigma_{Z_i}^2}\,; \tag{10.38}$$
$$\sigma_{\xi_i}^2 = \sigma_{\zeta_i}^2 - \frac{h_i^2\sigma_{\zeta_i}^4}{h_i^2\sigma_{\zeta_i}^2 + \sigma_{Z_i}^2}\,; \qquad \frac{1}{\sigma_{\xi_i}^2} = \frac{1}{\sigma_{\zeta_i}^2} + \frac{h_i^2}{\sigma_{Z_i}^2}. \tag{10.39}$$
These are the scalar Kalman filter equations. The idea is that one "filters" the observed values $Y_1, Y_2, \ldots$ to generate the MMSE estimate of the "signal" $X_1, X_2, \ldots$. In principle, the variance terms $\sigma_{\zeta_i}^2$ and $\sigma_{\xi_i}^2$ could be precomputed, so that the estimates are linear in the observations.

An important special case of this result occurs where $h_i$, $\alpha_i$, $\sigma_{Z_i}^2$, and $\sigma_{W_i}^2$ are all independent of $i$, with $0 < \alpha < 1$. As we will see later, the discrete process $\{X_n;\, n \ge 1\}$ is a Markov process as well as a Gaussian process, and is usually referred to as Gauss-Markov. As $n$ becomes large, the mean of $X_n$ approaches 0 and the variance approaches $\sigma_W^2/(1 - \alpha^2)$. We would expect that if $\alpha$ is very close to 1, we would be able to estimate $X_n$ quite accurately, since an independent measurement is taken each unit of time and the process changes very slowly. To investigate the steady-state behavior under these circumstances, we substitute (10.37) into (10.39), getting
$$\frac{1}{\sigma_{\xi_i}^2} = \frac{1}{\alpha^2\sigma_{\xi_{i-1}}^2 + \sigma_W^2} + \frac{h^2}{\sigma_Z^2}. \tag{10.40}$$
It is not hard to see that $\sigma_{\xi_i}^2$ monotonically approaches a limiting value, say $\sigma^2$, and that $\sigma^2$ must satisfy the quadratic equation
$$\frac{\alpha^2 h^2}{\sigma_Z^2}\,\sigma^4 + \Big[\frac{h^2\sigma_W^2}{\sigma_Z^2} + 1 - \alpha^2\Big]\sigma^2 - \sigma_W^2 = 0. \tag{10.41}$$

10.3 Linear least squares estimation

In many non-Gaussian situations, we want an estimate $\hat{x}(y)$ that minimizes the expected mean square error
$$\int \mathrm{E}\big[\|X - \hat{x}(y)\|^2 \mid Y = y\big]\, f_Y(y)\, dy = \mathrm{E}\big[\|X - \hat{x}(Y)\|^2\big] \tag{10.42}$$
subject to the condition that $\hat{x}(y)$ is affine in $y$ (i.e., $\hat{x}(y) = Ay + b$ for some matrix $A$ and vector $b$). Such an estimate is called a linear least squares error (LLSE) estimate (it should really be called affine rather than linear, but the terminology is too standard to change). Since $\hat{x}(y)$ is restricted to have the form $Ay + b$ for some matrix $A$ and vector $b$, we want to find $A$ and $b$ to minimize $\mathrm{E}\big[\|X - AY - b\|^2\big]$. For any choice of $A$ and $b$, this expectation involves only the first and second moments (including cross moments) of the components of $X$ and $Y$. This means that if $X$, $Y$ have the first and second moments $\mathrm{E}[X]$, $\mathrm{E}[Y]$, $K_X$, $K_Y$, $K_{XY}$, and if $U$, $V$ are other rv's with the same first and second moments, then
$$\mathrm{E}\big[\|X - AY - b\|^2\big] = \mathrm{E}\big[\|U - AV - b\|^2\big] \quad \text{for all } A \text{ and } b.$$
These expressions are then minimized by the same $A$ and $b$, and these minimizing values are functions only of those first and second moments.

For any $X$, $Y$, the MMSE estimate minimizes (10.42) without the constraint that the estimate is affine, and the LLSE estimate minimizes (10.42) with the affine constraint. Thus the mean square error in the MMSE estimate must be less than or equal to the mean square error for the LLSE estimate. For the case in which $X$, $Y$ are jointly Gaussian, the MMSE estimate is affine, and thus is the same as the LLSE estimate. Thus for the Gaussian case, the LLSE estimate is achieved with
$$A = K_{XY} K_Y^{-1}\,; \qquad b = \mathrm{E}[X] - K_{XY} K_Y^{-1}\,\mathrm{E}[Y]. \tag{10.43}$$
For any non-Gaussian $X$ and $Y$, there are Gaussian rv's with the same first and second moments, and thus the minimizing value of $\mathrm{E}\big[\|X - AY - b\|^2\big]$ is the same for the non-Gaussian and Gaussian cases, and the optimizing $A$ and $b$ are given in (10.43). Thus (10.7) and (10.10) give the LLSE estimate for both the Gaussian and non-Gaussian cases as well as the MMSE estimate for the Gaussian case. There is an important idea here.
The mean square error using the LLSE estimate is the same for the non-Gaussian and Gaussian cases, assuming the same first and second moments. For the Gaussian case, the MMSE estimate is the same as the LLSE estimate, so the mean square error is also the same. For the non-Gaussian case, the mean square error for the MMSE estimate is less than or equal to that for the LLSE estimate (since the MMSE estimate is not constrained to be affine). Thus the mean square error (using a MMSE estimate) for the non-Gaussian case is less than or equal to the mean square error (using MMSE) for the Gaussian case. Thus, for the mean square error criterion, Gaussian rv's are the most difficult to estimate. What is more, an estimator constructed under the assumption of Gaussian rv's will work just as well for non-Gaussian rv's with the same first and second moments. Naturally, an estimator constructed to take advantage of the non-Gaussian statistics might do even better.

The LLSE error covariance in the non-Gaussian case, $\mathrm{E}\big[(X - \hat{x}(y))(X - \hat{x}(y))^T \mid Y = y\big]$, will generally vary with $y$. However, when this is averaged over $y$, $\mathrm{E}\big[(X - \hat{x}(Y))(X - \hat{x}(Y))^T\big]$ depends only on first and second moments, so this average error covariance matrix is given by (10.8).

We will rederive (10.7), (10.8), and (10.10) later for LLSE estimation using the first and second moments directly. The important observation here is that we can either use first and second moments directly to find LLSE estimates, or we can use the properties of Gaussian densities to find MMSE estimates for Gaussian problems, but we get the same answer in each case. Thus we can solve whichever problem we wish and get the solution to the other as a bonus. This is summarized in the following theorem:

Theorem 10.3.1. If $X$ and $Y$ are jointly non-singular random vectors, the LLSE estimate of $X$ from the observation $Y = y$ is given by (10.10) and, for the zero-mean case, by (10.7). The error covariance, averaged over $Y$, i.e., $K_{X|Y} = \mathrm{E}\big[(X - \hat{X}(Y))(X - \hat{X}(Y))^T\big]$, is given by (10.8).

10.4 Filtered vector signal plus noise

Consider estimating $X$ from an observed sample of $Y$ where $Y = HX + Z$. We assume that $X$ and $Z$ are independent zero-mean Gaussian random vectors and $H$ is an arbitrary matrix. To start with, we explore the simpler case where $H$ is an identity matrix, i.e., where $Y = X + Z$. Assume that the covariance matrices $K_X$ and $K_Z$ are each non-singular. Since $X$ and $Y$ are zero-mean jointly Gaussian, we know from (10.7) that $\hat{x}(y) = K_{XY} K_Y^{-1} y$. Since $Y = X + Z$, it follows that $K_{XY} = K_X$ and $K_Y = K_X + K_Z$. We then have
$$\hat{x}(y) = K_X (K_X + K_Z)^{-1}\, y. \tag{10.44}$$
Similarly, from (10.8),
$$K_{X|Y} = K_X - K_{XY} K_Y^{-1} K_{XY}^T = K_X - K_X K_Y^{-1} K_X. \tag{10.45}$$
Factoring $K_X K_Y^{-1}$ out of the right side of (10.45), we get
$$K_{X|Y} = K_X K_Y^{-1}\big[K_Y - K_X\big] = K_X K_Y^{-1} K_Z = K_X\big[K_X + K_Z\big]^{-1} K_Z. \tag{10.46}$$
Inverting both sides of this equation,
$$K_{X|Y}^{-1} = K_Z^{-1}\big[K_X + K_Z\big] K_X^{-1} = K_Z^{-1} + K_X^{-1}. \tag{10.47}$$
Substituting (10.46) into (10.44), we get
$$\hat{x}(y) = K_{X|Y} K_Z^{-1}\, y. \tag{10.48}$$
Now consider the more general case where $Y = HX + Z$ and $H$ is an arbitrary matrix. In a very real sense, this is the canonical estimation problem, making noisy observations of linear functions of the variables to be estimated. In communication terms, we can view $X$ as the samples, at some sufficiently high sampling rate, of the input to a communication system, taken over some interval of time.
$H$ then represents the filter (perhaps time-varying) that the input passes through, and $Z$ represents samples of the noise that are added to the filtered signal.

The conditional probability $f_{X|Y}(x \mid y)$, for any given $y$, is the Gaussian density $\mathcal{N}(K_{XY} K_Y^{-1} y,\; K_X - K_{XY} K_Y^{-1} K_{XY}^T)$. Since $K_{XY} = K_X H^T$ and $K_Y = H K_X H^T + K_Z$, the MMSE estimate of $X$ is
$$\hat{x}(y) = K_X H^T\big[H K_X H^T + K_Z\big]^{-1}\, y\,; \tag{10.49}$$
$$K_{X|Y} = K_X - K_X H^T\big[H K_X H^T + K_Z\big]^{-1} H K_X. \tag{10.50}$$
From Eqs. (41) and (42) of the notes on Gaussian rv's, alternative forms are
$$\hat{x}(y) = K_{X|Y} H^T K_Z^{-1}\, y\,; \tag{10.51}$$
$$K_{X|Y} = \big[K_X^{-1} + H^T K_Z^{-1} H\big]^{-1}. \tag{10.52}$$

10.4.1 Alternative derivation of (10.51) and (10.52)

In many applications, the dimension $m$ of $X$ is much smaller than the dimension $n$ of $Y$ and $Z$. Also, the components of $Z$ are often independent, so that $K_Z$ is easy to invert. In those applications, (10.51) and (10.52) are much easier to work with than (10.49) and (10.50). We now re-derive (10.51) and (10.52) in a way that provides more insight into their meaning.

Starting from first principles, $Y$, conditional on a given $X = x$, is a Gaussian vector with covariance $K_Z$ and mean $Hx$. The density, or likelihood, is
$$f_{Y|X}(y \mid x) = \frac{\exp\big\{-\tfrac{1}{2}(y - Hx)^T K_Z^{-1}(y - Hx)\big\}}{(2\pi)^{n/2}\sqrt{\det K_Z}} = \frac{\exp\big\{-\tfrac{1}{2}\big[y^T K_Z^{-1} y - (Hx)^T K_Z^{-1} y - y^T K_Z^{-1} Hx + (Hx)^T K_Z^{-1} Hx\big]\big\}}{(2\pi)^{n/2}\sqrt{\det K_Z}}. \tag{10.53}$$
Note that the second and third terms in the exponent above are transposes of each other and are one-dimensional; thus they must be the same. Also, the first term is a function only of $y$, so we can combine it and the denominator into some function $f(y)$, yielding
$$f_{Y|X}(y \mid x) = f(y)\exp\Big(x^T H^T K_Z^{-1} y - \frac{1}{2}(Hx)^T K_Z^{-1} Hx\Big). \tag{10.54}$$
The likelihood ratio is then $\Lambda(y, x) = f_{Y|X}(y \mid x)/f_{Y|X}(y \mid x_o)$ where $x_o$ is some fixed vector, often taken to be $0$. Recall that for binary detection, the likelihood ratio was a function of $y$ only, whereas here it is a function of both $y$ and $x$. From (10.54), with $x_o = 0$,
$$\Lambda(y, x) = \exp\Big(x^T H^T K_Z^{-1} y - \frac{1}{2}(Hx)^T K_Z^{-1} Hx\Big). \tag{10.55}$$
The ML estimate $\hat{x}_{ML}(y)$ can be calculated from the likelihood ratio by maximizing $\Lambda(y, x)$ over $x$ for the given observation $y$. Similarly, the MAP estimate can be calculated by maximizing $\Lambda(y, x) f_X(x)$ over $x$. Also, $f_{X|Y}(x \mid y)$ can be found from
$$f_{X|Y}(x \mid y) = \frac{\Lambda(y, x)\, f_X(x)}{\int \Lambda(y, x')\, f_X(x')\, dx'}. \tag{10.56}$$
The advantage of using the likelihood ratio is that $f(y)$ in (10.54) is eliminated. Note also that the likelihood ratio does not involve the a priori distribution on $X$, and thus, like $f_{Y|X}(y \mid x)$, it can be used both where $f_X(x)$ is unknown and where we want to look at a range of choices for $f_X(x)$.

For estimation problems, as with detection problems, a sufficient statistic is some function of the observation $y$ from which the likelihood ratio can be calculated for all $x$. Thus, from (10.55), we see that $H^T K_Z^{-1} y$ is a sufficient statistic for $x$ conditional on $y$. $H^T K_Z^{-1} y$ has dimension $m$, the same as $x$, whereas $y$ has dimension $n$. If $n > m$, as is common in communication situations, then the sufficient statistic $H^T K_Z^{-1} y$ has reduced the $n$-dimensional observation space down to an $m$-dimensional space from which the estimate or decision can be made. It is important to recognize that nothing is lost if $H^T K_Z^{-1} y$ is found from the observation $y$ and then $y$ is discarded; all of the relevant information has been extracted from $y$. There are usually many choices of sufficient statistics.
For example, the observation $y$ itself is a sufficient statistic, and any invertible transformation of $y$ is also sufficient. Also, any invertible transformation of $H^T K_Z^{-1} y$ is a sufficient statistic. What is important about $H^T K_Z^{-1} y$ is, first, that for $n > m$ the dimensionality has been reduced from the raw observation $y$, and second, that the operator $H^T K_Z^{-1}$ is a matched filter; this will be discussed shortly.

The estimate in (10.51) can be viewed as first finding the sufficient statistic $v = H^T K_Z^{-1} y$, and then finding the MMSE estimate from $v$; i.e., $\hat{x}(y) = K_{X|Y}\, v$. We now try to understand why the sufficient statistic $v$ is operated on by $K_{X|Y}$ to find $\hat{x}$. Using $v = H^T K_Z^{-1} y$ in (10.54), $f_{XY}(x\, y)$ can be expressed as
$$f_{XY}(x\, y) = f_X(x) f_{Y|X}(y \mid x) = f(y)\exp\Big(x^T v - \frac{1}{2}x^T H^T K_Z^{-1} H x - \frac{1}{2}x^T K_X^{-1} x\Big), \tag{10.57}$$
where the denominator part of $f_X(x)$ has been incorporated into $f(y)$. Letting $f'(y) = f(y)/f_Y(y)$,
$$f_{X|Y}(x \mid y) = f'(y)\exp\Big(x^T v - \frac{1}{2}x^T\big[H^T K_Z^{-1} H + K_X^{-1}\big]x\Big). \tag{10.58}$$
Defining $B = H^T K_Z^{-1} H + K_X^{-1}$, we can complete the square in the exponent of (10.58),
$$f_{X|Y}(x \mid y) = f''(y)\exp\Big(-\frac{1}{2}\big(x - B^{-1}v\big)^T B\big(x - B^{-1}v\big)\Big), \tag{10.59}$$
where the quadratic term in $v$ has been incorporated into $f''(y)$. This makes it clear that for any given $y$, and thus any given $v = H^T K_Z^{-1} y$, $f_{X|Y}(x \mid y)$ is $\mathcal{N}(B^{-1}v, B^{-1})$. This confirms (10.51) and (10.52), i.e., that $B^{-1}$ is the covariance of the estimation error, and $B^{-1}v$ is the MMSE estimate. Thus this gives us nothing new, but the derivation is more insightful than the strictly algebraic derivation in the Gaussian rv notes.

10.4.2 Estimate of a single rv in IID noise

In trying to obtain more insight, we look at a number of special cases. First, let $X$ be a random variable rather than a vector, so that $Y = hX + Z$ where $h$ is now a vector. We have $n$ observations, $Y_i = h_i X + Z_i$, and must estimate $X$ from the set of observations. If the noise variables $\{Z_i;\, 1 \le i \le n\}$ are IID with mean 0 and variance $\sigma_Z^2$, then, from (10.51),
$$\hat{x}(y) = K_{X|Y}\sum_{j=1}^n \frac{h_j y_j}{\sigma_Z^2}\,; \qquad \frac{1}{K_{X|Y}} = \frac{1}{\sigma_X^2} + \sum_{i=1}^n \frac{h_i^2}{\sigma_Z^2}. \tag{10.60}$$
Note that this agrees with (10.27) and (10.28). In terms of the communication example, we can visualize $x$ as being a sample of the signal $X$ which is filtered by a discrete filter with taps $(h_1, h_2, \ldots, h_n)$, giving rise to a sequence $(h_1 x, h_2 x, \ldots, h_n x)$. The received signal $y$ is then $\{y_i = h_i x + z_i;\, 1 \le i \le n\}$. By definition, the filter "matched" to the filter with taps $(h_1, \ldots, h_n)$ is a filter with taps $(h_n, h_{n-1}, \ldots, h_2, h_1)$. After $(y_1, y_2, \ldots, y_n)$ passes through this matched filter, the output sample is $h_1 y_1 + h_2 y_2 + \cdots + h_n y_n$. This sum is a sufficient statistic, and as seen in (10.60), $\hat{x}$ is simply a scaled version of this sum. Figure 1 illustrates this for $n = 2$.

Figure 1: Illustration of a one-dimensional signal $x$ and a two-dimensional observation $y$ where $y_1 = 2x + z_1$ and $y_2 = x + z_2$ and $(z_1, z_2)$ is a sample value of IID Gaussian noise variables. The concentric circles are regions where $f_{Y|X}(y \mid x)$ is constant for given $x$. The straight line through the origin is the locus of points $(2u, u)$ as $u$ varies, and the other straight line is the perpendicular to the first line; it is the locus of points where $2y_1 + y_2$ is constant. Thus the sufficient statistic $2y_1 + y_2$ can be viewed as projecting $(y_1, y_2)$ onto the straight line $(2u, u)$, and the resulting $u$ is a constant multiple of $2y_1 + y_2$, and thus also a sufficient statistic.
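The following sketch (not from the text; the taps, variances, and random seed are invented for illustration) computes the matched-filter estimate in (10.60) for one simulated observation and checks that it coincides with the general vector form $\hat{x}(y) = K_{XY}K_Y^{-1}y$ of (10.49), using $K_{XY} = \sigma_X^2 h^T$ and $K_Y = \sigma_X^2 h h^T + \sigma_Z^2 I$.

```python
# Sketch of the matched-filter estimate (10.60) for a single rv observed in
# IID noise, checked against the general vector form (10.49).
import numpy as np

rng = np.random.default_rng(1)
h = np.array([2.0, 1.0, -0.5])            # filter taps h_1..h_n (assumed)
var_x, var_z = 3.0, 0.5                   # sigma_X^2 and sigma_Z^2 (assumed)
x = rng.normal(0.0, np.sqrt(var_x))
y = h * x + rng.normal(0.0, np.sqrt(var_z), h.size)

# Scalar form (10.60): the sufficient statistic is the matched-filter output sum h_j y_j.
k_x_given_y = 1.0 / (1.0 / var_x + np.sum(h**2) / var_z)
x_hat_scalar = k_x_given_y * np.dot(h, y) / var_z

# General form (10.49): x_hat = K_XY K_Y^{-1} y  with  K_XY = var_x h^T,
# K_Y = var_x h h^T + var_z I.
K_XY = var_x * h
K_Y = var_x * np.outer(h, h) + var_z * np.eye(h.size)
x_hat_vector = K_XY @ np.linalg.solve(K_Y, y)

print(x_hat_scalar, x_hat_vector)          # the two forms agree
```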
10.4.3 Estimate of a single rv in arbitrary noise

As a slightly more general case, let $Y = hX + Z$ again, but suppose $Z$ has an arbitrary covariance matrix $K_Z$ and let $A$ be a matrix such that $AA^T = K_Z$. We can then visualize $Z$ as $AW$ where $W$ is $\mathcal{N}(0, I)$. Let $Y' = A^{-1}Y = A^{-1}hX + W$. This can be viewed as the output filtered to make the noise white. Passing this through a filter matched to $A^{-1}h$ then yields $h^T(A^{-1})^T Y'$, which is $h^T K_Z^{-1} Y$. We view $h^T K_Z^{-1}$ as a matched filter for this more general situation, and we see that again it is a sufficient statistic.

Note that (10.60) gives the MMSE estimate of $X$, which is the same as the MAP estimate, which maximizes $f_{Y|X}(y \mid x) f_X(x)$. The maximum likelihood (ML) estimate, on the other hand, maximizes $f_{Y|X}(y \mid x)$ for a given $y$. The two are the same if $f_X(x)$ is uniform, but we cannot make $f_X(x)$ uniform over the infinite range of possible values for $X$. The two become equal in the limit, however, as we let the variance of $X$ approach infinity. Thus the ML estimate is given by setting $1/\sigma_X^2 = 0$ in (10.60). In detection problems where a finite set of a priori equiprobable values exists for $X$, one minimizes the error probability by choosing the possible $x$ that maximizes $f_{Y|X}(y \mid x)$. With Gaussian noise, this is the possible value closest to the ML estimate.

10.4.4 Orthonormal expansions of input and output

Next consider the situation $Y = HX + Z$ where $X = (X_1, \ldots, X_m)^T$ has $m$ IID components, each of variance $\sigma^2$, and $Z = (Z_1, \ldots, Z_n)^T$ has $n \ge m$ IID components, each of variance 1. We could view each component $X_i$ of $X$ as passing through a filter $h_i = (h_{1,i}, \ldots, h_{n,i})$, where $h_i$ is the $i$th column of $H$, and the solution in (10.51) passes the observation $y$ through the matched filters $h_i^T y$ in forming the MMSE estimate of $X$. However, in (10.51) the matched filter outputs are then operated on by $K_{X|Y}$, which somehow takes care of scaling and of the interference between the different components of $X$. To get another viewpoint of why $K_{X|Y}$ is the right operation in (10.51), we rederive (10.51) in yet another way, by viewing the operator $H$ in terms of orthonormal expansions of input and output.

Note that $H^T H$ is a symmetric $m$ by $m$ matrix, and it has $m$ eigenvalue-eigenvector pairs,
$$H^T H\,\phi_i = \lambda_i\,\phi_i\,; \qquad 1 \le i \le m. \tag{10.61}$$
The eigenvalues need not be distinct, and need not be non-zero, but they are all non-negative, and we can always choose the set of $\phi_i$ to be orthonormal, which we henceforth assume. The matrix $F$ with columns $\phi_1, \phi_2, \ldots, \phi_m$ has the property that $F^T F = I$ (since $\phi_i^T\phi_j = \delta_{ij}$), so $F^T = F^{-1}$. Assume that $\lambda_i > 0$ for $1 \le i \le m$ and define the $n$-dimensional vectors $\theta_i$ to be $\lambda_i^{-1/2} H\phi_i$. These vectors must be orthonormal (see Exercise 10.3) since the vectors $\phi_i$ are orthonormal, and each $\theta_i$ is an eigenvector, with eigenvalue $\lambda_i$, of $HH^T$ (which is an $n$ by $n$ matrix). Let $Q$ be the $n$ by $m$ matrix made up of the columns $\theta_1, \ldots, \theta_m$. Then $Q = HF\Lambda^{-1/2}$ where $\Lambda$ is the diagonal matrix with elements $\lambda_1, \ldots, \lambda_m$. We then have
$$H = Q\Lambda^{1/2}F^T = \sum_i \lambda_i^{1/2}\,\theta_i\phi_i^T\,; \tag{10.62}$$
$$Y = Q\Lambda^{1/2}F^T X + Z. \tag{10.63}$$
Now note that, because of the orthonormality of $\{\theta_i\}$, $Q^T Q = I$ (even though $Q$ is $n$ by $m$), and thus
$$Q^T Y = \Lambda^{1/2}F^T X + Q^T Z. \tag{10.64}$$
Now define $F^T X$ as $X_\phi$. That is, $X_\phi$ is the random vector $X$ in the basis $\phi_1, \ldots, \phi_m$, i.e., the random vector with components $\{\phi_i^T X;\, 1 \le i \le m\}$. Note that $\mathrm{E}[X_\phi X_\phi^T] = \mathrm{E}[F^T X X^T F] = \sigma^2 F^T I F = \sigma^2 I$. Similarly, define $Q^T Z$ as $Z_\theta$. By the same argument, $\mathrm{E}[Z_\theta Z_\theta^T] = I$.
Finally, define $Q^T Y$ as $Y_\theta$. We then have $Y_\theta = \Lambda^{1/2}X_\phi + Z_\theta$. Since the components of both $X_\phi$ and $Z_\theta$ are independent, we have $m$ independent one-dimensional estimation problems. Note that $Z_\theta$ does not fully represent the noise $Z$ (see Exercise 10.3), but the remaining $n - m$ components of the noise are independent of $X$ and of $Z_\theta$, so they are irrelevant and can be ignored. From (10.15) and (10.16), the MMSE estimate of each component of $X_\phi$ is given by
$$\big(\hat{x}_\phi(y_\theta)\big)_i = \frac{\sqrt{\lambda_i}\,\sigma^2\,(y_\theta)_i}{\lambda_i\sigma^2 + 1}\,; \qquad \mathrm{E}\big[|(X_\phi)_i - (\hat{x}_\phi(y_\theta))_i|^2\big] = \big[\sigma^{-2} + \lambda_i\big]^{-1}. \tag{10.65}$$
In vector notation, this becomes
$$\hat{x}_\phi(y_\theta) = K_{X_\phi|Y_\theta}\sqrt{\Lambda}\; y_\theta\,; \qquad K_{X_\phi|Y_\theta} = \big[\sigma^{-2}I + \Lambda\big]^{-1}. \tag{10.66}$$
Finally we change back to the original basis. Since $X = FX_\phi$, we have $\hat{x}(y) = F\hat{x}_\phi(y_\theta)$. Also $y_\theta = Q^T y$, so
$$\hat{x}(y) = FK_{X_\phi|Y_\theta}\sqrt{\Lambda}\,Q^T y = \big(FK_{X_\phi|Y_\theta}F^T\big)\big(F\sqrt{\Lambda}\,Q^T\big)y. \tag{10.67}$$
Finally, note that
$$K_{X|Y} = \mathrm{E}\big[(X - \hat{x}(Y))(X - \hat{X}(Y))^T\big] = \mathrm{E}\big[F(X_\phi - \hat{x}_\phi)(X_\phi - \hat{X}_\phi)^T F^T\big] = FK_{X_\phi|Y_\theta}F^T. \tag{10.68}$$
Also, from (10.62), $H^T = F\Lambda^{1/2}Q^T$. Substituting this and (10.68) into (10.67),
$$\hat{x}(y) = K_{X|Y}H^T y. \tag{10.69}$$
This agrees with (10.51), since $K_Z = I$. Finally, from (10.68) and then (10.66),
$$K_{X|Y}^{-1} = F\big(K_{X_\phi|Y_\theta}\big)^{-1}F^T = F\big(\sigma^{-2}I + \Lambda\big)F^T = \sigma^{-2}I + H^T H,$$
which agrees with (10.52).

For the more general case when $X$ is not IID, one can transform $X$ into IID variables, $X' = AX$, and then apply the above approach to $X'$. Similarly, if $Z$ is not IID, one can first whiten it into IID variables $W = BZ$. Not much would be gained by this, since we already know the answer from (10.49) and (10.51), and the purpose of these examples was to make those results look less like linear algebra magic.

As a final generalization, assume $Y = HX + Z$ as before, but now assume $X$ has a mean, $\overline{X}$. Thus $\mathrm{E}[Y] = H\overline{X}$. Applying (10.49) and (10.51) to the fluctuations, $X - \overline{X}$ and $Y - H\overline{X}$, we have the alternative forms
$$\hat{x}(y) = \overline{X} + K_X H^T\big[HK_X H^T + K_Z\big]^{-1}\big(y - H\overline{X}\big)\,; \tag{10.70}$$
$$\hat{x}(y) = \overline{X} + K_{X|Y}H^T K_Z^{-1}\big(y - H\overline{X}\big). \tag{10.71}$$
The error covariance is still given by (10.50) and (10.52).

10.4.5 Vector recursive estimation

We now make multiple noisy vector observations, $y_1, y_2, \ldots$, of a single $m$-dimensional Gaussian rv $X \sim \mathcal{N}(\overline{X}, K_X)$. For each $i$, we assume the observation random vectors have the form
$$Y_i = H_i X + Z_i\,; \qquad Z_i \sim \mathcal{N}(0, K_{Z_i})\,; \qquad i \ge 1. \tag{10.72}$$
We assume that $X, Z_1, Z_2, \ldots$ are mutually independent and that $H_1, H_2, \ldots$ are known matrices. For each $i$, we want to find the MMSE estimate of $X$ based on $\{y_j;\, 1 \le j \le i\}$. We will do this recursively, using the estimate based on $y_1, \ldots, y_{i-1}$ to help in finding the estimate based on $y_1, \ldots, y_i$. Let $Y_{1;i}$ denote the first $i$ observation rv's and let $y_{1;i}$ denote the corresponding $i$ sample observations. Let $\hat{x}(y_{1;i})$ and $K_{X|Y_{1;i}}$ denote the MMSE estimate and error covariance, respectively, based on $y_{1;i} = y_1, y_2, \ldots, y_i$.

For $i = 1$, the result is the same as that in (10.70) and (10.71), yielding
$$\hat{x}(y_1) = \overline{X} + K_X H_1^T\big[H_1 K_X H_1^T + K_{Z_1}\big]^{-1}\big(y_1 - H_1\overline{X}\big)\,; \tag{10.73}$$
$$\hat{x}(y_1) = \overline{X} + K_{X|Y_1}H_1^T K_{Z_1}^{-1}\big(y_1 - H_1\overline{X}\big). \tag{10.74}$$
From (10.50) and (10.52), the error covariance is
$$K_{X|Y_1} = K_X - K_X H_1^T\big[H_1 K_X H_1^T + K_{Z_1}\big]^{-1}H_1 K_X\,; \tag{10.75}$$
$$K_{X|Y_1}^{-1} = K_X^{-1} + H_1^T K_{Z_1}^{-1}H_1. \tag{10.76}$$
Using the same argument that we used for the scalar case, we see that for each $i > 1$, the conditional mean of $X$, conditional on the observation $y_{1;i-1}$, is by definition $\hat{x}(y_{1;i-1})$, and the conditional covariance is $K_{X|Y_{1;i-1}}$.
Using these quantities in place of $\overline{X}$ and $K_X$ in (10.70) and (10.71) then yields
$$\hat{x}(y_{1;i}) = \hat{x}(y_{1;i-1}) + K_{X|Y_{1;i-1}}H_i^T\big[H_i K_{X|Y_{1;i-1}}H_i^T + K_{Z_i}\big]^{-1}\big(y_i - H_i\hat{x}(y_{1;i-1})\big)\,; \tag{10.77}$$
$$\hat{x}(y_{1;i}) = \hat{x}(y_{1;i-1}) + K_{X|Y_{1;i}}H_i^T K_{Z_i}^{-1}\big(y_i - H_i\hat{x}(y_{1;i-1})\big). \tag{10.78}$$
From (10.50) and (10.52), the error covariance is given by
$$K_{X|Y_{1;i}} = K_{X|Y_{1;i-1}} - K_{X|Y_{1;i-1}}H_i^T\big[H_i K_{X|Y_{1;i-1}}H_i^T + K_{Z_i}\big]^{-1}H_i K_{X|Y_{1;i-1}}\,; \tag{10.79}$$
$$K_{X|Y_{1;i}}^{-1} = K_{X|Y_{1;i-1}}^{-1} + H_i^T K_{Z_i}^{-1}H_i. \tag{10.80}$$

10.4.6 Vector Kalman filter

Finally, consider the vector case of recursive estimation on a sequence of $m$-dimensional time-varying Gaussian rv's, say $X_1, X_2, \ldots$. Assume that $X_1 \sim \mathcal{N}(\overline{X}_1, K_{X_1})$ and that $X_{i+1}$ evolves from $X_i$ for $i \ge 1$ according to the equation
$$X_{i+1} = A_i X_i + W_i\,; \qquad W_i \sim \mathcal{N}(0, K_{W_i})\,; \qquad i \ge 1. \tag{10.81}$$
Here, for each $i \ge 1$, $A_i$ is a known matrix. Noisy $n$-dimensional observations $Y_i$ are made,
$$Y_i = H_i X_i + Z_i\,; \qquad Z_i \sim \mathcal{N}(0, K_{Z_i})\,; \qquad i \ge 1, \tag{10.82}$$
where for each $i \ge 1$, $H_i$ is a known matrix and $K_{Z_i}$ is an invertible covariance matrix. Assume that $X_1$, $\{W_i;\, i \ge 1\}$ and $\{Z_i;\, i \ge 1\}$ are all independent.

As in the scalar case, we want to find the MMSE estimate of both $X_i$ and $X_{i+1}$ conditional on $Y_1, Y_2, \ldots, Y_i$. We denote these estimates as $\hat{x}_i(y_{1;i})$ and $\hat{x}_{i+1}(y_{1;i})$ respectively, and we denote the covariance of the error in these estimates as $K_{X_i|Y_{1;i}}$ and $K_{X_{i+1}|Y_{1;i}}$ respectively. For $i = 1$, the problem is the same as in (10.73)-(10.76). Alternate forms for the estimate and error covariance are
$$\hat{x}_1(y_1) = \overline{X}_1 + K_{X_1}H_1^T\big[H_1 K_{X_1}H_1^T + K_{Z_1}\big]^{-1}\big(y_1 - H_1\overline{X}_1\big)\,; \tag{10.83}$$
$$\hat{x}_1(y_1) = \overline{X}_1 + K_{X_1|Y_1}H_1^T K_{Z_1}^{-1}\big(y_1 - H_1\overline{X}_1\big)\,; \tag{10.84}$$
$$K_{X_1|Y_1} = K_{X_1} - K_{X_1}H_1^T\big[H_1 K_{X_1}H_1^T + K_{Z_1}\big]^{-1}H_1 K_{X_1}\,; \tag{10.85}$$
$$K_{X_1|Y_1}^{-1} = K_{X_1}^{-1} + H_1^T K_{Z_1}^{-1}H_1. \tag{10.86}$$
This means that, conditional on $Y_1 = y_1$, $X_1 \sim \mathcal{N}\big(\hat{x}_1(y_1), K_{X_1|Y_1}\big)$. Thus, conditional on $Y_1 = y_1$, $X_2 = A_1 X_1 + W_1$ is Gaussian with mean $A_1\hat{x}_1(y_1)$ and with covariance $A_1 K_{X_1|Y_1}A_1^T + K_{W_1}$. Thus
$$\hat{x}_2(y_1) = A_1\hat{x}_1(y_1)\,; \qquad K_{X_2|Y_1} = A_1 K_{X_1|Y_1}A_1^T + K_{W_1}. \tag{10.87}$$
Conditional on $Y_1 = y_1$, (10.87) gives the mean and covariance of $X_2$. Thus, given the additional observation $Y_2 = y_2$, where $Y_2 = H_2 X_2 + Z_2$, we can use (10.70) and (10.71) with this mean and covariance to get the alternative forms
$$\hat{x}_2(y_{1;2}) = \hat{x}_2(y_1) + K_{X_2|Y_1}H_2^T\big[H_2 K_{X_2|Y_1}H_2^T + K_{Z_2}\big]^{-1}\big(y_2 - H_2\hat{x}_2(y_1)\big)\,; \tag{10.88}$$
$$\hat{x}_2(y_{1;2}) = \hat{x}_2(y_1) + K_{X_2|Y_{1;2}}H_2^T K_{Z_2}^{-1}\big(y_2 - H_2\hat{x}_2(y_1)\big). \tag{10.89}$$
The covariance of the error is given by the alternative forms
$$K_{X_2|Y_{1;2}} = K_{X_2|Y_1} - K_{X_2|Y_1}H_2^T\big[H_2 K_{X_2|Y_1}H_2^T + K_{Z_2}\big]^{-1}H_2 K_{X_2|Y_1}\,; \tag{10.90}$$
$$K_{X_2|Y_{1;2}}^{-1} = K_{X_2|Y_1}^{-1} + H_2^T K_{Z_2}^{-1}H_2. \tag{10.91}$$
Continuing this same argument recursively for all $i > 1$, we obtain the Kalman filter equations,
$$\hat{x}_i(y_{1;i}) = \hat{x}_i(y_{1;i-1}) + K_{X_i|Y_{1;i-1}}H_i^T\big[H_i K_{X_i|Y_{1;i-1}}H_i^T + K_{Z_i}\big]^{-1}\big(y_i - H_i\hat{x}_i(y_{1;i-1})\big)\,; \tag{10.92}$$
$$\hat{x}_i(y_{1;i}) = \hat{x}_i(y_{1;i-1}) + K_{X_i|Y_{1;i}}H_i^T K_{Z_i}^{-1}\big(y_i - H_i\hat{x}_i(y_{1;i-1})\big)\,; \tag{10.93}$$
$$K_{X_i|Y_{1;i}} = K_{X_i|Y_{1;i-1}} - K_{X_i|Y_{1;i-1}}H_i^T\big[H_i K_{X_i|Y_{1;i-1}}H_i^T + K_{Z_i}\big]^{-1}H_i K_{X_i|Y_{1;i-1}}\,; \tag{10.94}$$
$$K_{X_i|Y_{1;i}}^{-1} = K_{X_i|Y_{1;i-1}}^{-1} + H_i^T K_{Z_i}^{-1}H_i\,; \tag{10.95}$$
$$\hat{x}_{i+1}(y_{1;i}) = A_i\hat{x}_i(y_{1;i})\,; \qquad K_{X_{i+1}|Y_{1;i}} = A_i K_{X_i|Y_{1;i}}A_i^T + K_{W_i}. \tag{10.96}$$
The alternative forms above are equivalent and differ in the size and type of matrix inversions required; these matrix inversions do not depend on the data, however, and thus can be precomputed.
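A minimal Python sketch of the recursion (10.92)-(10.96) follows (not from the text; the matrices $A$, $H$, $K_W$, $K_Z$, the prior, and the trajectory length are invented for illustration). The measurement update implements (10.92) and (10.94), and the time update implements (10.96); the inverse forms (10.93) and (10.95) are not used here.

```python
# Minimal sketch of the vector Kalman filter recursion (10.92)-(10.96).
import numpy as np

def kalman_step(x_pred, K_pred, y, H, K_Z):
    """Measurement update: from x_i(y_1:i-1) and K_{X_i|Y_1:i-1} plus the new
    observation y_i, return x_i(y_1:i) and K_{X_i|Y_1:i}  (eqs. 10.92, 10.94)."""
    S = H @ K_pred @ H.T + K_Z
    gain = K_pred @ H.T @ np.linalg.inv(S)
    x_upd = x_pred + gain @ (y - H @ x_pred)
    K_upd = K_pred - gain @ H @ K_pred
    return x_upd, K_upd

def kalman_predict(x_upd, K_upd, A, K_W):
    """Time update: x_{i+1}(y_1:i) = A_i x_i(y_1:i), with covariance (10.96)."""
    return A @ x_upd, A @ K_upd @ A.T + K_W

# --- tiny two-dimensional example with invented parameters ---
rng = np.random.default_rng(2)
A  = np.array([[0.9, 0.1], [0.0, 0.8]])
H  = np.array([[1.0, 0.0]])
K_W = 0.05 * np.eye(2)
K_Z = np.array([[0.2]])

x_true = np.array([1.0, -1.0])
x_hat, K_hat = np.zeros(2), np.eye(2)        # assumed prior mean and covariance of X_1
for i in range(50):
    y = H @ x_true + rng.multivariate_normal(np.zeros(1), K_Z)
    x_hat, K_hat = kalman_step(x_hat, K_hat, y, H, K_Z)     # estimate of X_i given y_1:i
    x_true = A @ x_true + rng.multivariate_normal(np.zeros(2), K_W)
    x_hat, K_hat = kalman_predict(x_hat, K_hat, A, K_W)     # prediction of X_{i+1}
print("final predicted state estimate:", x_hat)
```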
10.5 LLSE and the orthogonality principle

In the preceding sections, we have assumed jointly Gaussian vectors and derived MMSE estimates and error covariances. We have also shown that all those same results are valid for LLSE estimation of random vectors, whether or not the random vectors are Gaussian. We now rederive the basic LLSE estimation results directly and interpret them by a principle known as the orthogonality principle. This provides added insight into the results, and also provides some generalizations.

Suppose that $X$ is an arbitrary zero-mean random variable and $Y$ is an arbitrary $n$-dimensional zero-mean random vector with a non-singular covariance matrix $K_Y$. The LLSE estimate of $X$ from $Y$ is an estimate $\hat{x}(Y)$ of the form
$$\hat{x}(Y) = \alpha^T Y = \sum_{i=1}^n \alpha_i Y_i, \tag{10.97}$$
where the vector $\alpha = (\alpha_1, \ldots, \alpha_n)^T$ is chosen to minimize $\mathrm{E}\big[(X - \alpha^T Y)^2\big]$. Expanding this,
$$\mathrm{E}\big[(X - \alpha^T Y)^2\big] = \sigma_X^2 - 2K_{XY}\,\alpha + \alpha^T K_Y\,\alpha. \tag{10.98}$$
The derivative of (10.98) with respect to $\alpha$ is $-2K_{XY} + 2\alpha^T K_Y$ (see Exercise 10.4). Setting this equal to zero, we get
$$\alpha^T K_Y = K_{XY}\,; \qquad \alpha^T = K_{XY}K_Y^{-1}. \tag{10.99}$$
Assuming that this stationary point is actually the minimum,$^3$ the LLSE estimate is
$$\hat{x}(Y) = K_{XY}K_Y^{-1}\,Y. \tag{10.100}$$
This is the same answer as that in (10.7) for Gaussian MMSE estimation, except that $X$ here is one-dimensional. If $X$ is an $m$-dimensional rv, (10.100) can be applied to each component of $X$, and the result is $\hat{x}(Y) = K_{XY}K_Y^{-1}Y$, the same as (10.7). We concluded earlier that LLSE estimation for arbitrary zero-mean random vectors must be the same as MMSE for Gaussian rv's, so we have learned nothing new. The derivation here is simpler and more transparent than that of (10.7), but the approach in (10.7) was necessary to solve the MMSE estimation problem for Gaussian problems, and without understanding Gaussian problems, the restriction to linear estimates would seem somewhat artificial.

In order to get some added insight into the linear optimization we have just done, and to obtain some later generalizations, we now take an abstract linear algebra viewpoint. A real vector space $\mathcal{V}$ is a set of elements, called vectors, along with two operations, addition and scalar multiplication. Under the addition operation, any two vectors $X \in \mathcal{V}$, $Y \in \mathcal{V}$ can be added to produce another vector, denoted $X + Y$, in $\mathcal{V}$. Under scalar multiplication, any real number $\alpha$ (called a scalar) can be multiplied by any vector $X \in \mathcal{V}$ to produce another vector, denoted $\alpha X$, in $\mathcal{V}$. A real vector space must satisfy the following axioms for all vectors $X, Y, Z$ and all real numbers $\alpha, \beta$:

1. Addition is commutative and associative, i.e., $X + Y = Y + X$ and $(X + Y) + Z = X + (Y + Z)$.
2. Scalar multiplication is associative, i.e., $(\alpha\beta)X = \alpha(\beta X)$.
3. Scalar multiplication by 1 satisfies $1X = X$.
4. The distributive laws hold: $\alpha(X + Y) = (\alpha X) + (\alpha Y)$; $(\alpha + \beta)X = (\alpha X) + (\beta X)$.
5. There is a zero vector $0$ such that $X + 0 = X$ for all $X \in \mathcal{V}$.
6. For each $X \in \mathcal{V}$, there is a unique $-X \in \mathcal{V}$ such that $X + (-X) = 0$.

The reason for all this formalism is that vector space results can be applied to many situations other than $n$-tuples of real numbers.

---
$^3$ For those familiar with vector calculus, the second derivative (Hessian) matrix of (10.98) with respect to $\alpha$ is $K_Y$. The fact that this is positive definite guarantees that the stationary point is actually the minimum. We shall soon give a geometric argument that also shows that (10.99) achieves the minimum.
Some common examples are polynomials and other sets of functions. All that is necessary to apply the known results about vector spaces to a given situation is to check whether the given set of elements satisfies the axioms above. The particular set of elements of concern to us here is the set of zero-mean random variables defined in some given probability space. We can add zero-mean random variables to get other zero-mean random variables, we can scale random variables by scalars, and it is easy to verify that the above axioms are satisfied. It is important to understand that viewing a random variable as a vector is very different from the random vectors that we have been considering. A random vector $Y = (Y_1, Y_2, \ldots, Y_n)^T$ is a string of $n$ random variables, and is a function from the sample space to the space of $n$-dimensional real vectors. It is primarily a notational artifice to refer compactly to several random variables as a unit. Here, each random variable $Y_i$, $1 \le i \le n$, is viewed as a vector in its own right.

An abstract real vector space, as defined above, contains no notion of length or orthogonality. To achieve these notions, we must define an inner product $\langle X, Y\rangle$ as an additional operation on a real vector space, mapping pairs of vectors into real numbers. A real inner product vector space is a real vector space with an inner product operation that satisfies the following axioms for all vectors $X, Y, Z$ and all scalars $\alpha$:

1. $\langle X, Y\rangle = \langle Y, X\rangle$
2. $\alpha\langle X, Y\rangle = \langle \alpha X, Y\rangle$
3. $\langle X + Y, Z\rangle = \langle X, Z\rangle + \langle Y, Z\rangle$
4. $\langle X, X\rangle \ge 0$ with equality iff $X = 0$.

In a real inner product vector space, two vectors $X$ and $Y$ are defined to be orthogonal if $\langle X, Y\rangle = 0$. The length of a vector $X$ is defined to be $\|X\| = \sqrt{\langle X, X\rangle}$. Similarly, the distance between two vectors $X$ and $Y$ is $\|X - Y\| = \sqrt{\langle X - Y, X - Y\rangle}$. For the conventional vector space in which a vector $x$ is an $n$-tuple of real numbers, $x = (x_1, \ldots, x_n)^T$, the inner product $\langle x, y\rangle$ is conventionally taken to be $\langle x, y\rangle = x_1 y_1 + \cdots + x_n y_n$. With this convention, length and orthogonality take on their conventional geometric significance.

For the vector space of zero-mean random variables, the natural definition for an inner product is the covariance,
$$\langle X, Y\rangle = \mathrm{E}[XY]. \tag{10.101}$$
Note that the covariance satisfies all the axioms for an inner product above. In this vector space, two zero-mean random variables are orthogonal if they are uncorrelated. Also, the length of a random variable is its standard deviation.

Return now to the problem of finding the LLSE estimate of a zero-mean random variable $X$ from the observation of $n$ zero-mean random variables $Y_1, \ldots, Y_n$. The estimate here is a random variable that is required to be a linear combination of the observed variables,
$$\hat{x}(Y_1, \ldots, Y_n) = \sum_{i=1}^n \alpha_i Y_i. \tag{10.102}$$
Viewing the random variables $Y_1, \ldots, Y_n$ as vectors in the real inner product vector space discussed above, a linear estimate $\hat{x}(Y_1, \ldots, Y_n)$ must then be in the subspace$^4$ $\mathcal{S}$ spanned by $Y_1, \ldots, Y_n$. The error in the estimate is the vector (random variable) $V = X - \hat{x}(Y_1, \ldots, Y_n)$. Thus, in terms of this inner product space, the LLSE estimate $\hat{x}(Y_1, \ldots, Y_n)$ is the point $P \in \mathcal{S}$ that is closest to $X$. Finding the closest point in a subspace to a given vector is a fundamental and central problem for inner product vector spaces. As indicated in Figure 2, the closest point is found by dropping a perpendicular from the point $X$ to the subspace.

---
$^4$ A subspace of a vector space $\mathcal{V}$ is a subset of elements of $\mathcal{V}$ that constitutes a vector space in its own right. In particular, the linear combination of any two vectors in the subspace is also in the subspace; for this reason, subspaces are often called linear subspaces.
The point where this perpendicular intersects the subspace is called the projection $P$ of $X$ onto the subspace. Formally, $P$ is the projection of $X$ onto $\mathcal{S}$ if $\langle P - X, Y\rangle = 0$ for all $Y \in \mathcal{S}$. The following well-known result summarizes this:

Theorem 10.5.1 (Orthogonality principle). Let $X$ be a vector in a real inner product space $\mathcal{V}$, and let $\mathcal{S}$ be a finite-dimensional$^5$ subspace of $\mathcal{V}$. There is a unique vector $P \in \mathcal{S}$ such that $\langle X - P, Y\rangle = 0$ for all $Y \in \mathcal{S}$. That vector $P$ is the closest point in $\mathcal{S}$ to $X$, i.e., $\|X - P\| < \|X - Y\|$ for all $Y \in \mathcal{S}$, $Y \ne P$.

Figure 2: $P$ is the projection of the vector $X$ onto the subspace spanned by $Y_1$ and $Y_2$. That is, $X - P$ is orthogonal to all points in the subspace, and, as seen geometrically, the distance from $X$ to $P$ is smaller than the distance from $X$ to any other point $Y$ in the subspace. Note that this figure bears almost no resemblance to Figure 1. A vector there represented a string of sample values of random variables, whereas the vectors in Figure 2 are individual random variables; sample values do not appear at all.

Proof: Initially assume $X \notin \mathcal{S}$ and assume that a vector $P \in \mathcal{S}$ exists such that $\langle X - P, Y\rangle = 0$ for all $Y \in \mathcal{S}$. Let $Y \in \mathcal{S}$, $Y \ne P$, be arbitrary. The figure suggests that the three points $X$, $P$, $Y$ form a right triangle with sides $U = X - P$, $V = P - Y$, and hypotenuse $W = X - Y$. We first show that the squared length of the hypotenuse, $\langle W, W\rangle$, is equal to the sum of the squared lengths of the sides, $\langle U, U\rangle + \langle V, V\rangle$.$^6$ Here $W$, $U$, $V$ are vectors and $W = U + V$, i.e., $X - Y = (X - P) + (P - Y)$. Expanding the inner product,
$$\langle W, W\rangle = \langle U + V, U + V\rangle = \langle U, U\rangle + 2\langle U, V\rangle + \langle V, V\rangle.$$
Since $P \in \mathcal{S}$ and $Y \in \mathcal{S}$, $V = P - Y \in \mathcal{S}$, so $\langle U, V\rangle = 0$. Thus
$$\langle W, W\rangle = \langle U, U\rangle + \langle V, V\rangle. \tag{10.103}$$
Since $\langle V, V\rangle$ is positive for all $Y \ne P$, $\langle W, W\rangle > \langle U, U\rangle$, so $P$ is the unique point in $\mathcal{S}$ closest to $X$, and thus is the unique $P \in \mathcal{S}$ for which $\langle X - P, Y\rangle = 0$ for all $Y \in \mathcal{S}$. We assumed above that $X \notin \mathcal{S}$, but if $X \in \mathcal{S}$, we simply choose $P = X$, and the result follows immediately.

The remaining question in completing the proof is whether a $P$ exists such that $\langle X - P, Y\rangle = 0$ for all $Y \in \mathcal{S}$. Since $\mathcal{S}$ is finite-dimensional, we can choose a basis $Y_1, Y_2, \ldots, Y_n$ for $\mathcal{S}$. Then $\langle X - P, Y\rangle = 0$ for all $Y \in \mathcal{S}$ iff $\langle X - P, Y_i\rangle = 0$ for all $i$, $1 \le i \le n$. Representing $P$ as $\alpha_1 Y_1 + \alpha_2 Y_2 + \cdots + \alpha_n Y_n$, the question reduces to whether the set of linear equations
$$\langle X, Y_i\rangle = \sum_{j=1}^n \alpha_j\langle Y_j, Y_i\rangle\,; \qquad 1 \le i \le n \tag{10.104}$$
has a solution. But the vectors $Y_1, \ldots, Y_n$ are linearly independent, so it follows (see Exercise 10.5) that the matrix with elements $\langle Y_j, Y_i\rangle$ must be non-singular. Thus (10.104) has a unique solution and the proof is complete.

Coming back now to LLSE estimation, we have seen that the LLSE estimate $\hat{x}(Y_1, \ldots, Y_n)$ is the projection $P$ of $X$ onto the subspace spanned by $Y_1, \ldots, Y_n$.

---
$^5$ The theorem also holds for infinite-dimensional vector spaces if they satisfy a condition called completeness; this will be discussed later.
$^6$ The issue here is not whether the familiar Pythagorean theorem of plane geometry is correct, but rather whether, in an abstract vector space, the definition of orthogonality corresponds to the plane-geometry notion of a right angle.
Since $\langle X - P, Y\rangle = 0$ for all $Y$ in the subspace spanned by $Y_1, \ldots, Y_n$, it suffices to find the $P$ for which $\langle X - P, Y_i\rangle = 0$ for $1 \le i \le n$. Equivalently, $P$ must satisfy $\langle X, Y_i\rangle = \langle P, Y_i\rangle$ for $1 \le i \le n$. Since $P = \hat{x}(Y_1, \ldots, Y_n) = \sum_{j=1}^n \alpha_j Y_j$, this means that $\alpha_1, \ldots, \alpha_n$ must satisfy (10.104), which can be rewritten for zero-mean random variables as
$$\mathrm{E}[XY_i] = \sum_{j=1}^n \alpha_j\,\mathrm{E}[Y_j Y_i]\,; \qquad 1 \le i \le n. \tag{10.105}$$
This agrees (as it must) with (10.99). It turns out that (10.105) must have a solution whether or not $K_Y$ is non-singular, and according to the theorem, $P = \hat{x}(Y_1, \ldots, Y_n) = \sum_{j=1}^n \alpha_j Y_j$ is uniquely specified. The subtlety in the case where $K_Y$ is singular is that, although $P$ is uniquely specified, $(\alpha_1, \ldots, \alpha_n)$ is not. As usual, however, the singular case is best handled by eliminating the dependent observations from consideration.

Next, recall that the above discussion has been restricted to viewing zero-mean random variables as vectors. It is sometimes useful to remove this restriction and to consider the entire set of random variables in a given probability space as forming a real inner product vector space. In this generalization, the inner product is again defined as $\langle X, Y\rangle = \mathrm{E}[XY]$, but the correlation now includes the mean values, i.e., $\mathrm{E}[XY] = \mathrm{E}[X]\mathrm{E}[Y] + \mathrm{E}[\widetilde{X}\widetilde{Y}]$, where $\widetilde{X}$ and $\widetilde{Y}$ denote the fluctuations. Note that for any random variable $X$ with a non-zero mean, the difference between $X$ and its fluctuation $\widetilde{X}$ is a constant random variable equal to $\mathrm{E}[X]$. This constant random variable maps all sample points into the constant $\mathrm{E}[X]$. It must also be interpreted as a vector in the real inner product space under consideration (since the difference of two vectors must again be a vector). Let $D$ be the random variable (i.e., vector) that maps all sample points into unity. Thus $X - \widetilde{X} = \mathrm{E}[X]\,D$ is a constant equal to $\mathrm{E}[X]$.

The notion of a linear estimate involving variables with non-zero means is invariably extended to an affine estimate $\hat{x}(Y_1, \ldots, Y_n) = \beta + \sum_i \alpha_i Y_i$. Writing this in terms of the unit constant random variable $D$,
$$\hat{x}(Y_1, \ldots, Y_n) = \beta D + \sum_{i=1}^n \alpha_i Y_i. \tag{10.106}$$
In terms of vectors, the right side of (10.106) is a vector within the subspace spanned by $D, Y_1, Y_2, \ldots, Y_n$. We want to choose $\beta$ and $\alpha_1, \ldots, \alpha_n$ so that $P = \beta D + \sum_{j=1}^n \alpha_j Y_j$ is the projection of $X$ on the subspace $\mathcal{S}$ spanned by $D, Y_1, \ldots, Y_n$, i.e., so that $\langle X - P, Y\rangle = 0$ for all $Y \in \mathcal{S}$. This condition is satisfied iff both $\langle X - P, D\rangle = 0$ and $\langle X - P, Y_i\rangle = 0$ for $1 \le i \le n$. Now for any vector $Y$ we have $\langle Y, D\rangle = \mathrm{E}[YD] = \mathrm{E}[Y]$, so $\langle X - P, D\rangle = 0$ can be rewritten as
$$\mathrm{E}[X] = \beta + \sum_{i=1}^n \alpha_i\,\mathrm{E}[Y_i]. \tag{10.107}$$
Also, $\langle X - P, Y_i\rangle = 0$ can be rewritten as
$$\mathrm{E}[XY_i] = \beta\,\mathrm{E}[Y_i] + \sum_{j=1}^n \alpha_j\,\mathrm{E}[Y_i Y_j]. \tag{10.108}$$
Exercise 10.6 shows that (10.107) and (10.108) are equivalent to (10.10). Note that in the world of random variables, $\hat{x}(Y_1, \ldots, Y_n) = \beta + \sum_i \alpha_i Y_i$ is an affine function of $Y_1, \ldots, Y_n$, whereas it can also be viewed as a linear function of $D, Y_1, \ldots, Y_n$. In the vector space world, $\hat{x}(Y_1, \ldots, Y_n) = \beta D + \sum_i \alpha_i Y_i$ must be viewed as a vector in the linear subspace spanned by $D, Y_1, \ldots, Y_n$. The notion of linearity here is dependent on the context.

Next consider a situation in which we want to allow a limited amount of nonlinearity into an estimate. For example, suppose we consider estimating $X$ as a linear combination of the constant random variable $D$, the observation variables $Y_1, \ldots, Y_n$, and the squares of the observation variables $Y_1^2, \ldots, Y_n^2$.
For that quadratic estimate, note that a sample value yi of Yi also specifies the sample value yi² of Yi², and the sample value of D is always equal to 1. Our estimate is then

x̂(Y1, . . . , Yn) = βD + Σ_{j=1}^n (αj Yj + γj Yj²)   (10.109)

We wish to choose the scalars β, αj, and γj, 1 ≤ j ≤ n, to minimize the expected squared error between X and x̂(Y1, . . . , Yn). In terms of the inner product space of random variables, we want to estimate X as the projection P of X onto the subspace spanned by the vectors D, Y1, . . . , Yn, Y1², . . . , Yn². It is sufficient for X − P to be orthogonal to D, Yi, and Yi² for 1 ≤ i ≤ n, so that ⟨X, D⟩ = ⟨P, D⟩, ⟨X, Yi⟩ = ⟨P, Yi⟩, and ⟨X, Yi²⟩ = ⟨P, Yi²⟩ for 1 ≤ i ≤ n. Using (10.109), the coefficients β, αj, and γj must satisfy

E[X] = β + Σ_{j=1}^n (αj E[Yj] + γj E[Yj²])   (10.110)

E[XYi] = β E[Yi] + Σ_{j=1}^n (αj E[YiYj] + γj E[YiYj²])   (10.111)

E[XYi²] = β E[Yi²] + Σ_{j=1}^n (αj E[Yi²Yj] + γj E[Yi²Yj²])   (10.112)

This example should make one even more careful about the word "linear." The projection P is a linear function of the vectors D, Yi, Yi², 1 ≤ i ≤ n, but in the underlying probability space with random variables, Yi² is a nonlinear function of Yi.

This example can be extended with cross terms, cubic terms, and so forth. Ultimately, if we consider the subspace of all random variables g(Y1, . . . , Yn), we can view the situation much more simply by viewing X̂(Y1, . . . , Yn) as an arbitrary function of the observations Y1, . . . , Yn. The resulting estimate is the MMSE estimate, which, as we have seen, is the conditional mean of X, conditional on the observed sample values of Y1, . . . , Yn.

As we progress from linear (or affine) estimates to estimates with square terms to estimates with yet more terms, the mean square error must be non-increasing, since the simpler estimates are non-optimal versions of the more complex estimates with some of the terms set to zero. Viewing this geometrically, in terms of projecting from X onto a subspace S, the distance from X to the subspace must be non-increasing as the subspace is enlarged.

10.6 Complex rv's and inner products

A complex random variable (crv) is a mapping from the underlying sample space onto the complex numbers. A crv X can be viewed as a pair of real rv's, Xr and Xim, where for each sample point ω, Xr(ω) = ℜ[X(ω)] and Xim(ω) = ℑ[X(ω)]. Thus X = Xr + jXim, where j = √−1. The complex conjugate X* of a crv X is Xr − jXim. It is always possible to simply represent each crv as a pair of real rv's, but this often obscures insights. In what follows, we first look at vectors of crv's and their covariance matrices. We then look at crv's as vectors in their own right and extend the orthogonality principle to the corresponding complex inner product space.

Let X1, . . . , Xn be an n-tuple of crv's. We refer to X = (X1, . . . , Xn)ᵀ as a complex rv. It is a mapping from sample points into n-vectors of complex numbers. The complex conjugate, X*, of a complex rv X is the vector of complex conjugates, (X1*, . . . , Xn*)ᵀ. The covariance matrix of a zero mean complex rv X is defined as

K_X = E[X X*ᵀ]   (10.113)

The covariance matrix K of a complex rv has the special property that K = K*ᵀ. Such matrices are called Hermitian. A complex matrix is said to be positive definite (positive semi-definite) if it is Hermitian and if, for all complex non-zero vectors b, bᵀ K b* > 0 (≥ 0). A matrix K is a covariance matrix of a complex rv iff K is positive semi-definite.
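As a quick sanity check of (10.113) and the surrounding discussion, the sketch below (Python/NumPy, with an arbitrary made-up mixing matrix; names are illustrative only) forms a sample estimate of K_X = E[X X*ᵀ] and confirms numerically that it is Hermitian and positive semi-definite, i.e., that its eigenvalues are real and non-negative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw samples of a zero mean complex rv X = (X1, X2)^T by mixing
# independent complex noise through a fixed matrix A (made up for illustration).
n_samples = 200_000
A = np.array([[1.0 + 0.5j, 0.3 - 0.2j],
              [0.0 + 0.0j, 0.8 + 0.1j]])
W = (rng.normal(size=(2, n_samples)) + 1j * rng.normal(size=(2, n_samples))) / np.sqrt(2)
X = A @ W                                    # each column is one sample of X

# Sample estimate of the covariance matrix K_X = E[X X^{*T}]  (10.113)
K = X @ X.conj().T / n_samples

print("Hermitian check, max |K - K^{*T}| :", np.abs(K - K.conj().T).max())
eigvals = np.linalg.eigvalsh(K)              # eigvalsh assumes a Hermitian argument
print("eigenvalues (real, non-negative)  :", eigvals)
```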
All the eigenvalues of a Hermitian matrix are real, and the eigenvectors qᵢ (which in general are complex) can be chosen to span the space and to be orthogonal (in the sense that qᵢᵀ qⱼ* = 0 for i ≠ j). These properties of Hermitian matrices can be derived in much the same way as the properties of symmetric matrices were derived in Section 4 of the chapter on Gaussian random vectors.

There is a very important difference between covariance matrices of complex rv's and real rv's. The covariance matrix of a complex rv X does not contain all the second moment information about the 2n real rv's X1,r, X1,im, . . . , Xn,r, Xn,im. For example, in the one-dimensional case, we have

E[XX*] = E[(Xr + jXim)(Xr − jXim)] = E[Xr² + Xim²]   (10.114)

This only specifies the sum of the variances of Xr and Xim, but does not specify the individual variances or the covariance. As shown in Exercise 10.7, all of the second moment information can be extracted from the combination of K_X and a matrix E[X Xᵀ] sometimes known as the pseudo-covariance matrix.

In many of the situations in which complex random variables appear naturally, there is a type of circular symmetry which causes E[X Xᵀ] to be identically zero. To be more specific, a complex rv X is circularly symmetric if X and e^{jθ}X are statistically the same (i.e., have the same joint distribution function) for all θ. This implies that each complex random variable Xi, 1 ≤ i ≤ n, is circularly symmetric in the sense that if the joint density for Xi,r, Xi,im is expressed in polar form, it will be independent of the angle. It also implies that if two complex variables, Xi and Xk, are each rotated by the same angle, then the joint density will again be unchanged. If X is circularly symmetric, then

E[X Xᵀ] = E[(e^{jθ}X)(e^{jθ}X)ᵀ] = e^{2jθ} E[X Xᵀ]   (10.115)

This equation can only be satisfied for all θ if E[X Xᵀ] = 0. Thus for circularly symmetric complex rv's, the pseudo-covariance matrix is zero and the covariance matrix specifies all the second moments. We also define X and Y to be jointly circularly symmetric if (Xᵀ, Yᵀ)ᵀ is circularly symmetric.

Next, we extend the definition of real vector spaces to complex vector spaces. This is particularly easy since the definition of a complex vector space is the same as that of a real vector space except that the scalars are now complex numbers rather than real numbers. We must also extend inner products. A complex inner product space is a complex vector space V with an inner product ⟨X, Y⟩ that maps pairs of vectors X, Y into complex numbers. The axioms are the same as for real inner product spaces except that, for all X, Y ∈ V,

⟨X, Y⟩ = ⟨Y, X⟩*   (10.116)

As with real inner product spaces, two vectors X, Y are orthogonal if ⟨X, Y⟩ = 0, the length of X is ‖X‖ = √⟨X, X⟩, and the distance between X and Y is ‖X − Y‖. For the conventional complex vector space in which x is an n-tuple of complex numbers, the inner product ⟨x, y⟩ is conventionally taken to be ⟨x, y⟩ = x1 y1* + · · · + xn yn*. Note that this choice makes ⟨x, x⟩ ≥ 0, whereas without the complex conjugates, this would not be true.

The complex vector space of interest here is the set of zero mean crv's defined in some given probability space. The inner product for zero mean crv's is defined to be the covariance of the crv's, i.e.,

⟨X, Y⟩ = E[XY*]   (10.117)

It can be verified directly that this definition satisfies all the axioms of an inner product.
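The following sketch (Python/NumPy, illustrative parameters only) generates a circularly symmetric complex rv by passing complex noise with i.i.d. real and imaginary parts through a fixed made-up matrix, and checks numerically that the pseudo-covariance E[X Xᵀ] is essentially zero while the covariance E[X X*ᵀ] is not, as argued around (10.115).

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples = 200_000

# Circularly symmetric construction: i.i.d. real and imaginary parts,
# then a fixed complex mixing matrix B (illustrative values only).
B = np.array([[2.0 + 1.0j, 0.5j],
              [1.0,        1.0 - 0.5j]])
W = (rng.normal(size=(2, n_samples)) + 1j * rng.normal(size=(2, n_samples))) / np.sqrt(2)
X = B @ W

K_X      = X @ X.conj().T / n_samples    # covariance        E[X X^{*T}]
pseudo_K = X @ X.T        / n_samples    # pseudo-covariance E[X X^{T}]

print("max |E[X X^T]|    ~", np.abs(pseudo_K).max())   # should be near 0
print("max |E[X X^{*T}]| ~", np.abs(K_X).max())        # clearly non-zero
```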
The orthogonality principle can also be applied to complex inner product vector spaces. The proof is a minor modification of the proof in the real case and is left as Exercise 10.8. Theorem 10.6.1 (Orthogonality principle, complex case). Let X be a vector in a complex inner product space V, and let S be a finite dimensional subspace of V. There is a unique vector P 2 S such that hX P, Y i = 0 for all Y 2 S. That vector P is the closest point in S to X, i.e., k X P k<k X Y k for all Y 2 S, Y 6= P . 494 CHAPTER 10. ESTIMATION Consider LLSE estimation of a crv X from the crv’s Y1 , . . . , Yn . The LLSE estimate for crv’s is defined as X x̂(Y1 , . . . , Yn ) = ↵i Yi (10.118) i ⇥ ⇤ where ↵1 , . . . , ↵n are complex numbers chosen to minimize E |X x̂(Y1 , . . . , Yn )|2 . In terms of the complex vector space of zero mean crv’s, the quantity to be minimized is k X x̂(Y1 , . . . , Yn ) k. Thus, from the theorem, x̂(Y1 , . . . , Yn ) is the projection P of X on the space spanned by Y1 , . . . , Yn . Since hX P, Y i = 0 for all Y in the subspace spanned by Y1 , . . . , Yn , it suffices to find that P for which hX P, Yi i = 0 for 1 i n. P Equivalently, P must satisfy hX, Yi i = hP, Yi i for 1 i n. Since P = x̂(Y1 , . . . , Yn ) = nj=1 ↵j Yj , this means that ↵1 , . . . , ↵n must satisfy (10.104). Using (10.117), we have E [XYi⇤ ] = n X ↵j E [Yj Yi⇤ ] ; j=1 ;1 i n (10.119) ⇥ ⇤ If we define KXY as E XY ⇤T , this can be written in vector form as ↵T KY = KXY ; ↵T = KXY KY 1 (10.120) Thus the LLSE estimate x̂(Y ) is given by x̂(Y ) = KXY KY 1 Y , (10.121) This is valid whether or not the variables are circularly symmetric. This is somewhat surprising since the covariance matrices do not contain all the information about second moments in this case. We explain this shortly, but first, we look at LLSE estimation of a vector of complex variables, X , from the observation of Y . Since (10.121) can be used for the LLSE estimate of each component of X , the LLSE estimate of the vector X is given by x̂(Y ) = KX Y KY 1 Y , The estimation error matrix, KX |Y = E (x̂(Y ) KX |Y = KX ⇣ X ) X̂(Y ) ⇤T KX Y KY 1 KX Y (10.122) X ⌘⇤T is then given by (10.123) The surprising nature of these results is resolved by being careful about what linearity means. What we have found in (10.121) is the linear least squares estimate of the crv X based on the crv’s Y1 , . . . Yn . This is not the same as the linear least squares estimate of Xr and Xim based on Yi,r , Yi,im ; 1 i n. This can be seen by noting that an arbitrary linear transformation from two variables Yr , Yim to X̂r , X̂im is specified by a two by two matrix, which contains four arbitrary real numbers. An arbitrary linear transformation from a complex variable Y to a complex variable X̂ is specified by a single complex variable, i.e., by two real variables. 495 10.6. COMPLEX RV’S AND INNER PRODUCTS To further clarify the di↵erence between LLSE estimates on complex variables and LLSE estimates on the real and imaginary parts of the complex variables, consider estimating the crv X as a linear function of both Y and Y ⇤ , i.e., X x̂(Y ) = [↵i Yi + i Yi⇤ ] (10.124) i where ↵i and i are chosen, for each i, to minimize the mean squared error. From the orthogonality theorem, X X ⇤ hX , Yi i = ↵j hYj , Yi i + (10.125) j hYj , Yi i j hX , Yi⇤ i = X j j ↵j hYj , Yi⇤ i + X j ⇤ j hYj , Yi⇤ i (10.126) It can be seen from (10.124) that this estimate must have a real part and imaginary part that are linear functions of the real and imaginary parts of Y . 
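Before continuing, here is a small numerical sketch (Python/NumPy, with a made-up observation model Y_i = h_i X + Z_i; names and numbers are illustrative) of the complex LLSE estimate in (10.121): sample versions of K_XY and K_Y are formed, x̂(Y) = K_XY K_Y⁻¹ Y is computed, and the error is checked to be orthogonal to each Y_i in the sense of the inner product (10.117).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Circularly symmetric complex data: Y_i = h_i X + Z_i (a made-up linear model).
X = (rng.normal(size=n) + 1j * rng.normal(size=n)) / np.sqrt(2)
Z = (rng.normal(size=(2, n)) + 1j * rng.normal(size=(2, n))) / np.sqrt(2)
h = np.array([1.0 + 1.0j, 0.5 - 0.3j])
Y = h[:, None] * X + Z                        # shape (2, n)

# Sample second moments: K_XY = E[X Y^{*T}],  K_Y = E[Y Y^{*T}]
K_XY = (X[None, :] @ Y.conj().T) / n          # shape (1, 2)
K_Y  = (Y @ Y.conj().T) / n                   # shape (2, 2)

# LLSE estimate  x_hat = K_XY K_Y^{-1} Y      (10.121)
coeffs = K_XY @ np.linalg.inv(K_Y)            # row vector of alpha's
x_hat = (coeffs @ Y).ravel()
error = X - x_hat

print("mean squared error         :", np.mean(np.abs(error) ** 2))
print("orthogonality E[e Y_i^*]   :", (Y.conj() @ error) / n)   # should be ~ 0
```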
It can also be seen (see Exercise 10.9) that any such linear function of the real and imaginary parts can be represented in this way. Thus the LLSE estimate based on complex variables, using both the variables and their complex conjugates, is equivalent to the LLSE estimate for the real and imaginary parts.

Now consider the case where X and Y are jointly circularly symmetric. Then ⟨X, Yi*⟩ = 0 and ⟨Yj, Yi*⟩ = ⟨Yj*, Yi⟩ = 0. In this case, (10.126) implies that the coefficients of the conjugate terms Yj* in (10.124) are 0 for each j, and (10.124) then reduces to the linear estimate determined by (10.120). This means that the complex LLSE estimate is the same as the LLSE estimate on real and imaginary parts for the circularly symmetric case. Similarly, if X and Y are jointly circularly symmetric, the complex LLSE estimate in (10.122) is the same as the real LLSE estimate, and the error covariance is also the same.

The next question is when the circularly symmetric case arises. Consider the case Y = HX + Z and suppose that X and Z are independent and each circularly symmetric. The first thing to observe is that for any θ, e^{jθ}Y = H[e^{jθ}X] + e^{jθ}Z. Thus, since X and e^{jθ}X have the same distribution and Z and e^{jθ}Z have the same distribution, Y and e^{jθ}Y also have the same distribution. Thus Y is circularly symmetric. In the same way, we see that X and Y are jointly circularly symmetric. Thus, in this case, the complex LLSE estimate is again the same as the real LLSE estimate.

Finally, define X to be a complex Gaussian rv if ℜ(X) and ℑ(X) are jointly Gaussian. Similarly, X and Y are jointly complex Gaussian if the real and imaginary parts are collectively jointly Gaussian. For this jointly Gaussian case, the MMSE estimate of X from observation of Y is given by the MMSE estimate of the real and imaginary parts of X from observation of the real and imaginary parts of Y. This is the same as the LLSE estimate of X as a linear function of both Y and Y*. Finally, if X and Y are jointly circularly symmetric, the MMSE estimate is also given by (10.122).

For the circularly symmetric Gaussian case, it is also convenient to express the density of X in terms of the covariance matrix K_X. As shown in Exercises 10.10 and 10.11, this is given by

f_X(x) = exp[−x*ᵀ K_X⁻¹ x] / (πⁿ det(K_X))   (10.127)

10.7 Parameter estimation and the Cramer-Rao bound

We now focus on estimation problems in which there is no appropriate model for a priori probabilities on the quantity x to be estimated. We view x as a parameter which is known to lie in some interval on the real line. We can view the parameter x as a sample value of a random variable X whose distribution is unknown to us, or we can view it simply as an unknown value. When x is viewed simply as an unknown value, we do not have an overall probability space to work with; we only have a probability space for each individual value of x. We cannot take overall expected values of random variables, but can only take expected values given particular values of x. Strictly speaking, we cannot even regard the observation as a random vector, but can only view the observation as a distinct random vector for each value of x. We will abuse the notation somewhat, however, and refer to the observation as the random vector Y. We denote the expected value of Y, given x, by E_x(Y), but the overall expected value of Y is undefined.

Consider an estimation problem in which we want to estimate the parameter x, and we observe a sample value y of the observation Y. Let f(y|x) denote the probability density of Y at sample value y, given that the parameter has the value x.
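As a concrete illustration of working with the density f(y|x), the sketch below (Python/NumPy, illustrative numbers only) takes the standard model in which the observations are i.i.d. N(x, σ²) given the parameter x, evaluates the log of f(y|x) over a grid of candidate parameter values, and picks the maximizer; for this model the maximizing x is the sample mean, which the grid search reproduces.

```python
import numpy as np

rng = np.random.default_rng(4)

# Model: given the parameter x, the n observations are i.i.d. N(x, sigma^2).
x_true, sigma, n = 1.7, 2.0, 50
y = x_true + sigma * rng.normal(size=n)

def log_f(y, x, sigma):
    """Log of the density f(y|x) for i.i.d. Gaussian observations."""
    return (-0.5 * len(y) * np.log(2 * np.pi * sigma**2)
            - np.sum((y - x)**2) / (2 * sigma**2))

# Maximize ln f(y|x) over a grid of candidate parameter values.
grid = np.linspace(-5.0, 5.0, 2001)
ll = np.array([log_f(y, x, sigma) for x in grid])
x_ml_grid = grid[np.argmax(ll)]

print("grid-search ML estimate:", x_ml_grid)
print("sample mean            :", y.mean())   # the ML estimate for this model
```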
This is not a conditional probability in the usual sense, since x is only a parameter. We assume that small changes in x correspond to small changes in the density f (y |x), and in fact we assume that f (y |x) and ln[f (y |x)] are di↵erentiable with respect to x. Define Vx (y ) as Vx (y ) = @ ln(f (y |x)) 1 @(f (y |x) = @x f (y |x) @x (10.128) Recall that the maximum likelihood estimate X̂M L (y ) is that value of x that maximizes f (y |x) for the given observation y . x̂M L (y ) also is the x that maximizes ln f (y |x), and if the maximum occurs at a stationary point, it occurs where Vx (y ) = 0. Note that for each x, Vx (Y ) is a random variable, and as shown in the next equation, Vx (Y ) has zero mean for each x. Z Z 1 @f (y |x) Ex [Vx (Y )] = f (y |x)Vx (y ) dy = f (y |x) dy f (y |x) @x R @ f (y |x) dy @1 = = =0 (10.129) @x @x It is not immediately apparent why Vx (Y ) is a fundamental quantity, but as one indication of this, note that if f (y|x) is replaced by the likelihood ratio ⇤(y, x) in (10.128), the derivative remains the same. Thus Vx (y ) is the partial derivative, with respect to x, of the log likelihood ratio LLR(y , x). Now suppose that some value of x is selected (x is simply a parameter, and we assume no apriori probability distribution on it), and an observation y occurs according to the density f (y |x). An observer, given the value y , chooses an estimate x̂(y ) of x. Thus X̂(y ) is a function of y , and thus, for any given parameter value x, x̂(Y ) can be considered as a random variable (conditional on x). The Cramer-Rao bound gives us a lower bound on Ex [(x x̂(Y ))2 ], the second moment of x̂(Y ) around x, given x. One may think 497 10.7. PAREMETER ESTIMATIOON AND THE CRAMER-RAO BOUND of this as a communication channel with input x and output Y , with Y taking on the value y with probability density f (y |x). For each choice of x, one can in principle calculate Ex [(x x̂(Y ))2 ], and this depends on the particular estimate X̂(y ) and also on the particular x. One can make this estimate better for some values of x by making it poorer for others, but no matter how one does this,Ex [(x x̂(Y ))2 ] must, for each x, satisfy a soon to be derived lower bound called the Cramer-Rao bound. The bias7 , bX̂ (x) of an estimate x̂(y ) is defined as bX̂ (x) = Ex [x̂(Y ) x] (10.130) An estimate X̂ is called unbiased if bX̂ (x) = 0 for all x. Many people take it for granted that estimates should be unbiased, but there are many situations in which unbiased estimates do not exist, and others in which they exist but are not particularly desirable (see Exercise 10.12). The Fisher information J(x) is defined as the variance (conditional on x) of Vx (Y ). J(x) = V ARx [Vx (Y )] = Ex [(Vx (Y ))2 ] (10.131) For the example where Y , conditional on x, is N (x, 2 ),Vx (Y ) = (Y x)/ 2 , and thus 2 J(x) = 1/ (see Exercise 10.13). Viewing Y as a noisy measurement of x, we see that the Fisher information gets smaller as the measurement noise gets larger. The major reason for considering the Fisher information comes from the Cramer-Rao bound below, and one is advised to look at the bound and some examples rather than trying to get an abstract sense of why one might call this an information. Theorem 10.7.1 (Cramer-Rao bound). 
For each x, and any estimator x̂(Y), h i @b (x) 2 1 + X̂ @x V ARx [x̂(Y)] (10.132) J(x) Proof: Ex [Vx (Y )x̂(Y )] = = = = Z f (y |x)Vx (y )x̂(y )dy Z @f (y |x) x̂(y )dy @x Z @ f (y |x)x̂(y )dy @x (interchanging di↵erentiation and integration) @[x + bX̂ (x)] @Ex [x̂(Y )] = @x @x = 1+ 7 (from (10.128)) @bX̂ (x) @x (from (10.130) (10.133) Occasionally, if the parameter x is a sample of a random variable X, bias is defined di↵erently, as E [x̂(Y ) X]. With the first definition, which is the one we use consistently, bias is a function of x, whereas with the second definition, bias is a single number, averaged over X. 498 CHAPTER 10. ESTIMATION Let g Xx (Y ) = x̂(Y ) Ex (X̂(Y )) be the fluctuation in x̂(Y ) for given x. Since Vx (Y ) is zero mean, we know that h i @b (x) Ex Vx Y )g Xx (Y ) = Ex [Vx (Y )x̂(Y )] = 1 + X̂ @x (10.134) Since the normalized covariance between Vx and X̂(Y ) (for given x) must be at most 1, we have Ex2 [Vx (Y )g Xx (Y )] V ARx [Vx (Y )]V ARx [x̂(Y )] = J(x)V ARx [X̂(Y )] (10.135) Combining (10.134) and (10.135) yields (10.132). The usual quantity of interest is the mean square estimation error for any given x, i.e., Ex [(x̂(Y ) x)2 ]. Since this is equal to V ARx [x̂(Y )] + Ex2 [x̂(Y )] x, and since bX̂ = Ex [x̂(Y )] x, (10.132) is equivalent to ⇥ Ex (x̂(Y ) ⇤ x)2 h 1+ i @bX̂ (x) 2 @x J(x) ⇥ ⇤2 + bX̂ (x) (10.136) Finally, if we restrict attention to the special case of unbiased estimates, (10.136) becomes Ex [(x̂(Y ) x)2 ] 1 J(x) (10.137) This is the usual form of the Cramer-Rao inequality. Many people mistakenly believe that unbiased estimates must be better than biased, and that therefore biased estimates also satisfy (10.137), but this is untrue, and as mentioned before, there are many situations in which no unbiased estimates exist. 499 10.8. EXERCISES 10.8 Exercises Exercise 10.1. a) Consider the joint probability density fX,Z (x, z) = e z for 0 x z and fX,Z (x, z) = 0 otherwise. Find the pair x, z of values that maximize this density. Find the marginal density fZ (z) and find the value of z that maximizes this. b) Let fX,Z,Y (x, z, y) be y 2 e yz for 0 x z, 1 y 2 and be 0 otherwise. Conditional on an observation Y = y, find the joint MAP estimate of X, Z. Find fZ|Y (z|y), the marginal density of Z conditional on Y = y, and find the MAP estimate of Z conditional on Y = y. Exercise 10.2. a) Let X, Z1 , Z2 , . . . be independent zero mean Gaussian rv’s with vari2 , 2 , . . . respectively. Let Y = h X + Z for i ances X 1 and let Y i = (Y1 , . . . Yi )T . Use i i i Z1 (10.7) to show that the MMSE estimate of X from Y i = y i = (y1 , . . . , yi )T , is given by x̂(y i ) = i X aj yj ; aj = j=1 hj / Z2 j Pi 2 )+ 2 (1/ X n=1 hn / 2 Zn (10.138) Hint: Leta = KXY KY 1 and multiply by KY to solve for a. b) Use this to show that 1 2 X|Y1;i = 1 2 X + i X h2n / 2 Zn (10.139) n=1 c) Show that the solutions in 10.138 and 10.139 agree with those in 10.24 and 10.25. 1/2 Exercise 10.3. a) Show that the set of vectors {✓i = i H i ; 1im} is an orthonormal set, where the vectors i ; 1im satisfy (10.61) and are orthonormal and i > 0 for 1 i m. b) Show that ✓i is an eigenvector, with eigenvalue c) Show that HH T has n i, of HH T . m additional orthonormal eigenvectors, each of eigenvalue 0. d) Now assume that H T H has only m0 < m orthonormal eigenvectors with positive eigenvalues and has m m0 orthonormal eigenvectors with eigenvalue 0. Show that HH T has n m0 orthonormal eigenvectors with eigenvalue 0. e) Let Y = HX + Z . 
Show that for each eigenvector ✓i of HH T with eigenvalue 0, ✓iT Y = ✓iT Z and show that the random variable ✓iT Z is statistically independent of ✓jT Z for each eigenvector ✓j with a non-zero eigenvalue. ⇥ ⇤ 2 Exercise 10.4. a) Write out E (X a T Y )2 = X 2KXY a + a T KY a as a function of a1 , a2 , . . . , an and take the partial derivative of the function with respect to ai for each i, 1 i n. Show that the vector of these partial derivatives is 2KXY + 2a T KY . 500 CHAPTER 10. ESTIMATION Exercise 10.5. a) For a real inner product space, show that n vectors, Y1 , . . . , Yn are linearly dependent i↵ the matrix of inner products, {hYj , Yi i; 1 i, j n}, is singular. Exercise 10.6. a) Show that (10.107) and (10.108) agree with (10.10). Exercise 10.7. a) Let ( X = X1 , . . . , Xn )T be a zero mean complex rv with real and imaginary components Xr,j , Xim,j , 1jn respectively. Express E [Xr,j Xr,k ] , E [Xr,j Xim,k ] , E [Xim,j Xim,k ] , E [Xim,j X as functions of the components of KX and E [X X T ]. Exercise 10.8. a) Prove Theorem 5. Hint: Modify the proof of theorem 4 for the complex case. Exercise 10.9. Let Y = Yr + jYim be a complex random variable. For arbitrary real numbers a, b, c, d, find complex numbers ↵ and such that < [↵Y + Y ⇤ ] = aYr + bYi m = [↵Y + Y ⇤ ] = cYr + dYi m Exercise 10.10. (Derivation of circularly symmetric Gaussian density) Let X = X r + jX im be a zero mean circularly symmetric n dimensional Gaussian complex rv . Let U = (X Tr , X Tim )T be the corresponding 2n dimensional real rv . Let Kr = E [Xr XrT ] and Kri = T E [Xr Xim ]. a) Show that 2 KU = 4 b) Show that KU1 = and find the B, C for which this is true. c) Show that KX = 2(Kr Kri ). d) Show that KX 1 = 12 (B jC). h Kr Kr i Kri Kr B C C B 3 5 i e) Define fX (x ) = fU (u) for u = (vecxTr , x Tim )T and show that fX (x ) = . exp x ⇤ KX 1 x T p (2⇡)n det KU 501 10.8. EXERCISES f ) Show that det KU = h Kr + jKri Kr i jKr Kri Kr h Kr + jKri 0 Kri Kr jKri i . Hint: Recall that elementary row operations do not change the value of a determinant. g) Show that det KU = i . Hint: Recall that elementary column operations do not change the value of a determinant. h) Show that det KU = 2 2n (det KX )2 and from this conclude that (10.127) is valid. Exercise 10.11. (Alternate derivation of circularly symmetric Gaussian density). a) Let X be a circularly symmetric zero mean complex Gaussian rv with covariance 1. Show that fX (x) = exp x⇤ x ⇡ Hint: Note that the variance of the real part is 1/2 and the variance of the imaginary part is 1/2. b) Let X be an n dimensional circularly symmetric complex Gaussian zero mean random vector with KX = In . Show that fX (x ) = exp x ⇤T x ⇡n . c Let Y = HX where H is n by n and invertible. Show that ⇥ ⇤ exp y ⇤T H 1⇤T H 1 y fY (y ) = v⇡ n where v is dy /dx , the ratio of an incremental 2n dimensional volume element after being transformed by H to that before being transformed. d Exercise 10.12. Let Y = X 2 + Z where Z is a zero mean unit variance Gaussian random variable. Show that no unbiased estimate of X exists from observation of Y . Hint. Consider any x > 0 and compare with x. 502 CHAPTER 10. ESTIMATION Exercise 10.13. a) Assume that for each parameter value x, Y is Gaussian, N (x, 2 ). Show that Vx (y) as defined in (10.128) is equal to (y x)/ 2 and show that the Fisher information is equal to 1/ 2 . b) Show that the Cramer-Rao bound is satisfied with equality for ML estimation for this 2 ), the MMSE estimate satisfies the Cramer-Rao bound example. 
Show that if X is N(0, σ_X²), the MMSE estimate satisfies the Cramer-Rao bound with equality.

Exercise 10.14. a) Assume that Y is N(0, x). Show that Vx(y) as defined in (10.128) is Vx(y) = [y²/x − 1]/(2x). Verify that Vx(Y) is zero mean for each x. Find the Fisher information, J(x).