MA 575 Linear Models: Cedric E. Ginestet, Boston University
Revision: Probability and Linear Algebra
Week 1, Lecture 2

1 Revision: Probability Theory

1.1 Random Variables

A real-valued random variable is a function from a probability space (Ω, F, P) to a given domain (ℝ, B). (The precise meanings of these spaces are not important for the remainder of this course.) Strictly speaking, therefore, a value or realization of that function can be written, for any ω ∈ Ω, as X(ω) = x. For notational convenience, we often omit any reference to the sample space, Ω. However, we still systematically distinguish a random variable from one of its realizations, by using an upper case for the former and a lower case for the latter. A realization of a random variable is also referred to as an observed value. Note that this upper/lower case convention is not respected by Weisberg, in the main textbook for this course.

1.2 Expectation Operator

The expectation or expected value of a real-valued random variable (r.v.) X with probability density function (pdf) p(x), defined over the real line ℝ, is given by

    E[X] := ∫_ℝ x p(x) dx.

This corresponds to the "average value" of X over its domain. If the random variable X were instead defined over a discrete space 𝒳, we would compute its expected value by replacing the integral with a summation, as follows,

    E[X] := Σ_{x ∈ 𝒳} x P[X = x].

Since the expectation operator, E[·], takes into account all of the values of a random variable, it acts on an upper case X, and not on a single realization, x. A similar notation is adopted for the variance and covariance operators, below.

Crucially, the expectation operator, E[X], is a linear function of X. For any given real numbers α, β ∈ ℝ, we have

    E[α + βX] = α + βE[X].

This linear relationship extends to any linear combination of random variables, X1, . . . , Xn, such that if we are interested in the expectation of α + Σ_{i=1}^n βi Xi, we obtain

    E[α + Σ_{i=1}^n βi Xi] = α + Σ_{i=1}^n βi E[Xi].

1.3 Variance Operator

The variance of a random variable X is defined as the expected squared difference between X and its mean value,

    Var[X] := E[(X − E[X])²].

Since the Euclidean distance between two real numbers, a and b, is defined as d(a, b) := |a − b|, one may geometrically interpret the variance as the average squared distance of the x's from the mean value, E[X].

The variance operator is non-linear. Given a linear combination of uncorrelated random variables, α0 + Σ_{i=1}^n βi Xi, we have

    Var[α0 + Σ_{i=1}^n βi Xi] = Σ_{i=1}^n βi² Var[Xi].

The first term, α0, has been eliminated because the variance of a constant is nil, since E[α0] = α0. Finally, the standard deviation of a variable X is defined as

    σ_X := √Var[X].

1.4 Covariance and Correlation

The covariance of two random variables, X and Y, is defined as the expected product of the differences between these two random variables and their respective mean values,

    Cov[X, Y] := E[(X − E[X])(Y − E[Y])] = Cov[Y, X],

since the covariance can be seen to be symmetric, by inspection. This quantity describes how two random variables vary jointly. As for the variance operator, the covariance is non-linear, such that for any linear transformations of two random variables X and Y, we obtain

    Cov[αx + βx X, αy + βy Y] = βx βy Cov[X, Y].

Also, observe that the covariance operator is a generalization of the variance, since Cov[X, X] = Var[X].

The Pearson product-moment correlation coefficient between random variables X and Y is defined as the covariance between X and Y, standardized by the product of their respective standard deviations,

    ρ(X, Y) := Cov[X, Y] / √(Var[X] · Var[Y]).

The correlation coefficient is especially valuable because (i) it does not depend on the units of measurement, and (ii) its values lie between −1 and 1. The latter can be easily proved by an application of the Cauchy-Schwarz inequality, which states that |⟨x, y⟩|² ≤ ⟨x, x⟩ · ⟨y, y⟩, for any x, y ∈ ℝᵈ.
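As an informal illustration (not part of the original notes), the bilinearity of the covariance and the unit-invariance of the correlation also hold for their sample counterparts, so they can be checked numerically. The sketch below assumes NumPy is available; the seed, sample size, and constants are arbitrary choices made for this example.

import numpy as np

rng = np.random.default_rng(575)   # arbitrary seed for reproducibility
n = 100_000
x = rng.normal(loc=2.0, scale=3.0, size=n)    # illustrative draws standing in for X
y = 0.5 * x + rng.normal(size=n)              # draws correlated with x

def cov(a, b):
    """Sample covariance of two equal-length arrays."""
    return np.mean((a - a.mean()) * (b - b.mean()))

# Bilinearity: Cov[ax + bx*X, ay + by*Y] = bx * by * Cov[X, Y]
# (exact for the sample covariance, up to floating-point rounding)
ax, bx, ay, by = 1.0, 2.0, -3.0, 0.5           # arbitrary constants
print(cov(ax + bx * x, ay + by * y))
print(bx * by * cov(x, y))                     # same value

# The correlation is unit-free: rescaling x and y leaves rho unchanged
rho = cov(x, y) / np.sqrt(cov(x, x) * cov(y, y))
rho_scaled = cov(100 * x, 0.001 * y) / np.sqrt(cov(100 * x, 100 * x) * cov(0.001 * y, 0.001 * y))
print(rho, rho_scaled)                         # essentially identical, and within [-1, 1]

The same standardized quantity could be obtained directly with np.corrcoef; the explicit computation above is only meant to mirror the definitions.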
1.5 Random Samples

We will often consider random samples from a given population, whose moments are controlled by some unknown parameters, such as a mean µ and a variance σ². Such a sample is commonly denoted as follows,

    Xi ~ iid f(µ, σ²),  for all i = 1, . . . , n,

where note that we are using an upper case, on the left-hand side, for every index i from 1 to n. The variance of a linear combination of random variables is determined by the relationships between the random variables in that sequence.

i. Independent and identically distributed (IID) random variables:

    Var[α0 + Σ_{i=1}^n βi Xi] = Σ_{i=1}^n βi² σ².

ii. Independent, but not identically distributed random variables:

    Var[α0 + Σ_{i=1}^n βi Xi] = Σ_{i=1}^n βi² Var[Xi].

iii. Neither independent, nor identically distributed random variables:

    Var[α0 + Σ_{i=1}^n βi Xi] = Σ_{i=1}^n βi² Var[Xi] + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^n βi βj Cov(Xi, Xj),

where the factor of two in the second term comes from the fact that the matrix of covariances is symmetric.

1.6 Conditional Moments

Any conditional expectation is itself a random variable. Given two random variables Y and X, where X takes values in some (metric) space 𝒳 with σ-algebra B_𝒳,

    X : (Ω, F, P) → (𝒳, B_𝒳),

the conditional expectation of Y given X corresponds to the mapping

    E_{Y|X} : (Ω, F, P) → (ℝ, B).

Thus, for any ω ∈ Ω with X(ω) = x, we have the following equivalence,

    E_{Y|X}[Y | X = X(ω)] = E_{Y|X}[Y | X = x].

The essential rules governing the manipulation of conditional moments are as follows:

1. The law of total probability (for a discrete X):

    P[Y] = Σ_{x ∈ 𝒳} P[Y | X = x] P[X = x].

2. The law of total expectation, or tower rule:

    E_Y[Y] = E_X[E_{Y|X}[Y | X]],

which, when X is continuous with density p(x), can be written as

    E_Y[Y] = ∫_𝒳 E_{Y|X}[Y | X = x] p(x) dx.

3. The law of total variance, or variance decomposition:

    Var_Y[Y] = E_X[Var_{Y|X}[Y | X]] + Var_X[E_{Y|X}[Y | X]].
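The variance decomposition can be checked by simulation. The sketch below is illustrative only and not part of the original notes: the hierarchical model (X Gamma-distributed, and Y conditionally normal with mean and variance both equal to X) is an arbitrary choice, and NumPy is assumed.

import numpy as np

rng = np.random.default_rng(1)       # arbitrary seed
n = 200_000

# Illustrative hierarchical model: X ~ Gamma(2, 1), and Y | X = x ~ Normal(x, x)
x = rng.gamma(shape=2.0, scale=1.0, size=n)
y = rng.normal(loc=x, scale=np.sqrt(x))

# Left-hand side: total variance of Y, estimated from the simulated draws
lhs = y.var()

# Right-hand side: E[Var[Y|X]] + Var[E[Y|X]].
# In this model, Var[Y | X = x] = x and E[Y | X = x] = x, so the decomposition
# reduces to E[X] + Var[X], estimated from the simulated X's.
rhs = x.mean() + x.var()

print(lhs, rhs)   # the two estimates agree up to Monte Carlo error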
2 Revision: Linear Algebra

2.1 Basic Terminology

1. A matrix X is an arrayed arrangement of elements, (X)ij = xij.
2. A matrix X is of order r × c if it has r rows and c columns, with i = 1, . . . , r and j = 1, . . . , c.
3. A column vector is a matrix of order r × 1. (All vectors used in this course will be assumed to be column vectors.)
4. A matrix is said to be square if r = c.
5. A matrix is symmetric if xij = xji, for all i = 1, . . . , r and j = 1, . . . , c.
6. A square matrix is diagonal if xij = 0 for every i ≠ j.
7. The diagonal matrix whose diagonal elements are all 1's is the identity matrix, denoted In.
8. A scalar is a matrix of order 1 × 1, or more precisely an element of the field supporting the vector space under scrutiny.
9. Two matrices, A and B, of respective orders rA × cA and rB × cB, are said to be conformable if their orders are suitable for performing a given operation. For instance, when performing addition on A and B, we require that rA = rB and cA = cB.

2.2 Unary and Binary Operations

Matrix Addition and Subtraction

1. Matrix addition and subtraction are only conducted between conformable matrices, such that if C = A + B, all three matrices have the same order.
2. These two operations are performed elementwise.
3. Addition and subtraction are commutative: A + B = B + A.
4. Addition and subtraction are associative: (A + B) + C = A + (B + C).

Matrix Multiplication

1. Matrix multiplication is a binary operation, which takes two matrices as arguments.
2. The matrix product of A and B, of respective orders r × t and t × c, is a matrix C of order r × c, with elements

    (C)ij = (AB)ij = Σ_{k=1}^t aik bkj.

3. Matrix multiplication is not commutative. That is, in general, AB ≠ BA.
4. Matrix multiplication is associative, such that A(BC) = (AB)C. (However, these two sequences of multiplications may differ in their computational complexity, i.e. in the number of computational steps.)

Matrix Transpose

1. Matrix transposition is a unary operation; it takes a single matrix as its argument.
2. The transpose of a matrix A of order r × c is the matrix A^T of order c × r, with (A^T)ij = (A)ji.
3. The transpose of a product is the product of the transposes in the opposite order, such that (AB)^T = B^T A^T.
4. The dot product or (Euclidean) inner product of two vectors is obtained by transposing either one of them, irrespective of the order, such that for any two vectors x and y of order n × 1, we have

    x^T y = Σ_{i=1}^n xi yi = Σ_{i=1}^n yi xi = y^T x.    (1)

5. The norm or length of a vector x is given by ||x|| = (x^T x)^{1/2}.
6. The inner product is distributive with respect to vector addition:

    (x − y)^T (x − y) = x^T x − 2 x^T y + y^T y.

Matrix Inverse

1. An n × n square matrix A is said to be invertible (or non-singular) if there exists an n × n matrix B such that AB = BA = In.
2. The inverse of a diagonal matrix D is obtained by inverting its diagonal elements elementwise, (D⁻¹)ii = (Dii)⁻¹.
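The multiplication, transpose, and inverse rules above are easy to verify numerically. The following minimal NumPy sketch is not part of the original notes; the matrices are chosen arbitrarily for illustration.

import numpy as np

# Arbitrary small matrices for illustration
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 1.0]])

# Multiplication is generally not commutative
print(np.allclose(A @ B, B @ A))           # False for these matrices

# Transpose of a product reverses the order: (AB)^T = B^T A^T
print(np.allclose((A @ B).T, B.T @ A.T))   # True

# The inverse of a diagonal matrix inverts its diagonal entries elementwise
D = np.diag([2.0, 5.0, 10.0])
print(np.allclose(np.linalg.inv(D), np.diag([0.5, 0.2, 0.1])))   # True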
2.3 Random Vectors

Finally, we can combine the tools of probability theory with those of linear algebra, in order to consider the moments of a random vector. In these lecture notes, random vectors will be denoted by bold lower cases. This differs from the convention adopted by Weisberg in the textbook. Thus, a random vector y is defined as

    y := [y1, . . . , yn]^T.

The expectation of a (column) random vector y of order n × 1 is obtained by applying the expectation operator elementwise,

    E[y] = [E[y1], E[y2], . . . , E[yn]]^T.

As for single random variables, the expectation of a random vector is linear. For convenience, let a0 and y be n × 1 vectors, and let A be an n × n matrix. Here, a0 and A are non-random, whereas y is a random vector, as above. Then,

    E[a0 + Ay] = a0 + A E[y],

which can be verified by observing that, for every i = 1, . . . , n, we have

    (E[Ay])i = E[Σ_{j=1}^n Aij yj] = Σ_{j=1}^n Aij E[yj] = (A E[y])i.

Moreover, we can define the variance (or covariance) matrix of a random vector y of order n × 1, denoted Var[y], as the matrix of order n × n with diagonal entries

    (Var[y])ii = Var[yi],

and off-diagonal entries

    (Var[y])ij = Cov[yi, yj].

Thus, altogether, the variance/covariance matrix of the random vector y is given by an outer product,

    Var[y] := E[(y − E[y])(y − E[y])^T].

Explicitly, this gives the matrix

    Var[y] = [ Var[y1]       Cov[y1, y2]   . . .   Cov[y1, yn]
               Cov[y2, y1]   Var[y2]       . . .   Cov[y2, yn]
               . . .         . . .         . . .   . . .
               Cov[yn, y1]   . . .   Cov[yn, yn−1]   Var[yn]  ].

In contrast to the expectation of a random vector, the variance of a transformed version of y is non-linear, since we have

    Var[a0 + Ay] = A Var[y] A^T.

By extension, the covariance of two random vectors y and z of order n × 1 is given by a matrix of order n × n,

    Cov[y, z] = E[(y − E[y])(z − E[z])^T].    (2)

Moreover, both the variance and covariance of random vectors satisfy the classical decompositions of the variance and covariance operators for real-valued random variables. For two univariate random variables, X and Y, recall that we have

    Var[X] = E[X²] − E[X]²,  and  Cov[X, Y] = E[XY] − E[X]E[Y].

Similarly, for two n-dimensional random vectors x and y, we have

    Var[x] = E[xx^T] − E[x]E[x]^T,  and  Cov[x, y] = E[xy^T] − E[x]E[y]^T.

3 Three Types of Independence

3.1 Probabilistic Independence

Given some probability space, two events A, B ∈ F are independent when

    P[A ∩ B] = P[A] P[B].

Two random variables are independent when their cumulative distribution functions (CDFs) satisfy

    F_{X,Y}(x, y) = F_X(x) F_Y(y),

or equivalently, in terms of their probability density functions (pdfs),

    f_{X,Y}(x, y) = f_X(x) f_Y(y).

3.2 Statistical Independence (Uncorrelated)

Two random variables are said to be statistically independent, or uncorrelated, when their covariance is nil,

    Cov[X, Y] = 0.

Clearly, if X and Y are probabilistically independent, then

    Cov[X, Y] = E[XY] − E[X]E[Y] = 0.

3.3 Linear Independence

Two vectors of n realizations, x and y, from two random variables X and Y are linearly independent when there do not exist any non-zero coefficients αx, αy ∈ ℝ such that

    αx x + αy y = 0,

where 0 is the n-dimensional vector with zero entries.

If two random variables are statistically independent, then a sequence of realizations from these random variables has a low probability of yielding two linearly dependent vectors of realizations. Making such a statement precise, however, would require an appeal to much more probability theory than is required for this course. These observations provide us with a more precise view of the covariance as a measure of linear dependence: the larger the estimated covariance, the more nearly linearly dependent the two vectors of realizations tend to be.
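To make the distinction between these notions concrete, the sketch below (again illustrative, not part of the original notes, with NumPy and arbitrary constants) draws realizations of two probabilistically independent random variables: their estimated covariance is close to zero, yet the two vectors of realizations are, with probability one, linearly independent, as a rank computation confirms.

import numpy as np

rng = np.random.default_rng(2024)    # arbitrary seed
n = 10_000

x = rng.normal(size=n)               # realizations of X
y = rng.normal(size=n)               # realizations of Y, drawn independently of X

# Statistical independence (uncorrelatedness): the sample covariance is near zero
print(np.cov(x, y)[0, 1])            # approximately 0, up to sampling error

# Linear independence of the realization vectors: the n x 2 matrix [x y] has rank 2,
# so no non-zero (alpha_x, alpha_y) satisfies alpha_x * x + alpha_y * y = 0
print(np.linalg.matrix_rank(np.column_stack([x, y])))      # 2

# By contrast, an exactly linearly dependent pair has rank 1
print(np.linalg.matrix_rank(np.column_stack([x, 2 * x])))  # 1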