MA 575: Linear Models
Cedric E. Ginestet, Boston University
Revision: Probability and Linear Algebra
Week 1, Lecture 2
1 Revision: Probability Theory

1.1 Random Variables
A real-valued random variable is a function from a probability space (Ω, F, P) to a given domain (R, B). (The precise meanings of these spaces are not important for the remainder of this course.) Strictly speaking, therefore, a value or realization of that function can be written, for any ω ∈ Ω, as

X(ω) = x.

For notational convenience, we often omit any reference to the sample space, Ω. However, we still systematically distinguish a random variable from one of its realizations by using upper case for the former and lower case for the latter. A realization from a random variable is also referred to as an observed value. Note that this upper/lower case convention is not respected by Weisberg in the main textbook for this course.
1.2 Expectation Operator
The expectation or expected value of a real-valued random variable (r.v.) X with probability density function (pdf) p(x), defined over the real line, R, is given by

E[X] := \int_{R} x p(x) dx.

This corresponds to the “average value” of X over its domain. If the random variable X were instead defined over a discrete space \mathcal{X}, we would compute its expected value by replacing the integral with a summation, as follows,

E[X] := \sum_{x ∈ \mathcal{X}} x P[X = x].
Since the expectation operator, E[·], takes into account all of the values of a random variable, it acts on an upper-case X, and not on a single realization, x. A similar notation is adopted for the variance and covariance operators, below.
Crucially, the expectation operator, E[X], is a linear function of X. For any given real numbers
α, β ∈ R, we have
E[α + βX] = α + βE[X].
This linear relationship can be extended to any linear combination of random variables, X_1, ..., X_n, such that if we are interested in the expectation of α + \sum_{i=1}^{n} βX_i, we obtain

E[α + \sum_{i=1}^{n} βX_i] = α + β \sum_{i=1}^{n} E[X_i].
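As a quick numerical sanity check of this linearity property, here is a minimal Python sketch; it assumes numpy is available, and the constants, random seed, and three distributions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(575)

# Arbitrary constants and three arbitrary distributions (purely illustrative).
alpha, beta = 2.0, -0.5
n_draws = 1_000_000
X = np.column_stack([
    rng.normal(1.0, 2.0, n_draws),    # X_1 with E[X_1] = 1
    rng.exponential(3.0, n_draws),    # X_2 with E[X_2] = 3
    rng.uniform(-1.0, 5.0, n_draws),  # X_3 with E[X_3] = 2
])

lhs = np.mean(alpha + beta * X.sum(axis=1))  # estimate of E[alpha + sum_i beta X_i]
rhs = alpha + beta * X.mean(axis=0).sum()    # alpha + beta * sum_i E[X_i] (sample means)
print(lhs, rhs)  # both close to 2 - 0.5 * (1 + 3 + 2) = -1
```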
1.3 Variance Operator
The variance of a random variable X is defined as the expected squared difference between the observed values of X and its mean value,

Var[X] := E[(X − E[X])^2].

Since the Euclidean distance between two real numbers, a and b, is defined as d(a, b) := |a − b|, one may geometrically interpret the variance as the average squared distance of the x's from the mean value, E[X].
The variance operator is non-linear. Given a linear combination of uncorrelated random variables, α_0 + \sum_{i=1}^{n} β_i X_i, we have

Var[α_0 + \sum_{i=1}^{n} β_i X_i] = \sum_{i=1}^{n} β_i^2 Var[X_i].

The first term, α_0, has been eliminated because the variance of a constant is nil: since E[α_0] = α_0, it follows that Var[α_0] = E[(α_0 − E[α_0])^2] = 0. Finally, the standard deviation of a variable X is defined as

σ_X := \sqrt{Var[X]}.
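The following sketch checks the variance formula for a linear combination of uncorrelated variables by Monte Carlo; it assumes numpy, and uses independent normal draws (hence uncorrelated) with arbitrarily chosen coefficients and variances.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha0 = 4.0
beta = np.array([1.5, -2.0, 0.5])
sigma2 = np.array([1.0, 3.0, 0.25])     # Var[X_1], Var[X_2], Var[X_3]

n_draws = 1_000_000
# Independent (hence uncorrelated) normal draws with the variances above.
X = rng.normal(0.0, np.sqrt(sigma2), size=(n_draws, 3))

lin_comb = alpha0 + X @ beta
print(lin_comb.var())             # empirical Var[alpha_0 + sum_i beta_i X_i]
print(np.sum(beta**2 * sigma2))   # theoretical sum_i beta_i^2 Var[X_i]
```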
1.4 Covariance and Correlation
The covariance of two random variables, X and Y, is defined as the expected product of the differences between the observed values of these two random variables and their respective mean values,

Cov[X, Y] := E[(X − E[X])(Y − E[Y])] = Cov[Y, X],

where the covariance can be seen to be symmetric by inspection. This quantity describes how two random variables vary jointly. As for the variance operator, the covariance is non-linear, such that for any linear transformations of two random variables X and Y, we obtain

Cov[α_x + β_x X, α_y + β_y Y] = β_x β_y Cov[X, Y].

Also, observe that the covariance operator is a generalization of the variance, since

Cov[X, X] = Var[X].
The Pearson product-moment correlation coefficient between random variables X and Y is defined as the covariance between X and Y, standardized by the product of their respective standard deviations,

ρ(X, Y) := Cov[X, Y] / \sqrt{Var[X] · Var[Y]}.

The correlation coefficient is especially valuable because (i) it does not depend on the units of measurement, and (ii) its values lie between −1 and 1. The latter can be easily proved by an application of the Cauchy-Schwarz inequality, which states that |⟨x, y⟩|^2 ≤ ⟨x, x⟩ · ⟨y, y⟩, for any x, y ∈ R^d.
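As an illustration of these two properties, the minimal sketch below (assuming numpy; the simulated relationship between x and y is an arbitrary choice) computes the correlation coefficient directly from its definition and confirms that rescaling the units leaves it unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated variables: Y = 2 X + noise, in arbitrary units.
x = rng.normal(0.0, 1.0, 100_000)
y = 2.0 * x + rng.normal(0.0, 1.0, 100_000)

cov_xy = np.cov(x, y)[0, 1]
rho = cov_xy / np.sqrt(x.var(ddof=1) * y.var(ddof=1))
print(rho, np.corrcoef(x, y)[0, 1])       # both near 2 / sqrt(5) ≈ 0.894

# Changing units (e.g., x measured in centimetres instead of metres) rescales
# the covariance but leaves the correlation coefficient unchanged.
print(np.corrcoef(100.0 * x, y)[0, 1])
```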
1.5 Random Samples
We will often consider random samples from a given population, whose moments are controlled by some unknown parameters, such as a mean µ and a variance σ^2. Such a sample is commonly denoted as follows,

X_i \overset{iid}{\sim} f(µ, σ^2),   for all i = 1, ..., n,

where note that we are using upper case on the left-hand side for every index i from 1 to n. The variance of a linear combination of such random variables is determined by the dependence structure of the random variables in that sequence.
i. Independent and identically distributed (IID) random variables:

Var[α_0 + \sum_{i=1}^{n} β_i X_i] = \sum_{i=1}^{n} β_i^2 σ^2.
ii. Independent, but not identically distributed random variables:

Var[α_0 + \sum_{i=1}^{n} β_i X_i] = \sum_{i=1}^{n} β_i^2 Var[X_i].
iii. Neither independent, nor identically distributed random variables:

Var[α_0 + \sum_{i=1}^{n} β_i X_i] = \sum_{i=1}^{n} β_i^2 Var[X_i] + 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} β_i β_j Cov(X_i, X_j),

where the factor of two in the second term comes from the fact that the matrix of covariances is symmetric. This last formula is checked numerically in the sketch below.
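The sketch assumes numpy, and the coefficients and covariance matrix are arbitrary choices; it also shows that the same quantity can be written compactly as β^T Σ β, where Σ denotes the covariance matrix of the X_i.

```python
import numpy as np

rng = np.random.default_rng(2)

beta = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[2.0, 0.3, -0.4],       # an arbitrary covariance matrix
                  [0.3, 1.0,  0.2],
                  [-0.4, 0.2, 1.5]])

# Direct evaluation of the formula: diagonal terms plus twice the upper triangle.
n = len(beta)
direct = sum(beta[i]**2 * Sigma[i, i] for i in range(n)) \
       + 2 * sum(beta[i] * beta[j] * Sigma[i, j]
                 for i in range(n - 1) for j in range(i + 1, n))

# The same quantity in matrix form, beta^T Sigma beta.
print(direct, beta @ Sigma @ beta)

# Monte Carlo check with correlated Gaussian draws.
X = rng.multivariate_normal(np.zeros(n), Sigma, size=500_000)
print((X @ beta).var())   # close to the two values above
```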
1.6 Conditional Moments
Any conditional expectation is itself a random variable. Given two random variables Y and X, where X takes values in some (metric) space \mathcal{X}, with σ-algebra B_X, so that X : (Ω, F, P) → (\mathcal{X}, B_X), the conditional expectation of Y given X corresponds to the mapping,

E_{Y|X} : (Ω, F, P) → (R, B).

Thus, for any ω ∈ Ω with X(ω) = x, we have the following equivalence,

E_{Y|X}[Y | X = X(ω)] = E_{Y|X}[Y | X = x].
The essential rules governing the manipulation of conditional moments are as follows:

1. The law of total probability:

P[Y = y] = \sum_{x ∈ \mathcal{X}} P[Y = y | X = x] P[X = x].
2. The law of total expectation, or tower rule:

E_Y[Y] = E_X[E_{Y|X}[Y | X]] = \int_{\mathcal{X}} E_{Y|X}[Y | X = x] f_X(x) dx,

where f_X denotes the marginal density of X; in the discrete case, the integral is replaced by a sum weighted by P[X = x].
3. The law of total variance, or variance decomposition (illustrated numerically in the sketch below):

Var_Y[Y] = E_X[Var_{Y|X}[Y | X]] + Var_X[E_{Y|X}[Y | X]].
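The sketch assumes numpy and uses a simple hierarchical model chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative hierarchical model:
#   X ~ Uniform{1, 2, 3},  Y | X = x ~ Normal(mean = x, variance = x^2).
n_draws = 1_000_000
x = rng.integers(1, 4, n_draws)
y = rng.normal(loc=x, scale=x)

total_var = y.var()   # empirical Var[Y]

# E_X[ Var(Y | X) ] + Var_X( E[Y | X] ), computed from the known conditionals.
x_vals = np.array([1.0, 2.0, 3.0])
p_x = np.full(3, 1 / 3)
cond_mean = x_vals            # E[Y | X = x] = x
cond_var = x_vals ** 2        # Var[Y | X = x] = x^2
decomposed = np.sum(p_x * cond_var) \
           + np.sum(p_x * (cond_mean - np.sum(p_x * cond_mean)) ** 2)

print(total_var, decomposed)  # both close to 14/3 + 2/3 = 16/3 ≈ 5.33
```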
2 Revision: Linear Algebra

2.1 Basic Terminology
1. A matrix X is an arrayed arrangement of elements (X)_{ij} = x_{ij}.
2. A matrix X is of order r × c if it has r rows and c columns, with i = 1, ..., r and j = 1, ..., c.
3. A column vector is a matrix of order r × 1. (All vectors used in this course will be assumed to be column vectors.)
4. A matrix is said to be square if r = c.
5. A (square) matrix is symmetric if x_{ij} = x_{ji}, for all i = 1, ..., r, and j = 1, ..., c.
6. A square matrix is diagonal if x_{ij} = 0 for every i ≠ j.
7. The diagonal matrix whose diagonal elements are all equal to 1 is the identity matrix, denoted I_n.
8. A scalar is a matrix of order 1 × 1, or more precisely an element of the field supporting the vector space under scrutiny.
9. Two matrices, A and B, of respective orders r_A × c_A and r_B × c_B, are said to be conformable if their orders are suitable for performing a given operation. For instance, when performing addition on A and B, we require that r_A = r_B and c_A = c_B. (These notions are illustrated in the short numerical sketch after this list.)
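The sketch assumes numpy is available; the particular matrices are arbitrary.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # a matrix of order 2 x 3
print(X.shape)                          # (2, 3): r rows, c columns

v = np.array([[1.0], [2.0], [3.0]])     # a column vector, i.e. a matrix of order 3 x 1

D = np.diag([1.0, 2.0, 3.0])            # a diagonal (hence square) matrix
I3 = np.eye(3)                          # the identity matrix I_3

A = np.array([[1.0, 2.0],
              [2.0, 5.0]])
print(np.array_equal(A, A.T))           # True: A is symmetric, since a_ij = a_ji

# Conformability for addition: the two matrices must have the same order.
print((D + I3).shape)                   # (3, 3)
```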
2.2 Unary and Binary Operations

Matrix Addition and Subtraction
1. Matrix addition and subtraction are only defined between conformable matrices, such that if C = A + B, it follows that all three matrices have the same order.
2. These two operations are performed elementwise.
3. Addition is commutative, A + B = B + A.
4. Addition is associative, (A + B) + C = A + (B + C).
Matrix Multiplication

1. Matrix multiplication is a binary operation, which takes two matrices as arguments.
2. The matrix product of A and B, of respective orders r × t and t × c, is a matrix C of order r × c, with elements

(C)_{ij} = (AB)_{ij} = \sum_{k=1}^{t} a_{ik} b_{kj}.
3. Matrix multiplication is not commutative. That is, in general, AB ≠ BA (see the sketch after this list).
4. Matrix multiplication is associative, such that A(BC) = (AB)C. (However, these two sequences of
multiplications may differ in their computational complexity, i.e. in number of computational steps.)
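The following sketch, assuming numpy and arbitrary random matrices, verifies the elementwise definition of the product and illustrates associativity and the failure of commutativity.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(2, 3))   # order 2 x 3
B = rng.normal(size=(3, 4))   # order 3 x 4
C = A @ B                     # conformable product: order 2 x 4

# Element (i, j) of C is the sum over k of a_ik * b_kj.
i, j = 1, 2
print(C[i, j], sum(A[i, k] * B[k, j] for k in range(3)))

# Associativity holds (up to floating-point rounding); commutativity does not.
M = rng.normal(size=(4, 4))
print(np.allclose(A @ (B @ M), (A @ B) @ M))   # True
S, T = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
print(np.allclose(S @ T, T @ S))               # False, in general
```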
Matrix Transpose
1. Matrix transposition is a unary operation. It takes a single matrix as argument.
2. The transpose of a matrix A of order r × c is the matrix A^T of order c × r, with (A^T)_{ij} = (A)_{ji}.
3. The product of the transposes is equal to the transpose of the product in the opposite order, such that
(AB)^T = B^T A^T.
4. The dot product or (Euclidean) inner product of two vectors is obtained by transposing one of them, irrespective of the order, such that for any two vectors x and y of order n × 1, we have

x^T y = \sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{n} y_i x_i = y^T x.   (1)
5. The norm or length of a vector x is given by ||x|| = (x^T x)^{1/2}.
6. The inner product is distributive with respect to vector addition (see the sketch after this list):

(x − y)^T (x − y) = x^T x − 2 x^T y + y^T y.
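A short sketch verifying these transpose and inner-product identities numerically, assuming numpy; the matrices and vectors are arbitrary random draws.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
print(np.allclose((A @ B).T, B.T @ A.T))               # (AB)^T = B^T A^T

x = rng.normal(size=4)
y = rng.normal(size=4)
print(np.isclose(x @ y, y @ x))                        # x^T y = y^T x
print(np.isclose(np.linalg.norm(x), np.sqrt(x @ x)))   # ||x|| = (x^T x)^(1/2)
print(np.isclose((x - y) @ (x - y),
                 x @ x - 2 * (x @ y) + y @ y))         # distributivity identity
```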
Matrix Inverse
1. An n × n square matrix A is said to be invertible (or non-singular) if there exists an n × n matrix B, such that

AB = BA = I_n.

2. The inverse of a diagonal matrix D, provided its diagonal entries are nonzero, is obtained by inverting its diagonal elements elementwise, (D^{-1})_{ii} = (D_{ii})^{-1} (see the sketch below).
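A short numerical check of both points, assuming numpy; the particular matrices are arbitrary.

```python
import numpy as np

D = np.diag([2.0, 4.0, 0.5])   # diagonal matrix with nonzero diagonal entries
print(np.allclose(np.linalg.inv(D), np.diag(1.0 / np.diag(D))))   # True

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])     # an invertible 2 x 2 matrix
B = np.linalg.inv(A)
print(np.allclose(A @ B, np.eye(2)), np.allclose(B @ A, np.eye(2)))   # True True
```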
2.3 Random Vectors
Finally, we can combine the tools of probability theory with those of linear algebra, in order to consider the moments of a random vector. In these lecture notes, random vectors will be denoted by bold lower-case letters. This differs from the convention adopted by Weisberg in the textbook. Thus, a random vector y is defined as

y := [y_1, ..., y_n]^T.
The expectation of a (column) random vector y of order n × 1 is given by applying the expectation operator elementwise,

E[y] = \begin{bmatrix} E[y_1] \\ E[y_2] \\ \vdots \\ E[y_n] \end{bmatrix}.
As for single random variables, the expectation of a random vector is linear. For convenience, let a_0 and y be n × 1 vectors, and let A be an n × n matrix. Here, a_0 and A are non-random, whereas y is a random vector, as above. Then,

E[a_0 + Ay] = a_0 + A E[y],
which can be verified by observing that, for i = 1, ..., n, we have

(E[Ay])_i = E[\sum_{j=1}^{n} A_{ij} y_j] = \sum_{j=1}^{n} A_{ij} E[y_j] = (A E[y])_i.
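The sketch below checks this elementwise linearity by Monte Carlo; it assumes numpy, and the mean vector, matrix A, and vector a_0 are arbitrary choices. Draws of y are stacked as rows, so Ay corresponds to y @ A.T row by row.

```python
import numpy as np

rng = np.random.default_rng(6)

n, n_draws = 3, 500_000
mu = np.array([1.0, -2.0, 0.5])
y = rng.normal(loc=mu, scale=1.0, size=(n_draws, n))   # rows are draws of y

a0 = np.array([10.0, 0.0, -1.0])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0],
              [3.0, 0.0, 1.0]])

lhs = (a0 + y @ A.T).mean(axis=0)   # empirical E[a0 + A y]
rhs = a0 + A @ y.mean(axis=0)       # a0 + A E[y]
print(lhs, rhs)                     # agree up to Monte Carlo error
```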
Moreover, we can define the variance (or covariance) matrix of a random vector y of order n × 1, denoted Var[y], as a matrix of order n × n with diagonal entries

(Var[y])_{ii} = Var[y_i],

and off-diagonal entries

(Var[y])_{ij} = Cov[y_i, y_j].

Thus, altogether, the variance/covariance matrix of the random vector y is given by an outer product,

Var[y] := E[(y − E[y])(y − E[y])^T].
Explicitly, this gives the matrix

Var[y] = \begin{bmatrix}
  Var[y_1]      & Cov[y_1, y_2] & \cdots & Cov[y_1, y_n] \\
  Cov[y_2, y_1] & Var[y_2]      & \cdots & Cov[y_2, y_n] \\
  \vdots        & \vdots        & \ddots & Cov[y_{n-1}, y_n] \\
  Cov[y_n, y_1] & \cdots        & Cov[y_n, y_{n-1}] & Var[y_n]
\end{bmatrix}.
In contrast to the expectation of a random vector, the variance of an affine transformation of y is not linear in the transformation, since we have

Var[a_0 + Ay] = A Var[y] A^T.

By extension, the covariance of two random vectors y and z, each of order n × 1, is given by a matrix of order n × n,

Cov[y, z] = E[(y − E[y])(z − E[z])^T].   (2)
Moreover, both the variance and covariance of random vectors satisfy the classical decompositions of the variance and covariance operators for real-valued random variables. For two univariate random variables, X and Y, recall that we have Var[X] = E[X^2] − E[X]^2, and Cov[X, Y] = E[XY] − E[X]E[Y]. Similarly, for two n-dimensional random vectors x and y, we have

Var[x] = E[xx^T] − E[x]E[x]^T,   and   Cov[x, y] = E[xy^T] − E[x]E[y]^T.
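The following sketch checks Var[a_0 + Ay] = A Var[y] A^T and the decomposition Var[y] = E[yy^T] − E[y]E[y]^T on simulated Gaussian vectors; it assumes numpy, and the mean, covariance matrix, a_0, and A are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)

# An arbitrary mean vector and (positive-definite) covariance matrix.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, -0.3],
                  [0.2, -0.3, 1.5]])
y = rng.multivariate_normal(mu, Sigma, size=500_000)   # rows are draws of y

a0 = np.array([1.0, 2.0])
A = np.array([[1.0, 0.0, -1.0],
              [2.0, 1.0, 0.0]])

z = a0 + y @ A.T                              # draws of a0 + A y
print(np.round(np.cov(z, rowvar=False), 2))   # empirical Var[a0 + A y]
print(np.round(A @ Sigma @ A.T, 2))           # theoretical A Var[y] A^T

# Var[y] = E[y y^T] - E[y] E[y]^T, estimated from the sample.
emp = y.T @ y / len(y) - np.outer(y.mean(axis=0), y.mean(axis=0))
print(np.round(emp, 2))                       # close to Sigma
```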
3 Three Types of Independence

3.1 Probabilistic Independence
Given some probability space (Ω, F, P), two events A, B ∈ F are independent when

P[A ∩ B] = P[A] P[B].

Two random variables are independent when their cumulative distribution functions (CDFs) satisfy

F_{X,Y}(x, y) = F_X(x) F_Y(y),

or equivalently, in terms of their probability density functions (pdfs), f_{X,Y}(x, y) = f_X(x) f_Y(y).
3.2 Statistical Independence (Uncorrelated)
Two random variables are said to be statistically independent or uncorrelated when their covariance is nil,

Cov[X, Y] = 0.

Clearly, if X and Y are probabilistically independent, then Cov[X, Y] = E[XY] − E[X]E[Y] = 0.
3.3 Linear Independence
Two vectors of n realizations, x and y, drawn from two random variables X and Y, are linearly independent when there do not exist non-zero coefficients α_x, α_y ∈ R such that

α_x x + α_y y = 0,

where 0 is the n-dimensional vector with zero entries.
If two random variables are statistically independent, then any sequence of realizations from these random variables has a low probability of yielding two linearly dependent vectors of realizations. Making such a statement precise, however, would require an appeal to much more probability theory than is required for this course.

These observations provide us with a more precise interpretation of covariance as a measure of linear dependence: the larger the estimated covariance (in absolute value), the larger the probability of obtaining two vectors of realizations that are close to linearly dependent.
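The contrast between these notions can be illustrated numerically. In the sketch below (assuming numpy, with arbitrarily chosen sample size and noise level), independent draws produce a near-zero sample covariance and realization vectors of full rank, whereas strongly correlated draws produce realization vectors that are nearly proportional, i.e. close to linearly dependent.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1_000

# Probabilistically independent draws: the sample covariance is close to zero,
# and the two realization vectors are (numerically) linearly independent.
x = rng.normal(size=n)
y_indep = rng.normal(size=n)
print(np.cov(x, y_indep)[0, 1])                               # near 0
print(np.linalg.matrix_rank(np.column_stack([x, y_indep])))   # 2 (full rank)

# Strongly correlated draws: the realization vectors are close to linearly
# dependent, i.e. nearly proportional to one another.
y_dep = 3.0 * x + rng.normal(scale=0.01, size=n)
print(np.corrcoef(x, y_dep)[0, 1])                            # near 1
```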