Statistical Methods

"Never trust a statistic you didn't forge yourself." — Winston Churchill

Florian Herzog, 2013

Independent and identically distributed random variables

Definition 1. The random variables X_1, ..., X_n are called a random sample of size n from the population f(x) if X_1, ..., X_n are mutually independent random variables and the marginal pdf of each X_i is the same function f(x). Alternatively, X_1, ..., X_n are called independent and identically distributed random variables with pdf f(x). The joint pdf of X_1, ..., X_n is given as

f(x_1, x_2, ..., x_n) = f(x_1) f(x_2) · · · f(x_n) = ∏_{i=1}^{n} f(x_i).

The slides of this section follow closely Chapters 5 and 7 of the book "G. Casella and R. Berger, Statistical Inference, Duxbury Press, 2002".

Identically and independently distributed random variables

Often in statistics (especially in estimation) we assume identically and independently distributed (i.i.d.) random variables (r.v.). This means that the random variables X_k, where k = 1, 2, ... indexes the realizations of the r.v., have the following properties:

• Each X_k is drawn from the same density f(x).
• X_k is independent of X_{k−1}, X_{k−2}, ..., X_1.
• Each X_k is uncorrelated with each X_j, i.e. Cov(X_k, X_j) = 0 for all j ≠ k.

Sample mean and variance

Definition 2. The sample mean is the arithmetic average of the values in a random sample and is denoted as

X̄ = (1/n) ∑_{i=1}^{n} X_i.

Definition 3. The sample variance is the statistic defined as

S² = (1/(n−1)) ∑_{i=1}^{n} (X_i − X̄)².

The sample standard deviation is the statistic defined as S = √(S²).

Properties of i.i.d. random variables

Theorem 1. Let X_1, X_2, ..., X_n be independent and identically distributed (i.i.d.) random variables with mean µ = E[X_i] and variance σ² = Var[X_i]. Then

E[X̄] = µ,   Var[X̄] = σ²/n,   E[S²] = σ².

Properties of i.i.d. random variables

Theorem 2. Let X_1, X_2, ..., X_n be i.i.d. from a normal distribution with mean µ and variance σ². Then

1. X̄ and S² are independent random variables,
2. X̄ is distributed N(µ, σ²/n),
3. (n−1) S²/σ² has a chi-square distribution with n−1 degrees of freedom.
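The statements of Theorem 1 are easy to check numerically. Below is a minimal Monte Carlo sketch in Python (NumPy); the population N(µ = 2, σ = 3), the sample size n = 50, and the number of repetitions are chosen purely for illustration and are not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 50, 20_000   # illustrative values, not from the slides

xbar = np.empty(reps)
s2 = np.empty(reps)
for r in range(reps):
    x = rng.normal(mu, sigma, size=n)        # one i.i.d. sample of size n
    xbar[r] = x.mean()                       # sample mean
    s2[r] = x.var(ddof=1)                    # sample variance with the 1/(n-1) factor

print("E[Xbar]   ~", xbar.mean(), "  (theory:", mu, ")")
print("Var[Xbar] ~", xbar.var(),  "  (theory:", sigma**2 / n, ")")
print("E[S^2]    ~", s2.mean(),   "  (theory:", sigma**2, ")")
```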
Convergence of a sequence of r.v. {X_n}

1. The sequence {X_n} converges to X with probability one (or almost surely), X_n → X a.s., if

P({ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1.

2. The sequence {X_n} converges to X in probability, X_n →_p X, if

lim_{n→∞} P({ω ∈ Ω : |X_n(ω) − X(ω)| > ε}) = 0 for all ε > 0.

3. The sequence {X_n} converges to X in L^p, X_n →_{L^p} X, if

lim_{n→∞} E[|X_n(ω) − X(ω)|^p] = 0.

Convergence concepts interrelations

• Convergence in L^p implies convergence in L^q for q < p, and convergence in probability.
• Convergence with probability one (almost sure) implies convergence in probability.
• Convergence in probability implies convergence in distribution.

Parameter estimation (Point estimation)

In stochastic systems modeling, we often build models from data observation (and not from physical first principles). We need statistically motivated methods to identify the stochastic systems under consideration. The identification of the stochastic systems requires the following:

• Identification of the distribution
• Identification of the dynamics
• Identification of the system parameters
• Analysis of the parameter significance

In this section we focus only on parameter estimation and assume that the distribution is known. We will come back to this topic after the theoretical introduction of stochastic processes.

Parameter estimation (Point estimation)

Definition 1. A point estimator is any function W(X_1, X_2, ..., X_n) of a sample of random variables.

There are several ways of finding point estimators; the main ones are:

• Methods of moments (MM)
• Maximum likelihood estimators (MLE)
• Expectation maximization (EM)
• Bayes estimators

Besides the methods of finding a point estimator, we also need to evaluate the quality of the estimator. In the following slides, we introduce the method of moments and the maximum likelihood estimator.

Methods of moments

We have X_1, X_2, ..., X_n, a sample from a population with pdf f(x | θ_1, θ_2, ..., θ_k). The parameters θ_i are the distribution parameters, e.g. σ and µ in the case of a normal distribution.

Definition 2. The method of moments matches the first k sample moments of the data with the first k theoretical moments of the distribution. The theoretical moments are functions of the parameters, so the parameter estimation problem is reduced to solving k equations.

Methods of moments

We have

m_1 = (1/n) ∑_{i=1}^{n} X_i,    µ'_1 = E[X],
m_2 = (1/n) ∑_{i=1}^{n} X_i²,   µ'_2 = E[X²],
...
m_k = (1/n) ∑_{i=1}^{n} X_i^k,  µ'_k = E[X^k],

where m_i denotes the sample moments and µ'_i the theoretical moments.

Methods of moments

Since µ'_i is a function of θ_1, ..., θ_k, we get the following system of equations:

m_1 = µ'_1(θ_1, ..., θ_k),
m_2 = µ'_2(θ_1, ..., θ_k),
...
m_k = µ'_k(θ_1, ..., θ_k).

The parameters are found by solving this system of k moment equations.

Methods of moments

As the main example, we assume that the data are generated by a normal distribution with mean µ and variance σ². We denote θ_1 = µ and θ_2 = σ². The first and second moment equations of the normal distribution are given as

µ'_1 = µ = (1/n) ∑_{i=1}^{n} X_i,
µ'_2 = µ² + σ² = (1/n) ∑_{i=1}^{n} X_i².

Methods of moments

Solving for µ and σ² we get:

µ̂ = (1/n) ∑_{i=1}^{n} X_i,
σ̂² = (1/n) ∑_{i=1}^{n} X_i² − ((1/n) ∑_{i=1}^{n} X_i)² = (1/n) ∑_{i=1}^{n} (X_i − µ̂)².

The solutions are the sample moments of mean and variance and are of course the "natural" way of estimating the mean and variance of the normal distribution.
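As a quick illustration of the moment matching above, the following sketch (Python/NumPy; the data-generating values µ = 1 and σ = 2 are hypothetical) estimates µ and σ² of a normal sample by solving the two moment equations.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=10_000)  # illustrative data: true mu = 1, sigma^2 = 4

# first two sample moments
m1 = np.mean(x)        # (1/n) sum x_i
m2 = np.mean(x**2)     # (1/n) sum x_i^2

# solve  mu = m1  and  mu^2 + sigma^2 = m2
mu_hat = m1
sigma2_hat = m2 - m1**2   # equals the (biased, 1/n) sample variance

print("method-of-moments estimates:", mu_hat, sigma2_hat)
```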
Maximum Likelihood Estimation

The likelihood is the joint pdf of X_1, ..., X_n and is given as:

L(θ_1, ..., θ_k | x_1, x_2, ..., x_n) = ∏_{i=1}^{n} f(x_i | θ_1, ..., θ_k).

We denote x = [x_1, x_2, ...]^T and θ = [θ_1, θ_2, ...]^T.

Definition 3. For each sample x, let θ̂(x) be a parameter value at which L(θ|x) attains its maximum as a function of θ. A maximum likelihood estimator (MLE) of the parameter θ based on the sample X is θ̂(X).

If the likelihood function is C², then possible candidates for the MLE are the values of θ which solve

∂L(θ_1, ..., θ_k | x_1, x_2, ..., x_n)/∂θ_i = 0.

Maximum log-Likelihood Estimation

Theorem 1. Maximum likelihood estimation is equivalent to maximum log-likelihood estimation. The log-likelihood is defined as

l(θ_1, ..., θ_k | x_1, x_2, ..., x_n) = ∑_{i=1}^{n} log f(x_i | θ_1, ..., θ_k).

Example: We want to derive the maximum likelihood estimator for the mean (µ) of the normal distribution under the assumption of known variance σ². The log-pdf of the normal distribution is given as:

log f(x_i | µ) = −(1/2) ( log(2πσ²) + (x_i − µ)²/σ² ).

Maximum log-Likelihood Estimation

The log-likelihood function is given as:

l(µ | x_1, x_2, ..., x_n) = −∑_{i=1}^{n} (1/2) ( log(2πσ²) + (x_i − µ)²/σ² ).

Since σ is known, the maximization problem reduces to a least squares problem:

min_µ ∑_{i=1}^{n} (x_i − µ)²,

which has the solution

µ̂ = (1/n) ∑_{i=1}^{n} x_i.

Invariance of Maximum Likelihood Estimation

Theorem 2. The invariance property of MLEs states that if θ̂ is the MLE of θ, then for any function τ(θ), the MLE of τ(θ) is τ(θ̂).

Suppose that a distribution is parameterized by a parameter θ, but we are interested in finding an estimator for some function of θ, say τ(θ); then we can still use the MLE for θ. An example is as follows: if θ is the mean of a normal distribution, the MLE of sin(θ) is sin(µ̂).

Quality of estimators: MSE

Definition 4. The mean squared error (MSE) of an estimator W of a parameter θ is the function of θ defined by E_θ[(W − θ)²]. The MSE of W measures the average squared distance between the estimator and the true value of the parameter. The MSE has the following decomposition:

E_θ[(W − θ)²] = Var_θ[W] + (E_θ[W] − θ)².

The first term is the variance of the estimator W and the second term is the squared bias.

Definition 5. The bias of an estimator W of the parameter θ is the distance between the expected value of W and the true value of θ. An estimator whose bias is zero is called an unbiased estimator.

Quality of estimators: Bias and variance

In the multivariate case, where θ is a vector of parameters, the variance of the estimator is a covariance matrix Cov(θ̂). An estimator with low variance (covariance) is called an efficient estimator (in the sense that few data are needed). The MSE often involves a trade-off between an unbiased estimator with higher variance and a biased but efficient estimator. The true value of the MSE can often not be determined, since the true value of θ is not known. Therefore, we focus on the variance of the estimator in order to describe the quality of the estimator.

Definition 6. An estimator is called consistent when θ̂ →_p θ, where →_p denotes convergence in probability. An unbiased estimator whose variance tends to zero as n → ∞ is also consistent.

Quality of estimators: Normal distribution example

When we have X_1, X_2, ... i.i.d. data from a N(µ, σ²) distribution and use the sample mean X̄ and sample variance S² as estimators:

• E[X̄] = µ and therefore X̄ is unbiased.
• E[S²] = σ² and therefore S² is unbiased.
• E[(X̄ − µ)²] = Var[X̄] = σ²/n.
• E[(S² − σ²)²] = Var[S²] = 2σ⁴/(n−1).

The MSE of X̄ is still σ²/n when the data are not normal, but this does not hold for the MSE of S² when the data are not normally distributed.

Quality of estimators: Cramer-Rao bound for the variance

Definition 7. The Fisher information matrix J is defined as

J_{i,j} = (1/n) E_θ[ (∂l(θ_1, ..., θ_k | x_1, ..., x_n)/∂θ_i) · (∂l(θ_1, ..., θ_k | x_1, ..., x_n)/∂θ_j) ],

which is known as the outer product form. Under certain regularity conditions, and when the log-likelihood function is C², it can be calculated as

J_{i,j} = −(1/n) E_θ[ ∂²l(θ_1, ..., θ_k | x_1, ..., x_n)/(∂θ_i ∂θ_j) ],

which is called the inner product form. Note that the expectation is conditional on θ.

Quality of estimators: Cramer-Rao bound for the variance

The Cramer-Rao bound states the following:

Theorem 3. The covariance of an unbiased estimator W is bounded by

Cov(W) ≥ J^{−1}/N,

where N is the number of observations. This bound can also be used to make a worst-case assessment of the efficiency of an estimator. The Fisher information matrix allows us to compute the uncertainty and thus the quality of an estimator.
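To connect the running example with the Cramer-Rao bound, here is a small sketch (Python with NumPy and SciPy; the sample with µ = 0.5 and σ = 1.5 is hypothetical) that maximizes the log-likelihood over µ numerically, compares the maximizer with the closed-form sample mean, and evaluates the bound J^{−1}/N = σ²/N (since J = 1/σ² per observation when σ is known).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
mu_true, sigma = 0.5, 1.5                      # illustrative values, not from the slides
x = rng.normal(mu_true, sigma, size=200)
n = x.size

def neg_log_likelihood(mu):
    # negative log-likelihood of N(mu, sigma^2) with sigma known
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu)**2 / sigma**2)

res = minimize_scalar(neg_log_likelihood)      # numerical MLE of mu
mu_mle = res.x

print("numerical MLE :", mu_mle)
print("sample mean   :", x.mean())                          # closed-form solution from the slides
print("Cramer-Rao bound for Var[mu_hat]:", sigma**2 / n)    # J = 1/sigma^2 per observation
```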
Quality of estimators

MLE is the main method for finding estimators, since it has the following properties:

• Consistency: the estimator converges in probability to the value being estimated.
• Asymptotic normality: as the sample size increases, the distribution of the MLE tends to a Gaussian distribution with mean θ and covariance matrix equal to the inverse of the Fisher information matrix (a simulation sketch of this property follows below).
• Efficiency: the MLE achieves the Cramer-Rao lower bound when the sample size tends to infinity. This means that no asymptotically unbiased estimator has lower asymptotic mean squared error than the MLE.
• The estimate of θ is approximately distributed as N(θ_ML, J_N^{−1}).
• Barrett's Theorem: "The maximum-likelihood procedure in any problem is what you are most likely to do if you don't know any statistics."
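A minimal simulation sketch of the asymptotic normality and efficiency claims, under assumed values µ = 1 and σ = 2 (so that J = 1/σ²): for the normal-mean example the MLE is the sample mean, so √n (µ̂ − µ) should have variance close to J^{−1} = σ² and roughly Gaussian quantiles as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.0, 2.0                 # illustrative true parameters; J = 1/sigma^2
reps = 5_000

for n in (10, 100, 1000):
    samples = rng.normal(mu, sigma, size=(reps, n))
    mle = samples.mean(axis=1)                      # MLE of mu = sample mean
    scaled = np.sqrt(n) * (mle - mu)                # scaled estimation error
    print(f"n={n:5d}  Var[sqrt(n)*(mu_hat - mu)] = {scaled.var():.3f}"
          f"   (1/J = {sigma**2:.3f})")
    # for a Gaussian limit, the empirical 97.5% quantile approaches 1.96 * sigma
    print(f"          97.5% quantile = {np.quantile(scaled, 0.975):.3f}"
          f"   (normal value = {1.96 * sigma:.3f})")
```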