Chapter 2: Basics from Probability Theory
and Statistics
2.1 Probability Theory
Events, Probabilities, Random Variables, Distributions, Moments
Generating Functions, Deviation Bounds, Limit Theorems
Basics from Information Theory
2.2 Statistical Inference: Sampling and Estimation
Moment Estimation, Confidence Intervals
Parameter Estimation, Maximum Likelihood, EM Iteration
2.3 Statistical Inference: Hypothesis Testing and Regression
Statistical Tests, p-Values, Chi-Square Test
Linear and Logistic Regression
mostly following L. Wasserman, with additions from other sources
2.2 Statistical Inference:
Sampling and Estimation
A statistical model is a set of distributions (or regression functions),
e.g., all unimodal, smooth distributions.
A parametric model is a set that is completely described by
a finite number of parameters,
(e.g., the family of Normal distributions).
Statistical inference: given a sample X1, ..., Xn, how do we
infer the distribution or its parameters within a given model?
For multivariate models with one specific "outcome (response)"
variable Y, this is called prediction or regression;
for a discrete outcome variable it is also called classification.
r(x) = E[Y | X=x] is called the regression function.
Statistical Estimators
A point estimator for a parameter θ of a probability distribution is a
random variable X derived from a random sample X1, ..., Xn.
Examples:
Sample mean: $\bar{X} := \frac{1}{n} \sum_{i=1}^{n} X_i$
Sample variance: $S^2 := \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$
An estimator T for parameter θ is unbiased
if E[T] = θ;
otherwise the estimator has bias E[T] − θ.
An estimator on a sample of size n is consistent
if $\lim_{n \to \infty} P[\,|T - \theta| \le \epsilon\,] = 1$ for each ε > 0.
Sample mean and sample variance
are unbiased, consistent estimators with minimal variance.
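As a quick illustration, a minimal numpy sketch (the data array is hypothetical) of the two estimators defined above; note that np.var needs ddof=1 to produce the unbiased sample variance with the 1/(n−1) factor.

```python
import numpy as np

# hypothetical sample of n observations
x = np.array([2.1, 3.4, 1.8, 4.0, 2.7, 3.1])
n = len(x)

sample_mean = x.sum() / n                                # (1/n) * sum of X_i
sample_var = ((x - sample_mean) ** 2).sum() / (n - 1)    # unbiased: 1/(n-1)

# the same via numpy built-ins (ddof=1 gives the n-1 denominator)
assert np.isclose(sample_mean, np.mean(x))
assert np.isclose(sample_var, np.var(x, ddof=1))
```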
Estimator Error
Let θ̂n = T(X1, ..., Xn) be an estimator for parameter θ over the sample X1, ..., Xn.
The distribution of θ̂n is called the sampling distribution.
The standard error for θ̂n is: $se(\hat{\theta}_n) = \sqrt{Var[\hat{\theta}_n]}$
The mean squared error (MSE) for θ̂n is:
$MSE(\hat{\theta}_n) = E[(\hat{\theta}_n - \theta)^2] = bias^2(\hat{\theta}_n) + Var[\hat{\theta}_n]$
If bias → 0 and se → 0 then the estimator is consistent.
The estimator θ̂n is asymptotically Normal if
$(\hat{\theta}_n - \theta)/se$ converges in distribution to the standard Normal N(0,1).
Nonparametric Estimation
The empirical distribution function F̂n is the cdf that
puts probability mass 1/n at each data point Xi:
$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x)$
A statistical functional T(F) is any function of F,
e.g., mean, variance, skewness, median, quantiles, correlation.
The plug-in estimator of θ = T(F) is: $\hat{\theta}_n = T(\hat{F}_n)$
Instead of the full empirical distribution, often compact data synopses
may be used, such as histograms where X1, ..., Xn are grouped into
m cells (buckets) c1, ..., cm with bucket boundaries lb(ci) and ub(ci) s.t.
lb(c1) = −∞, ub(cm) = +∞, ub(ci) = lb(ci+1) for 1 ≤ i < m, and
$freq(c_i) = \frac{1}{n} \sum_{j=1}^{n} I(lb(c_i) < X_j \le ub(c_i))$
Histograms provide a (discontinuous) density estimator.
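A small sketch (data values hypothetical) of the plug-in idea: the empirical cdf puts mass 1/n on each data point, and a histogram is a bucketed synopsis of the same information.

```python
import numpy as np

x = np.array([0.3, 1.2, 0.7, 2.5, 1.9, 0.4, 1.1])  # hypothetical sample
n = len(x)

def ecdf(t, data):
    """Empirical cdf F_hat(t) = (1/n) * #{X_i <= t}."""
    return np.sum(data <= t) / len(data)

# plug-in estimators: apply the functional to F_hat instead of the unknown F
plugin_mean = np.mean(x)      # T(F) = mean
plugin_median = np.median(x)  # T(F) = 0.5 quantile

# histogram synopsis: m buckets, relative frequency per bucket
counts, edges = np.histogram(x, bins=3)
rel_freq = counts / n
print(ecdf(1.0, x), plugin_mean, plugin_median, rel_freq)
```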
Parametric Inference: Method of Moments
Compute sample moments: $\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j$ for the j-th moment αj.
Estimate parameter θ by the method-of-moments estimator θ̂n s.t.
$\alpha_1(F(\hat{\theta}_n)) = \hat{\alpha}_1$ and $\alpha_2(F(\hat{\theta}_n)) = \hat{\alpha}_2$ and $\alpha_3(F(\hat{\theta}_n)) = \hat{\alpha}_3$ and ...
(for some number of moments)
Method-of-moments estimators are usually consistent and
asymptotically Normal, but may be biased.
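As a sketch of the method of moments, assume the data are postulated to come from a Normal(μ, σ²) family (the sample values are hypothetical): matching the first two moments gives μ̂ = α̂1 and σ̂² = α̂2 − α̂1².

```python
import numpy as np

x = np.array([4.9, 5.6, 6.2, 5.1, 4.4, 5.8, 6.0])  # hypothetical sample

# sample moments alpha_hat_j = (1/n) * sum x_i^j
a1 = np.mean(x)
a2 = np.mean(x ** 2)

# method-of-moments estimates for a Normal(mu, sigma^2) model
mu_hat = a1
sigma2_hat = a2 - a1 ** 2   # note: biased (denominator n, not n-1)
print(mu_hat, sigma2_hat)
```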
Parametric Inference:
Maximum Likelihood Estimators (MLE)
Estimate parameter θ of a postulated distribution f(θ,x) such that
the probability that the data of the sample are generated by
this distribution is maximized.
⇒ Maximum likelihood estimation:
Maximize L(x1,...,xn, θ) = P[x1, ..., xn originate from f(θ,x)]
(often written as
L(θ | x1,...,xn) = f(x1,...,xn | θ))
or maximize log L.
If analytically intractable ⇒ use numerical iteration methods.
MLE Properties
Maximum Likelihood Estimators are
consistent, asymptotically Normal, and
asymptotically optimal in the following sense:
Consider two estimators U and T which are asymptotically Normal.
Let $\sigma_U^2$ and $\sigma_T^2$ denote the variances of the two Normal distributions
to which U and T converge in distribution.
The asymptotic relative efficiency of U to T is ARE(U,T) = $\sigma_T^2 / \sigma_U^2$.
Theorem: For an MLE θ̂n and any other estimator θ̃n
the following inequality holds:
$ARE(\tilde{\theta}_n, \hat{\theta}_n) \le 1$
Simple Example for
Maximum Likelihood Estimator
given:
• coin with Bernoulli distribution with
unknown parameter p for head, 1−p for tail
• sample (data): k heads in n coin tosses
needed: maximum likelihood estimation of p
Let L(k, n, p) = P[sample is generated from distribution with parameter p]
$= \binom{n}{k}\, p^k (1-p)^{n-k}$
Maximize the log-likelihood function log L(k, n, p):
$\log L = \log \binom{n}{k} + k \log p + (n-k) \log (1-p)$
$\frac{\partial \log L}{\partial p} = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{k}{n}$
Advanced Example for
Maximum Likelihood Estimator
given:
• Poisson distribution with parameter λ (expectation)
• sample (data): numbers x1, ..., xn ∈ ℕ0
needed: maximum likelihood estimation of λ
Let r be the largest among these numbers,
and let f0, ..., fr be the absolute frequencies of the numbers 0, ..., r.
$L(x_1,...,x_n, \lambda) = \prod_{i=0}^{r} \left( e^{-\lambda} \frac{\lambda^i}{i!} \right)^{f_i}$
$\frac{\partial \ln L}{\partial \lambda} = \sum_{i=0}^{r} f_i \left( \frac{i}{\lambda} - 1 \right) = 0$
$\Rightarrow\; \hat{\lambda} = \frac{\sum_{i=0}^{r} i\, f_i}{\sum_{i=0}^{r} f_i} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}$
Sophisticated Example for
Maximum Likelihood Estimator
given:
• discrete uniform distribution over [1,θ] ⊆ ℕ0 and density f(x) = 1/θ
• sample (data): numbers x1, ..., xn ∈ ℕ0
MLE for θ is max{x1, ..., xn} (see Wasserman p. 124)
MLE for Parameters
of Normal Distributions
$L(x_1,...,x_n, \mu, \sigma) = \left( \frac{1}{\sigma \sqrt{2\pi}} \right)^{n} e^{- \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2}}$
$\frac{\partial \ln L}{\partial \mu} = \frac{1}{2\sigma^2} \sum_{i=1}^{n} 2(x_i - \mu) = 0$
$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 = 0$
$\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$
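A short sketch (data hypothetical) contrasting the MLE variance, which uses the 1/n denominator and is therefore slightly biased, with the unbiased sample variance from the estimator slide above.

```python
import numpy as np

x = np.array([170.2, 165.4, 181.0, 174.8, 169.3, 177.5])  # hypothetical sample

mu_mle = np.mean(x)                   # mu_hat = (1/n) * sum x_i
var_mle = np.mean((x - mu_mle) ** 2)  # sigma_hat^2 with 1/n denominator (MLE, biased)
var_unbiased = np.var(x, ddof=1)      # 1/(n-1) sample variance

print(mu_mle, var_mle, var_unbiased)
```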
Bayesian Viewpoint of Parameter Estimation
• assume prior distribution f(θ) of parameter θ
• choose statistical model (generative model) f(x | θ)
that reflects our beliefs about RV X
• given RVs X1, ..., Xn for observed data,
the posterior distribution is f(θ | x1, ..., xn)
For X1=x1, ..., Xn=xn the likelihood is
$L(x_1, ..., x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = \prod_{i=1}^{n} \frac{f(\theta \mid x_i) \int_{\theta'} f(x_i \mid \theta')\, f(\theta')\, d\theta'}{f(\theta)}$
which implies
$f(\theta \mid x_1,...,x_n) \sim L(x_1,...,x_n \mid \theta) \cdot f(\theta)$ (posterior is proportional to
likelihood times prior)
MAP estimator (maximum a posteriori):
compute θ that maximizes f(θ | x1, …, xn) given a prior for θ
Analytically Intractable MLE for Parameters
of a Multivariate Normal Mixture
consider samples from a mixture of multivariate Normal distributions
with the density (e.g. height and weight of males and females):
$f(\vec{x}, \mu_1,...,\mu_k, \Sigma_1,...,\Sigma_k, \pi_1,...,\pi_k) = \sum_{j=1}^{k} \pi_j\, n(\vec{x}, \mu_j, \Sigma_j)$ with $\sum_{j=1}^{k} \pi_j = 1$
and $n(\vec{x}, \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^m\, |\Sigma_j|}}\; e^{-\frac{1}{2} (\vec{x} - \mu_j)^T \Sigma_j^{-1} (\vec{x} - \mu_j)}$
with expectation values μj
and invertible, positive definite, symmetric
m×m covariance matrices Σj
⇒ maximize log-likelihood function:
$\log L(x_1,...,x_n, \theta) := \log \prod_{i=1}^{n} P[x_i \mid \theta] = \sum_{i=1}^{n} \log \left( \sum_{j=1}^{k} \pi_j\, n(x_i, \mu_j, \Sigma_j) \right)$
Expectation-Maximization Method (EM)
Key idea:
when L(θ, X1, ..., Xn) (where the Xi and θ are possibly multivariate)
is analytically intractable then
• introduce latent (hidden, invisible, missing) random variable(s) Z
such that
• the joint distribution J(X1, ..., Xn, Z, θ) of the "complete" data
is tractable (often with Z actually being Z1, ..., Zn)
• derive the incomplete-data likelihood L(θ, X1, ..., Xn) by
integrating (marginalizing) over J:
$\hat{\theta} = \arg\max_{\theta} \sum_{z} J[\theta, X_1,...,X_n, Z \mid Z=z]\; P[Z=z]$
EM Procedure
Initialization: choose start estimate for θ(0)
Iterate (t = 0, 1, …) until convergence:
E step (expectation):
estimate the posterior probability of Z: P[Z | X1, …, Xn, θ(t)],
assuming θ were known and equal to the previous estimate θ(t),
and compute E_{Z | X1, …, Xn, θ(t)} [log J(X1, …, Xn, Z | θ)]
by integrating over the values for Z
M step (maximization, MLE step):
estimate θ(t+1) by maximizing
E_{Z | X1, …, Xn, θ(t)} [log J(X1, …, Xn, Z | θ)]
Convergence is guaranteed
(because the E step computes a lower bound of the true L function,
and the M step yields monotonically non-decreasing likelihood),
but may result in a local maximum of the log-likelihood function.
EM Example for Multivariate Normal Mixture
Expectation step (E step):
hij : P[ Zij  1| xi ,
(t )
P[ xi | n j ( 
(t)
] k
)]
(t)
P[
x
|
n
(

)]
 i l
l 1
Zij = 1
if ith data point
was generated
by jth component,
0 otherwise
Maximization step (M step):
n
 hij xi
 j : i n1
 hij
n
T
h
(
x


)(
x


)
 ij i j i j
 j : i 1
n
 hij
i 1
n
 j :
 hij
i 1
k n
  hij
n
 hij
i 1
 ( t 1)
 i 1
n
j 1i 1
Confidence Intervals
Estimator T for an interval for parameter θ such that
$P[T - a \le \theta \le T + a] \ge 1 - \alpha$
[T−a, T+a] is the confidence interval and 1−α is the confidence level.
For the distribution of a random variable X a value xγ (0 < γ < 1) with
$P[X \le x_\gamma] \ge \gamma$ and $P[X \ge x_\gamma] \ge 1 - \gamma$
is called a γ quantile; the 0.5 quantile is called the median.
For the Normal distribution N(0,1) the γ quantile is denoted Φγ.
Confidence Intervals for Expectations (1)
Let x1, ..., xn be a sample from a distribution with unknown
expectation μ and known variance σ².
For sufficiently large n the sample mean X̄ is N(μ, σ²/n) distributed
and $(\bar{X} - \mu)\sqrt{n} / \sigma$ is N(0,1) distributed:
$P\!\left[-z \le \frac{(\bar{X}-\mu)\sqrt{n}}{\sigma} \le z\right] = \Phi(z) - \Phi(-z) = \Phi(z) - (1 - \Phi(z)) = 2\Phi(z) - 1$
$\Rightarrow\; P\!\left[\bar{X} - \frac{z\,\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{z\,\sigma}{\sqrt{n}}\right] = 2\Phi(z) - 1$
$\Rightarrow\; P\!\left[\bar{X} - \frac{\Phi_{1-\alpha/2}\,\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{\Phi_{1-\alpha/2}\,\sigma}{\sqrt{n}}\right] \ge 1 - \alpha$
For a required confidence interval [X̄ − a, X̄ + a] set $z := \frac{a\sqrt{n}}{\sigma}$ and then look up Φ(z);
or for a required confidence level 1−α set z := the (1 − α/2) quantile of N(0,1),
and then $a := \frac{z\,\sigma}{\sqrt{n}}$.
Confidence Intervals for Expectations (2)
Let x1, ..., xn be a sample from a distribution with unknown
expectation μ and unknown variance σ², and let S² be the sample variance.
For sufficiently large n the random variable
$T := \frac{(\bar{X} - \mu)\sqrt{n}}{S}$
has a t distribution (Student distribution)
with n−1 degrees of freedom:
$f_{T,n}(t) = \frac{\Gamma(n/2)}{\Gamma\!\left(\frac{n-1}{2}\right)\sqrt{(n-1)\pi}} \left(1 + \frac{t^2}{n-1}\right)^{-n/2}$
with the Gamma function $\Gamma(x) = \int_0^{\infty} e^{-t}\, t^{x-1}\, dt$ for x > 0
(with the properties Γ(1) = 1 and Γ(x+1) = x·Γ(x))
$\Rightarrow\; P\!\left[\bar{X} - \frac{t_{n-1,1-\alpha/2}\, S}{\sqrt{n}} \le \mu \le \bar{X} + \frac{t_{n-1,1-\alpha/2}\, S}{\sqrt{n}}\right] \ge 1 - \alpha$
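A sketch of both interval constructions with scipy (the sample data and the "known" σ are hypothetical): the z-interval assumes the variance is known, the t-interval uses the sample variance and n−1 degrees of freedom.

```python
import numpy as np
from scipy import stats

x = np.array([9.8, 10.4, 10.1, 9.5, 10.9, 10.2, 9.7, 10.3])  # hypothetical sample
n, alpha = len(x), 0.05
x_bar = x.mean()

# known variance sigma^2 (assumed here): z-interval
sigma = 0.5
z = stats.norm.ppf(1 - alpha / 2)        # (1 - alpha/2) quantile of N(0,1)
ci_known = (x_bar - z * sigma / np.sqrt(n), x_bar + z * sigma / np.sqrt(n))

# unknown variance: t-interval with n-1 degrees of freedom
s = x.std(ddof=1)
t_q = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci_unknown = (x_bar - t_q * s / np.sqrt(n), x_bar + t_q * s / np.sqrt(n))

print(ci_known, ci_unknown)
```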
Normal Distribution Table
Student's t Distribution Table
2.3 Statistical Inference:
Hypothesis Testing and Regression
Hypothesis testing:
• aims to falsify some hypothesis by lack of statistical evidence
• design of test RV and its (approximate / limit) distribution
Regression:
• aims to estimate the joint distribution of input and output RVs
based on some model, usually by minimizing the quadratic error
Statistical Hypothesis Testing
A hypothesis test determines a probability 1−α
(test level α, significance level) that a sample X1, ..., Xn
from some unknown probability distribution has a certain property.
Examples:
1) The sample originates from a normal distribution.
2) Under the assumption of a normal distribution
the sample originates from a N(, 2) distribution.
3) Two random variables are independent.
4) Two random variables are identically distributed.
5) Parameter λ of a Poisson distribution from which the sample stems has value 5.
6) Parameter p of a Bernoulli distribution from which the sample stems has value 0.5.
General form:
null hypothesis H0 vs. alternative hypothesis H1
needs test variable X (derived from X1, ..., Xn, H0, H1) and
test region R with X ∈ R for rejecting H0 and X ∉ R for retaining H0.

             | Retain H0      | Reject H0
H0 true      | correct        | type I error
H1 true      | type II error  | correct
Hypotheses and p-Values
A hypothesis of the form θ = θ0 is called a simple hypothesis.
A hypothesis of the form θ > θ0 or θ < θ0 is called a composite hypothesis.
A test of the form H0: θ = θ0 vs. H1: θ ≠ θ0 is called a two-sided test.
A test of the form H0: θ ≤ θ0 vs. H1: θ > θ0 or H0: θ ≥ θ0 vs. H1: θ < θ0
is called a one-sided test.
Suppose that for every level α ∈ (0,1) there is a test
with rejection region Rα. Then the p-value is the smallest level
at which we can reject H0: p-value = inf{α | T(X1,...,Xn) ∈ Rα}.
A small p-value means strong evidence against H0.
Hypothesis Testing Example
Null hypothesis for n coin tosses: coin is fair or has
head probability p = p0; alternative hypothesis: p ≠ p0.
Test variable: X, the number of heads, is approximately
N(p0·n, p0(1−p0)·n) distributed (by the Central Limit Theorem),
thus $Z := \frac{(X/n - p_0)\,\sqrt{n}}{\sqrt{p_0(1-p_0)}}$ is approximately N(0,1) distributed.
Rejection of the null hypothesis at test level α (e.g. 0.05) if
$Z \le -\Phi_{1-\alpha/2}$ or $Z \ge \Phi_{1-\alpha/2}$.
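A sketch of this two-sided z-test for a fair coin (p0 = 0.5) with hypothetical counts; the p-value is computed from the standard Normal approximation.

```python
import numpy as np
from scipy import stats

n, heads, p0, alpha = 200, 116, 0.5, 0.05   # hypothetical: 116 heads in 200 tosses

z = (heads / n - p0) * np.sqrt(n) / np.sqrt(p0 * (1 - p0))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-sided

z_crit = stats.norm.ppf(1 - alpha / 2)
print(z, p_value, "reject H0" if abs(z) > z_crit else "retain H0")
```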
Wald Test
For testing H0: θ = θ0 vs. H1: θ ≠ θ0 use the test variable
$W = \frac{\hat{\theta} - \theta_0}{se(\hat{\theta})}$
with sample estimate θ̂ and standard error $se(\hat{\theta}) = \sqrt{Var[\hat{\theta}]}$.
W converges in distribution to N(0,1)
⇒ reject H0 at level α when |W| > Φ_{1−α/2}.
Chi-Square Distribution
Let X1, ..., Xn be independent, N(0,1) distributed random variables.
Then the random variable $\chi_n^2 := X_1^2 + ... + X_n^2$
is chi-square distributed with n degrees of freedom:
$f_{\chi_n^2}(x) = \frac{x^{n/2 - 1}\, e^{-x/2}}{2^{n/2}\, \Gamma(n/2)}$ for x > 0, and 0 otherwise.
Let n be a natural number, let X be N(0,1) distributed and
Y be χ² distributed with n degrees of freedom.
Then the random variable $T_n := \frac{X \sqrt{n}}{\sqrt{Y}}$
is t distributed with n degrees of freedom.
Chi-Square Goodness-of-Fit-Test
Given:
n sample values X1, ..., Xn of random variable X
with absolute frequencies (counts) H1, ..., Hk for k value classes vi
(e.g. value intervals) of random variable X
Null hypothesis:
the values Xi are f distributed (e.g. uniformly distributed),
where f has expectation μ and variance σ²
Approach:
$Y_k := \sum_{i=1}^{k} (H_i - E(v_i))\, \frac{\sqrt{n}}{\sigma}$ and $Z_k := \sum_{i=1}^{k} \frac{(H_i - E(v_i))^2}{E(v_i)}$
with E(vi) := n · P[X is in class vi according to f]
are both approximately χ² distributed with k−1 degrees of freedom.
Rejection of the null hypothesis at test level α (e.g. 0.05) if
$Z_k > \chi_{k-1,\, 1-\alpha}^2$
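A sketch of the goodness-of-fit statistic Zk using scipy, here testing observed die-roll counts (hypothetical) against a uniform distribution over k = 6 classes.

```python
import numpy as np
from scipy import stats

observed = np.array([18, 22, 16, 25, 20, 19])   # hypothetical counts H_i, k = 6 classes
n, k = observed.sum(), len(observed)
expected = np.full(k, n / k)                     # E(v_i) = n * P[class v_i] under uniform f

z_stat = ((observed - expected) ** 2 / expected).sum()
p_value = 1 - stats.chi2.cdf(z_stat, df=k - 1)

# the same via scipy's built-in test
chi2_stat, p_scipy = stats.chisquare(observed, f_exp=expected)
print(z_stat, p_value, chi2_stat, p_scipy)
```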
Chi-Square Independence Test
Given:
n samples of two random variables X, Y or, equivalently,
a two-dimensional random variable,
with (absolute) frequencies H11, ..., Hrc for r·c value classes,
where X has r and Y has c distinct classes.
(This is called a contingency table.)
Null hypothesis:
X and Y are independent; then the
expected (absolute) frequencies of the value classes would be
$E_{ij} = \frac{R_i\, C_j}{n}$ with $R_i := \sum_{j=1}^{c} H_{ij}$ and $C_j := \sum_{i=1}^{r} H_{ij}$
Approach:
$Z := \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(H_{ij} - E_{ij})^2}{E_{ij}}$ is approximately χ² distributed
with (r−1)(c−1) degrees of freedom.
Rejection of the null hypothesis at test level α (e.g. 0.05) if
$Z > \chi_{(r-1)(c-1),\, 1-\alpha}^2$
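A sketch of the independence test on a small r×c contingency table (counts hypothetical), once by hand and once via scipy.stats.chi2_contingency.

```python
import numpy as np
from scipy import stats

# hypothetical 2x3 contingency table: rows = X classes, columns = Y classes
H = np.array([[30, 15, 25],
              [20, 25, 35]])
n = H.sum()
R = H.sum(axis=1, keepdims=True)   # row sums R_i
C = H.sum(axis=0, keepdims=True)   # column sums C_j
E = R @ C / n                      # E_ij = R_i * C_j / n

z_stat = ((H - E) ** 2 / E).sum()
df = (H.shape[0] - 1) * (H.shape[1] - 1)
p_value = 1 - stats.chi2.cdf(z_stat, df=df)

chi2_stat, p_scipy, dof, expected = stats.chi2_contingency(H, correction=False)
print(z_stat, p_value, chi2_stat, p_scipy)
```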
Chi-Square Distribution Table
Linear Regression
(often used for parameter fitting of models)
Estimate r(x) = E[Y | X1=x1 ∧ ... ∧ Xm=xm] using a linear model
$Y = r(x) + \epsilon = \beta_0 + \sum_{i=1}^{m} \beta_i x_i + \epsilon$ with error ε with E[ε] = 0.
Given n sample points (x1(i), ..., xm(i), y(i)), i = 1..n, the
least-squares estimator (LSE) minimizes the quadratic error:
$\sum_{i=1..n} \left( \sum_{k=0..m} \beta_k x_k^{(i)} - y^{(i)} \right)^2 =: E(\beta_0, ..., \beta_m)$ (with $x_0^{(i)} = 1$)
Solve the linear equation system $\frac{\partial E}{\partial \beta_k} = 0$ for k = 0, ..., m;
equivalent to the MLE $\beta = (X^T X)^{-1} X^T Y$
with $Y = (y^{(1)} \ldots y^{(n)})^T$ and
$X = \begin{pmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_m^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_m^{(2)} \\ \vdots & & & & \vdots \\ 1 & x_1^{(n)} & x_2^{(n)} & \cdots & x_m^{(n)} \end{pmatrix}$
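A sketch of the LSE in matrix form (data generated here only for illustration), computing β = (XᵀX)⁻¹XᵀY and checking against numpy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 2
x = rng.normal(size=(n, m))                       # n sample points, m predictors
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.1, size=n)  # hypothetical truth

X = np.column_stack([np.ones(n), x])              # design matrix with the x_0^(i) = 1 column
beta = np.linalg.inv(X.T @ X) @ X.T @ y           # beta = (X^T X)^{-1} X^T Y

beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta, beta_lstsq)   # both close to (1.0, 2.0, -0.5)
```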
Logistic Regression
Estimate r(x) = E[Y | X=x] using a logistic model
$Y = r(x) + \epsilon = \frac{e^{\beta_0 + \sum_{i=1}^{m} \beta_i x_i}}{1 + e^{\beta_0 + \sum_{i=1}^{m} \beta_i x_i}} + \epsilon$ with error ε with E[ε] = 0
⇒ solution of the MLE for the βi values is based on numerical methods
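A sketch of the numerical MLE for the βi, maximizing the Bernoulli log-likelihood of the logistic model by plain gradient ascent (data and true parameters are hypothetical; a practical implementation would use Newton/IRLS or an existing library).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 2
x = rng.normal(size=(n, m))
true_beta = np.array([-0.5, 1.5, -1.0])            # hypothetical true parameters
X = np.column_stack([np.ones(n), x])
p = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = rng.binomial(1, p)                             # binary outcome Y

beta = np.zeros(m + 1)
lr = 0.1
for _ in range(2000):                              # gradient ascent on the log-likelihood
    r = 1.0 / (1.0 + np.exp(-X @ beta))            # r(x) under the current beta
    grad = X.T @ (y - r)                           # d logL / d beta
    beta += lr * grad / n

print(beta)   # roughly recovers true_beta
```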
Additional Literature for Chapter 2
• Manning / Schütze: Chapters 2 and 6
• Duda / Hart / Stork: Appendix A
• R. Nelson: Probability, Stochastic Processes, and Queueing Theory,
Springer, 1995
• M. Mitzenmacher, E. Upfal: Probability and Computing,
Cambridge University Press, 2005
• M. Greiner, G. Tinhofer: Stochastik für Studienanfänger der Informatik,
Carl Hanser Verlag, 1996
• G. Hübner: Stochastik, Vieweg, 1996
• Sean Borman: The Expectation Maximization Algorithm:
A Short Tutorial, http://www.seanborman.com/publications/EM_algorithm.pdf
• Jason Rennie: A Short Tutorial on Using Expectation-Maximization
with Mixture Models, http://people.csail.mit.edu/jrennie/writing/mixtureEM.pdf