Chapter 2: Basics from Probability Theory and Statistics

2.1 Probability Theory
Events, Probabilities, Random Variables, Distributions, Moments
Generating Functions, Deviation Bounds, Limit Theorems
Basics from Information Theory
2.2 Statistical Inference: Sampling and Estimation
Moment Estimation, Confidence Intervals
Parameter Estimation, Maximum Likelihood, EM Iteration
2.3 Statistical Inference: Hypothesis Testing and Regression
Statistical Tests, p-Values, Chi-Square Test
Linear and Logistic Regression

mostly following L. Wasserman, Chapters 1-5, with additions from other textbooks on stochastics

2.1 Basic Probability Theory
A probability space is a triple (Ω, E, P) with
• a set Ω of elementary events (sample space),
• a family E of subsets of Ω with Ω ∈ E which is closed under ∪, ∩, and ¬ with a countable number of operands (with finite Ω usually E = 2^Ω), and
• a probability measure P: E → [0,1] with P[Ω] = 1 and P[∪_i A_i] = Σ_i P[A_i] for countably many, pairwise disjoint A_i.

Properties of P:
P[A] + P[¬A] = 1
P[A ∪ B] = P[A] + P[B] − P[A ∩ B]
P[∅] = 0 (null/impossible event)
P[Ω] = 1 (true/certain event)

Independence and Conditional Probabilities
Two events A, B of a prob. space are independent if P[A ∩ B] = P[A] P[B].
A finite set of events A = {A_1, ..., A_n} is independent if for every subset S ⊆ A the equation
P[∩_{A_i ∈ S} A_i] = Π_{A_i ∈ S} P[A_i]
holds.
The conditional probability P[A | B] of A under the condition (hypothesis) B is defined as:
P[A | B] = P[A ∩ B] / P[B]
Event A is conditionally independent of B given C if P[A | BC] = P[A | C].

Total Probability and Bayes' Theorem
Total probability theorem: For a partitioning of Ω into events B_1, ..., B_n:
P[A] = Σ_{i=1}^{n} P[A | B_i] P[B_i]
Bayes' theorem:
P[A | B] = P[B | A] P[A] / P[B]
P[A | B] is called the posterior probability, P[A] is called the prior probability.

Random Variables
A random variable (RV) X on the prob. space (Ω, E, P) is a function X: Ω → M with M ⊆ R s.t. {e | X(e) ≤ x} ∈ E for all x ∈ M (X is measurable).
F_X: M → [0,1] with F_X(x) = P[X ≤ x] is the (cumulative) distribution function (cdf) of X.
With countable set M the function f_X: M → [0,1] with f_X(x) = P[X = x] is called the (probability) density function (pdf) of X; in general f_X(x) is F'_X(x).
For a random variable X with distribution function F, the inverse function F^{-1}(q) := inf{x | F(x) > q} for q ∈ [0,1] is called the quantile function of X.
(The 0.5 quantile (50th percentile) is called the median.)
Random variables with countable M are called discrete, otherwise they are called continuous. For discrete random variables the density function is also referred to as the probability mass function.

Important Discrete Distributions
• Bernoulli distribution with parameter p: P[X = x] = p^x (1−p)^{1−x} for x ∈ {0,1}
• Uniform distribution over {1, 2, ..., m}: P[X = k] = f_X(k) = 1/m for 1 ≤ k ≤ m
• Binomial distribution (coin toss repeated n times; X: #heads): P[X = k] = f_X(k) = (n choose k) p^k (1−p)^{n−k}
• Poisson distribution (with rate λ): P[X = k] = f_X(k) = e^{−λ} λ^k / k!
• Geometric distribution (#coin tosses until first head): P[X = k] = f_X(k) = (1−p)^k p
• 2-Poisson mixture (with a_1 + a_2 = 1): P[X = k] = f_X(k) = a_1 e^{−λ_1} λ_1^k / k! + a_2 e^{−λ_2} λ_2^k / k!
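To make the binomial pmf above concrete, here is a minimal simulation sketch in plain Python (the parameters p = 0.3, n = 10, the seed, and the number of trials are arbitrary choices for illustration, not from the slides): it repeats the n-fold coin toss many times and compares the relative frequency of k heads with f_X(k) = (n choose k) p^k (1−p)^{n−k}.

    import random
    from math import comb

    random.seed(0)
    p, n, trials = 0.3, 10, 200_000

    # Binomial pmf as on the slide: P[X = k] = C(n,k) p^k (1-p)^(n-k)
    def binom_pmf(k):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # Simulate n coin tosses per trial and count heads
    counts = [0] * (n + 1)
    for _ in range(trials):
        heads = sum(random.random() < p for _ in range(n))
        counts[heads] += 1

    for k in range(n + 1):
        print(k, round(counts[k] / trials, 4), round(binom_pmf(k), 4))

With 200,000 trials the simulated relative frequencies and the pmf values typically agree to two or three decimal places.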
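Returning to the total probability and Bayes' theorem slides above, the following sketch (plain Python; the prior and conditional probabilities are invented purely for illustration) computes P[A] by the total probability theorem and the posteriors P[B_i | A] by Bayes' theorem.

    # Hypothetical partitioning of Omega into B1, B2 with invented probabilities
    priors = {"B1": 0.2, "B2": 0.8}            # P[Bi]
    likelihoods = {"B1": 0.7, "B2": 0.1}       # P[A | Bi]

    # Total probability theorem: P[A] = sum_i P[A | Bi] * P[Bi]
    p_a = sum(likelihoods[b] * priors[b] for b in priors)

    # Bayes' theorem: P[Bi | A] = P[A | Bi] * P[Bi] / P[A]
    posteriors = {b: likelihoods[b] * priors[b] / p_a for b in priors}

    print(p_a)         # 0.22
    print(posteriors)  # {'B1': 0.636..., 'B2': 0.363...}; the posteriors sum to 1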
Important Continuous Distributions
• Uniform distribution in the interval [a,b]: f_X(x) = 1/(b−a) for a ≤ x ≤ b (0 otherwise)
• Exponential distribution (e.g. time until the next event of a Poisson process) with rate λ = lim_{Δt→0} (# events in Δt) / Δt: f_X(x) = λ e^{−λx} for x ≥ 0 (0 otherwise)
• Hyperexponential distribution: f_X(x) = p λ_1 e^{−λ_1 x} + (1−p) λ_2 e^{−λ_2 x}
• Pareto distribution: f_X(x) = (a/b) (b/x)^{a+1} for x > b (0 otherwise), an example of a "heavy-tailed" distribution with f_X(x) ~ c / x^{1+a}
• Logistic distribution: F_X(x) = 1 / (1 + e^{−x})

Normal Distribution (Gaussian Distribution)
• Normal distribution N(μ, σ²) (Gauss distribution; approximates sums of independent, identically distributed random variables):
f_X(x) = (1/√(2πσ²)) e^{−(x−μ)² / (2σ²)}
• Distribution function of N(0,1):
Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−x²/2} dx
Theorem: Let X be normally distributed with expectation μ and variance σ².
Then Y := (X − μ) / σ is normally distributed with expectation 0 and variance 1.

Multidimensional (Multivariate) Distributions
Let X_1, ..., X_m be random variables over the same prob. space with domains dom(X_1), ..., dom(X_m).
The joint distribution of X_1, ..., X_m has a density function f_{X_1,...,X_m}(x_1, ..., x_m) with
Σ_{x_1 ∈ dom(X_1)} ... Σ_{x_m ∈ dom(X_m)} f_{X_1,...,X_m}(x_1, ..., x_m) = 1 or
∫_{dom(X_1)} ... ∫_{dom(X_m)} f_{X_1,...,X_m}(x_1, ..., x_m) dx_m ... dx_1 = 1.
The marginal distribution of X_i in the joint distribution of X_1, ..., X_m has the density function
Σ_{x_1} ... Σ_{x_{i−1}} Σ_{x_{i+1}} ... Σ_{x_m} f_{X_1,...,X_m}(x_1, ..., x_m) or
∫_{X_1} ... ∫_{X_{i−1}} ∫_{X_{i+1}} ... ∫_{X_m} f_{X_1,...,X_m}(x_1, ..., x_m) dx_m ... dx_{i+1} dx_{i−1} ... dx_1.

Important Multivariate Distributions
Multinomial distribution (n trials with an m-sided die):
P[X_1 = k_1 ∧ ... ∧ X_m = k_m] = f_{X_1,...,X_m}(k_1, ..., k_m) = (n choose k_1 ... k_m) p_1^{k_1} ... p_m^{k_m}
with (n choose k_1 ... k_m) := n! / (k_1! ... k_m!)
Multidimensional normal distribution:
f_{X_1,...,X_m}(x) = (1/√((2π)^m |Σ|)) e^{−(1/2)(x−μ)^T Σ^{−1} (x−μ)}
with covariance matrix Σ, where Σ_ij := Cov(X_i, X_j)

Moments
For a discrete random variable X with density f_X:
E[X] = Σ_{k ∈ M} k f_X(k) is the expectation value (mean) of X
E[X^i] = Σ_{k ∈ M} k^i f_X(k) is the i-th moment of X
V[X] = E[(X − E[X])²] = E[X²] − E[X]² is the variance of X
For a continuous random variable X with density f_X:
E[X] = ∫ x f_X(x) dx is the expectation value (mean) of X
E[X^i] = ∫ x^i f_X(x) dx is the i-th moment of X
V[X] = E[(X − E[X])²] = E[X²] − E[X]² is the variance of X
Theorem: Expectation values are additive: E[X + Y] = E[X] + E[Y] (distributions are not).

Properties of Expectation and Variance
E[aX + b] = a E[X] + b for constants a, b
E[X_1 + X_2 + ... + X_n] = E[X_1] + E[X_2] + ... + E[X_n]
(i.e. expectation values are generally additive, but distributions are not!)
E[X_1 + X_2 + ... + X_N] = E[N] E[X] if X_1, X_2, ..., X_N are independent and identically distributed (iid) RVs with mean E[X] and N is a stopping-time RV
Var[aX + b] = a² Var[X] for constants a, b
Var[X_1 + X_2 + ... + X_n] = Var[X_1] + Var[X_2] + ... + Var[X_n] if X_1, X_2, ..., X_n are independent RVs
Var[X_1 + X_2 + ... + X_N] = E[N] Var[X] + E[X]² Var[N] if X_1, X_2, ..., X_N are iid RVs with mean E[X] and variance Var[X] and N is a stopping-time RV

Correlation of Random Variables
Covariance of random variables X_i and X_j:
Cov(X_i, X_j) := E[(X_i − E[X_i]) (X_j − E[X_j])]
Var(X_i) = Cov(X_i, X_i) = E[X_i²] − E[X_i]²
Correlation coefficient of X_i and X_j:
ρ(X_i, X_j) := Cov(X_i, X_j) / √(Var(X_i) Var(X_j))
Conditional expectation of X given Y = y:
E[X | Y = y] = Σ_x x f_{X|Y}(x | y) (discrete case)
E[X | Y = y] = ∫ x f_{X|Y}(x | y) dx (continuous case)
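As a numerical illustration of covariance and the correlation coefficient defined above, the following sketch (using numpy; the linear model Y = 2X + noise and all parameters are made up for the example) estimates Cov(X,Y) and ρ(X,Y) from a large sample and compares the result with numpy's own estimate.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 100_000

    # Invented dependence: Y = 2X + noise, so X and Y are positively correlated
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    y = 2.0 * x + rng.normal(scale=0.5, size=n)

    # Cov(X,Y) = E[(X - E[X])(Y - E[Y])], estimated from the sample
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    # rho(X,Y) = Cov(X,Y) / sqrt(Var(X) Var(Y))
    rho = cov_xy / np.sqrt(x.var() * y.var())

    print(cov_xy)                   # close to 2.0 (= 2 * Var(X))
    print(rho)                      # close to 2 / sqrt(4.25), about 0.97
    print(np.corrcoef(x, y)[0, 1])  # numpy's estimate agrees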
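The standardization theorem from the normal-distribution slide above can also be checked by simulation. A minimal sketch (numpy; μ = 5, σ = 2 and the sample size are arbitrary illustration values): draw from N(μ, σ²), form Y = (X − μ)/σ, and verify that Y has mean ≈ 0, variance ≈ 1, and that P[Y ≤ 1] matches Φ(1) ≈ 0.8413.

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma = 5.0, 2.0

    # Draw from N(mu, sigma^2) and standardize: Y = (X - mu) / sigma
    x = rng.normal(loc=mu, scale=sigma, size=500_000)
    y = (x - mu) / sigma

    print(y.mean(), y.var())   # approximately 0 and 1
    print(np.mean(y <= 1.0))   # approximately Phi(1) = 0.8413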
Transformations of Random Variables
Consider expressions r(X,Y) over RVs, such as X+Y, max(X,Y), etc.
1. For each z find A_z = {(x,y) | r(x,y) ≤ z}
2. Find the cdf F_Z(z) = P[r(X,Y) ≤ z] = ∫∫_{A_z} f_{X,Y}(x,y) dx dy
3. Find the pdf f_Z(z) = F'_Z(z)
Important case: sum of independent non-negative RVs, Z = X + Y:
F_Z(z) = P[X + Y ≤ z] = ∫∫_{x+y ≤ z} f_X(x) f_Y(y) dx dy
= ∫_{x=0}^{z} ∫_{y=0}^{z−x} f_X(x) f_Y(y) dy dx = ∫_{x=0}^{z} f_X(x) F_Y(z−x) dx
or in the discrete case: F_Z(z) = Σ_{x+y ≤ z} f_X(x) f_Y(y) ("convolution" of X and Y)

Generating Functions and Transforms
X, Y, ...: continuous random variables with non-negative real values
A, B, ...: discrete random variables with non-negative integer values
Moment-generating function of X: M_X(s) = ∫_0^∞ e^{sx} f_X(x) dx = E[e^{sX}]
Laplace-Stieltjes transform (LST) of X: f*_X(s) = ∫_0^∞ e^{−sx} f_X(x) dx = E[e^{−sX}]
Generating function of A (z-transform): G_A(z) = Σ_{i≥0} z^i f_A(i) = E[z^A]
Relationship: f*_A(s) = M_A(−s) = G_A(e^{−s})
Examples:
Exponential: f_X(x) = λ e^{−λx}, f*_X(s) = λ / (λ + s)
Erlang-k: f_X(x) = λk (λkx)^{k−1} e^{−λkx} / (k−1)!, f*_X(s) = (λk / (λk + s))^k
Poisson: f_A(k) = e^{−λ} λ^k / k!, G_A(z) = e^{λ(z−1)}

Properties of Transforms
M_X(s) = 1 + s E[X] + s² E[X²]/2! + s³ E[X³]/3! + ...
E[X^n] = (d^n M_X(s) / ds^n)(0)
f_A(n) = (1/n!) (d^n G_A(z) / dz^n)(0)
E[A] = (dG_A(z) / dz)(1)
f_X(x) = a g(x) + b h(x) ⟹ f*(s) = a g*(s) + b h*(s)
f_X(x) = g'(x) ⟹ f*(s) = s g*(s) − g(0)
f_X(x) = ∫_0^x g(t) dt ⟹ f*(s) = g*(s) / s
Convolution of independent random variables:
F_{X+Y}(z) = ∫_0^z f_X(x) F_Y(z−x) dx, f*_{X+Y}(s) = f*_X(s) f*_Y(s), M_{X+Y}(s) = M_X(s) M_Y(s)
F_{A+B}(k) = Σ_{i=0}^{k} f_A(i) F_B(k−i), G_{A+B}(z) = G_A(z) G_B(z)

Inequalities and Tail Bounds
Markov inequality: P[X ≥ t] ≤ E[X] / t for t > 0 and non-negative RV X
Chebyshev inequality: P[|X − E[X]| ≥ t] ≤ Var[X] / t² for t > 0 and any RV X with finite variance
Chernoff-Hoeffding bound: P[X ≥ t] ≤ inf { e^{−θt} M_X(θ) | θ > 0 }
Corollary: P[| (1/n) Σ_i X_i − p | > t] ≤ 2 e^{−2nt²} for Bernoulli(p) iid RVs X_1, ..., X_n and any t > 0
Mill's inequality: P[|Z| > t] ≤ √(2/π) e^{−t²/2} / t for an N(0,1)-distributed RV Z and t > 0
Cauchy-Schwarz inequality: E[XY] ≤ √(E[X²] E[Y²])
Jensen's inequality: E[g(X)] ≥ g(E[X]) for a convex function g; E[g(X)] ≤ g(E[X]) for a concave function g
(g is convex if for all c ∈ [0,1] and x_1, x_2: g(c x_1 + (1−c) x_2) ≤ c g(x_1) + (1−c) g(x_2))

Convergence of Random Variables
Let X_1, X_2, ... be a sequence of RVs with cdfs F_1, F_2, ..., and let X be another RV with cdf F.
• X_n converges to X in probability, X_n →P X, if for every ε > 0: P[|X_n − X| > ε] → 0 as n → ∞
• X_n converges to X in distribution, X_n →D X, if lim_{n→∞} F_n(x) = F(x) at all x for which F is continuous
• X_n converges to X in quadratic mean, X_n →qm X, if E[(X_n − X)²] → 0 as n → ∞
• X_n converges to X almost surely, X_n →as X, if P[X_n → X] = 1
Weak law of large numbers (for X̄_n = Σ_{i=1..n} X_i / n):
if X_1, X_2, ..., X_n, ... are iid RVs with mean E[X], then X̄_n →P E[X],
that is: lim_{n→∞} P[|X̄_n − E[X]| > ε] = 0.
Strong law of large numbers:
if X_1, X_2, ..., X_n, ... are iid RVs with mean E[X], then X̄_n →as E[X],
that is: P[lim_{n→∞} X̄_n = E[X]] = 1.

Poisson Approximates Binomial
Theorem: Let X be a random variable with binomial distribution with parameters n and p := λ/n, with large n and small constant λ << 1.
Then lim_{n→∞} f_X(k) = e^{−λ} λ^k / k!

Central Limit Theorem
Theorem: Let X_1, ..., X_n be independent, identically distributed random variables with expectation μ and variance σ². The distribution function F_n of the random variable Z_n := X_1 + ... + X_n converges to a normal distribution N(nμ, nσ²) with expectation nμ and variance nσ²:
lim_{n→∞} P[a ≤ (Z_n − nμ) / (σ √n) ≤ b] = Φ(b) − Φ(a)
Corollary:
X̄ := (1/n) Σ_{i=1}^{n} X_i
converges to a normal distribution N(μ, σ²/n) with expectation μ and variance σ²/n.
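To see the corollary of the central limit theorem in action, here is a small simulation sketch (numpy; the choice of Uniform[0,1] summands, n = 50 and the number of repetitions are arbitrary): the sample mean of n iid Uniform[0,1] variables, which have μ = 1/2 and σ² = 1/12, should be approximately N(μ, σ²/n).

    import numpy as np

    rng = np.random.default_rng(7)
    n, reps = 50, 100_000

    # X_i ~ Uniform[0,1]: mu = 0.5, sigma^2 = 1/12
    xbar = rng.random((reps, n)).mean(axis=1)

    print(xbar.mean(), xbar.var())   # ~0.5 and ~1/(12*50) = 0.001667

    # Standardize and compare an upper-tail probability with N(0,1)
    z = (xbar - 0.5) / np.sqrt(1.0 / (12 * n))
    print(np.mean(z > 1.96))         # ~0.025, as for a standard normal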
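Similarly, the Chernoff-Hoeffding corollary from the inequalities slide above can be compared with simulated data. A sketch (numpy; p = 0.5, n = 100, t = 0.1 are arbitrary illustration values): the empirical probability that the sample mean of n Bernoulli(p) variables deviates from p by more than t stays below the bound 2e^{−2nt²}.

    import numpy as np

    rng = np.random.default_rng(3)
    p, n, t, reps = 0.5, 100, 0.1, 200_000

    # Empirical tail probability P[|mean of n Bernoulli(p) - p| > t]
    x = rng.random((reps, n)) < p
    empirical = np.mean(np.abs(x.mean(axis=1) - p) > t)

    # Hoeffding bound from the corollary: 2 * exp(-2 n t^2)
    bound = 2 * np.exp(-2 * n * t**2)
    print(empirical, bound)   # the empirical value (~0.035) stays below the bound (~0.271)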
Elementary Information Theory
Let f(x) be the probability (or relative frequency) of the x-th symbol in some text d. The entropy of the text (or of the underlying prob. distribution f) is:
H(d) = Σ_x f(x) log_2 (1 / f(x))
H(d) is a lower bound for the bits per symbol needed with optimal coding (compression).
For two prob. distributions f(x) and g(x) the relative entropy (Kullback-Leibler divergence) of f to g is:
D(f || g) := Σ_x f(x) log_2 (f(x) / g(x))
Relative entropy is a measure for the (dis-)similarity of two probability or frequency distributions. It corresponds to the average number of additional bits needed for coding information (events) with distribution f when using an optimal code for distribution g.
The cross entropy of f(x) to g(x) is:
H(f, g) := H(f) + D(f || g) = − Σ_x f(x) log_2 g(x)

Compression
• Text is a sequence of symbols (with specific frequencies)
• Symbols can be
  • letters or other characters from some alphabet
  • strings of fixed length (e.g. trigrams)
  • or words, bits, syllables, phrases, etc.
Limits of compression:
Let p_i be the probability (or relative frequency) of the i-th symbol in text d.
Then the entropy of the text, H(d) = Σ_i p_i log_2 (1 / p_i), is a lower bound for the average number of bits per symbol in any compression (e.g. Huffman codes).
Note: compression schemes such as Ziv-Lempel (used in zip) are better because they consider context beyond single symbols; with appropriately generalized notions of entropy the lower-bound theorem still holds.
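A minimal sketch (plain Python; the two symbol distributions f and g are invented for illustration) of the three quantities defined above: entropy H(f), Kullback-Leibler divergence D(f || g), and cross entropy H(f, g), including a check that H(f, g) = H(f) + D(f || g).

    import math

    # Invented symbol probabilities over the same alphabet
    f = {"a": 0.5, "b": 0.25, "c": 0.25}
    g = {"a": 0.4, "b": 0.4, "c": 0.2}

    # Entropy H(f) = sum_x f(x) log2(1/f(x))
    H_f = sum(fx * math.log2(1.0 / fx) for fx in f.values())

    # Kullback-Leibler divergence D(f || g) = sum_x f(x) log2(f(x)/g(x))
    D_fg = sum(f[x] * math.log2(f[x] / g[x]) for x in f)

    # Cross entropy H(f, g) = -sum_x f(x) log2 g(x)
    H_fg = -sum(f[x] * math.log2(g[x]) for x in f)

    print(H_f)                       # 1.5 bits per symbol
    print(D_fg, H_fg, H_f + D_fg)    # H(f,g) equals H(f) + D(f || g)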