Chapter 2: Basics from Probability Theory and Statistics
2.1 Probability Theory
Events, Probabilities, Random Variables, Distributions, Moments
Generating Functions, Deviation Bounds, Limit Theorems
Basics from Information Theory
2.2 Statistical Inference: Sampling and Estimation
Moment Estimation, Confidence Intervals
Parameter Estimation, Maximum Likelihood, EM Iteration
2.3 Statistical Inference: Hypothesis Testing and Regression
Statistical Tests, p-Values, Chi-Square Test
Linear and Logistic Regression
mostly following L. Wasserman Chapters 1-5, with additions from other textbooks on stochastics
2.1 Basic Probability Theory
A probability space is a triple $(\Omega, E, P)$ with
• a set $\Omega$ of elementary events (sample space),
• a family $E$ of subsets of $\Omega$ with $\Omega \in E$ which is closed under $\cup$, $\cap$, and $\neg$ with a countable number of operands (with finite $\Omega$ usually $E = 2^{\Omega}$), and
• a probability measure $P: E \to [0,1]$ with $P[\Omega] = 1$ and $P[\bigcup_i A_i] = \sum_i P[A_i]$ for countably many, pairwise disjoint $A_i$.
Properties of P:
$P[A] + P[\neg A] = 1$
$P[A \cup B] = P[A] + P[B] - P[A \cap B]$
$P[\emptyset] = 0$ (null/impossible event)
$P[\Omega] = 1$ (true/certain event)
Independence and Conditional Probabilities
Two events A, B of a prob. space are independent
if $P[A \cap B] = P[A] \cdot P[B]$.
A finite set of events $A = \{A_1, ..., A_n\}$ is independent
if for every subset $S \subseteq A$ the equation
$P[\bigcap_{A_i \in S} A_i] = \prod_{A_i \in S} P[A_i]$
holds.
The conditional probability $P[A \mid B]$ of A under the
condition (hypothesis) B is defined as:
$P[A \mid B] = \frac{P[A \cap B]}{P[B]}$
Event A is conditionally independent of B given C
if $P[A \mid B \cap C] = P[A \mid C]$.
Total Probability and Bayes’ Theorem
Total probability theorem:
For a partitioning of $\Omega$ into events B1, ..., Bn:
$P[A] = \sum_{i=1}^{n} P[A \mid B_i] \cdot P[B_i]$
Bayes' theorem:
$P[A \mid B] = \frac{P[B \mid A] \cdot P[A]}{P[B]}$
$P[A \mid B]$ is called the posterior probability.
$P[A]$ is called the prior probability.
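To make the two formulas concrete, here is a minimal Python sketch with made-up numbers for a diagnostic-test scenario (the scenario and all values are illustrative assumptions, not part of the lecture notes):

```python
# Hedged sketch: Bayes' theorem with made-up numbers for a diagnostic test.
# A = "person has the disease", B = "test is positive".
p_A = 0.01             # prior P[A]
p_B_given_A = 0.95     # likelihood P[B | A] (sensitivity)
p_B_given_notA = 0.05  # false-positive rate P[B | not A]

# total probability: P[B] = P[B|A] P[A] + P[B|not A] P[not A]
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes: P[A|B] = P[B|A] P[A] / P[B]
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # ~0.161: the posterior is much larger than the prior, but still small
```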
Random Variables
A random variable (RV) X on the prob. space $(\Omega, E, P)$ is a function
$X: \Omega \to M$ with $M \subseteq \mathbb{R}$ s.t. $\{e \mid X(e) \le x\} \in E$ for all $x \in M$
(X is measurable).
$F_X: M \to [0,1]$ with $F_X(x) = P[X \le x]$ is the
(cumulative) distribution function (cdf) of X.
With countable set M the function $f_X: M \to [0,1]$ with $f_X(x) = P[X = x]$
is called the (probability) density function (pdf) of X;
in general $f_X(x)$ is $F'_X(x)$.
For a random variable X with distribution function F, the inverse function
$F^{-1}(q) := \inf\{x \mid F(x) > q\}$ for $q \in [0,1]$ is called the quantile function of X
(the 0.5 quantile (50th percentile) is called the median).
Random variables with countable M are called discrete,
otherwise they are called continuous.
For discrete random variables the density function is also
referred to as the probability mass function.
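As a hedged illustration of cdf, density, quantile function, and median for a concrete RV, the following sketch uses the SciPy library (our own choice; the notes do not prescribe any software):

```python
# Illustrative sketch: cdf, density, and quantile function of X ~ N(0, 1).
from scipy.stats import norm

X = norm(loc=0.0, scale=1.0)
print(X.cdf(1.0))   # F_X(1.0) = P[X <= 1.0] ~ 0.841
print(X.pdf(1.0))   # density f_X(1.0) = F'_X(1.0) ~ 0.242
print(X.ppf(0.5))   # quantile function at q = 0.5, i.e. the median (= 0.0 here)
```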
Important Discrete Distributions
• Bernoulli distribution with parameter p:
$P[X = x] = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$
• Uniform distribution over {1, 2, ..., m}:
$P[X = k] = f_X(k) = \frac{1}{m}$ for $1 \le k \le m$
• Binomial distribution (coin toss n times repeated; X: #heads):
$P[X = k] = f_X(k) = \binom{n}{k} p^k (1-p)^{n-k}$
• Poisson distribution (with rate $\lambda$):
$P[X = k] = f_X(k) = e^{-\lambda} \frac{\lambda^k}{k!}$
• Geometric distribution (#coin tosses until first head):
$P[X = k] = f_X(k) = (1-p)^k p$
• 2-Poisson mixture (with $a_1 + a_2 = 1$):
$P[X = k] = f_X(k) = a_1 e^{-\lambda_1} \frac{\lambda_1^k}{k!} + a_2 e^{-\lambda_2} \frac{\lambda_2^k}{k!}$
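The following small Python sketch (with illustrative parameter values chosen by us) evaluates some of these probability mass functions directly from the formulas above:

```python
# Illustrative sketch: evaluating binomial, Poisson, and geometric pmfs.
from math import comb, exp, factorial

n, p, lam = 10, 0.5, 3.0

def binomial_pmf(k):   # P[X = k] = C(n, k) p^k (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):    # P[X = k] = e^(-lambda) lambda^k / k!
    return exp(-lam) * lam**k / factorial(k)

def geometric_pmf(k):  # P[X = k] = (1-p)^k p, with the slide's convention k = 0, 1, 2, ...
    return (1 - p)**k * p

print(binomial_pmf(5), poisson_pmf(2), geometric_pmf(0))
print(sum(binomial_pmf(k) for k in range(n + 1)))  # the pmf sums to 1
```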
Important Continuous Distributions
• Uniform distribution in the interval [a,b]:
$f_X(x) = \frac{1}{b-a}$ for $a \le x \le b$ (0 otherwise)
• Exponential distribution (e.g. time until the next event of a
Poisson process) with rate $\lambda = \lim_{\Delta t \to 0}$ (# events in $\Delta t$) / $\Delta t$:
$f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$ (0 otherwise)
• Hyperexponential distribution:
$f_X(x) = p \lambda_1 e^{-\lambda_1 x} + (1-p) \lambda_2 e^{-\lambda_2 x}$ for $x \ge 0$ (0 otherwise)
• Pareto distribution:
$f_X(x) = \frac{a}{b} \left(\frac{b}{x}\right)^{a+1}$ for $x \ge b$ (0 otherwise),
an example of a "heavy-tailed" distribution whose density decays only polynomially, $f_X(x) \sim \frac{c}{x^{a+1}}$
• Logistic distribution:
$F_X(x) = \frac{1}{1 + e^{-x}}$
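As a hedged illustration (parameters chosen by us), the exponential and Pareto densities above can be sampled by inverting their distribution functions, and the resulting sample means can be checked against the known values:

```python
# Hedged sketch: inverse-cdf (quantile-function) sampling for the exponential
# and Pareto distributions, with a numerical check of their means.
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)

lam = 2.0
exp_samples = -np.log(1 - u) / lam          # F^{-1}(u) for the exponential distribution
print(exp_samples.mean())                    # ~ 1/lambda = 0.5

a, b = 3.0, 1.0
pareto_samples = b * (1 - u) ** (-1.0 / a)   # F^{-1}(u) for the Pareto distribution
print(pareto_samples.mean())                 # ~ a*b/(a-1) = 1.5 (heavy tail; finite mean only for a > 1)
```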
Normal Distribution (Gaussian Distribution)
• Normal distribution $N(\mu, \sigma^2)$ (Gauss distribution;
approximates sums of independent,
identically distributed random variables):
$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
• Distribution function of N(0,1):
$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{x^2}{2}}\, dx$
Theorem:
Let X be normally distributed with
expectation $\mu$ and variance $\sigma^2$.
Then $Y := \frac{X - \mu}{\sigma}$
is normally distributed with expectation 0 and variance 1.
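A small numerical sketch of the standardization theorem and of $\Phi$ (using NumPy and the error function; parameter values are our own):

```python
# Hedged sketch: standardizing a normal RV, and Phi via the error function.
import numpy as np
from math import erf, sqrt

mu, sigma = 5.0, 2.0
rng = np.random.default_rng(1)
x = rng.normal(mu, sigma, size=100_000)

y = (x - mu) / sigma        # Y = (X - mu) / sigma
print(y.mean(), y.var())    # ~ 0 and ~ 1, i.e. Y ~ N(0, 1)

def Phi(z):                 # cdf of N(0,1) expressed via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

print(Phi(1.96))            # ~ 0.975
```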
Multidimensional (Multivariate) Distributions
Let X1, ..., Xm be random variables over the same prob. space
with domains dom(X1), ..., dom(Xm).
The joint distribution of X1, ..., Xm has a density function
$f_{X_1,...,X_m}(x_1, ..., x_m)$
with
$\sum_{x_1 \in dom(X_1)} \cdots \sum_{x_m \in dom(X_m)} f_{X_1,...,X_m}(x_1, ..., x_m) = 1$
or
$\int_{dom(X_1)} \cdots \int_{dom(X_m)} f_{X_1,...,X_m}(x_1, ..., x_m)\, dx_m ... dx_1 = 1$
The marginal distribution of Xi in the joint distribution
of X1, ..., Xm has the density function
$\sum_{x_1} \cdots \sum_{x_{i-1}} \sum_{x_{i+1}} \cdots \sum_{x_m} f_{X_1,...,X_m}(x_1, ..., x_m)$ or
$\int_{X_1} \cdots \int_{X_{i-1}} \int_{X_{i+1}} \cdots \int_{X_m} f_{X_1,...,X_m}(x_1, ..., x_m)\, dx_m ... dx_{i+1}\, dx_{i-1} ... dx_1$
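A minimal sketch (with a made-up joint pmf) of how marginal distributions arise by summing out the other variables:

```python
# Hedged sketch: a small joint pmf over two discrete RVs and its marginals.
import numpy as np

# joint pmf f_{X1,X2}(x1, x2) as a table (rows: x1, columns: x2); entries sum to 1
joint = np.array([[0.10, 0.20, 0.10],
                  [0.05, 0.25, 0.30]])
print(joint.sum())                 # 1.0

marginal_x1 = joint.sum(axis=1)    # sum out x2 -> f_{X1}(x1)
marginal_x2 = joint.sum(axis=0)    # sum out x1 -> f_{X2}(x2)
print(marginal_x1, marginal_x2)    # [0.4, 0.6] and [0.15, 0.45, 0.4]
```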
Important Multivariate Distributions
multinomial distribution (n trials with m-sided dice):
$P[X_1 = k_1 \wedge ... \wedge X_m = k_m] = f_{X_1,...,X_m}(k_1, ..., k_m) = \binom{n}{k_1 \,...\, k_m} p_1^{k_1} \cdots p_m^{k_m}$
with $\binom{n}{k_1 \,...\, k_m} := \frac{n!}{k_1! \cdots k_m!}$
multidimensional normal distribution:
$f_{X_1,...,X_m}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^m |\Sigma|}}\, e^{-\frac{1}{2} (\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})}$
with covariance matrix $\Sigma$ where $\Sigma_{ij} := Cov(X_i, X_j)$
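A hedged NumPy sketch (covariance matrix and mean vector chosen by us) that samples from a 2-dimensional normal distribution and recovers $\Sigma$ from the samples:

```python
# Hedged sketch: sampling a 2-dimensional normal distribution and recovering Sigma.
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])     # Sigma_ij = Cov(X_i, X_j)

rng = np.random.default_rng(2)
x = rng.multivariate_normal(mu, Sigma, size=100_000)

print(x.mean(axis=0))              # ~ mu
print(np.cov(x, rowvar=False))     # ~ Sigma
```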
Moments
For a discrete random variable X with density fX
$E[X] = \sum_{k \in M} k \cdot f_X(k)$ is the expectation value (mean) of X
$E[X^i] = \sum_{k \in M} k^i \cdot f_X(k)$ is the i-th moment of X
$V[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2$ is the variance of X
For a continuous random variable X with density fX
$E[X] = \int x \cdot f_X(x)\, dx$ is the expectation value (mean) of X
$E[X^i] = \int x^i \cdot f_X(x)\, dx$ is the i-th moment of X
$V[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2$ is the variance of X
Theorem: Expectation values are additive: $E[X + Y] = E[X] + E[Y]$
(distributions are not)
Properties of Expectation and Variance
E[aX+b] = aE[X]+b for constants a, b
E[X1+X2+...+Xn] = E[X1] + E[X2] + ... + E[Xn]
(i.e. expectation values are generally additive, but distributions are not!)
E[X1+X2+...+XN] = E[N] E[X]
if X1, X2, ..., XN are independent and identically distributed (iid RVs)
with mean E[X] and N is a stopping-time RV
Var[aX+b] = a² Var[X] for constants a, b
Var[X1+X2+...+Xn] = Var[X1] + Var[X2] + ... + Var[Xn]
if X1, X2, ..., Xn are independent RVs
Var[X1+X2+...+XN] = E[N] Var[X] + E[X]² Var[N]
if X1, X2, ..., XN are iid RVs with mean E[X] and variance Var[X]
and N is a stopping-time RV
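A small simulation sketch (distributions and parameters chosen by us) of the identity E[X1+...+XN] = E[N] E[X] for a random number N of iid summands:

```python
# Hedged sketch: checking E[X1+...+XN] = E[N] * E[X] by simulation, where the
# number of summands N is itself random (here Poisson distributed).
import numpy as np

rng = np.random.default_rng(3)
mean_X, mean_N = 2.0, 5.0

totals = []
for _ in range(100_000):
    n = rng.poisson(mean_N)                              # random number of summands N
    totals.append(rng.exponential(mean_X, size=n).sum()) # X_i iid with mean 2.0

print(np.mean(totals))     # ~ E[N] * E[X] = 10.0
```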
Correlation of Random Variables
Covariance of random variables Xi and Xj:
$Cov(X_i, X_j) := E[(X_i - E[X_i]) \cdot (X_j - E[X_j])]$
$Var(X_i) = Cov(X_i, X_i) = E[X_i^2] - E[X_i]^2$
Correlation coefficient of Xi and Xj:
$\rho(X_i, X_j) := \frac{Cov(X_i, X_j)}{\sqrt{Var(X_i) \cdot Var(X_j)}}$
Conditional expectation of X given Y=y:
$E[X \mid Y = y] = \sum_x x \cdot f_{X|Y}(x \mid y)$ (discrete case)
$E[X \mid Y = y] = \int x \cdot f_{X|Y}(x \mid y)\, dx$ (continuous case)
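An illustrative NumPy sketch (our own toy data) computing covariance and the correlation coefficient:

```python
# Hedged sketch: covariance and correlation coefficient of two linearly related RVs.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)        # Y depends linearly on X plus noise

cov = np.mean((x - x.mean()) * (y - y.mean()))  # Cov(X, Y)
rho = cov / np.sqrt(x.var() * y.var())          # correlation coefficient
print(cov, rho)                                  # ~ 2.0 and ~ 0.894
print(np.corrcoef(x, y)[0, 1])                   # the same value via numpy
```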
Transformations of Random Variables
Consider expressions r(X,Y) over RVs such as X+Y, max(X,Y), etc.
1. For each z find $A_z = \{(x,y) \mid r(x,y) \le z\}$
2. Find the cdf $F_Z(z) = P[r(X,Y) \le z] = \iint_{A_z} f_{X,Y}(x,y)\, dx\, dy$
3. Find the pdf $f_Z(z) = F'_Z(z)$
Important case: sum of independent (non-negative) RVs, Z = X + Y:
$F_Z(z) = P[X + Y \le z] = \iint_{x+y \le z} f_X(x)\, f_Y(y)\, dx\, dy$
$= \int_{x=0}^{z} \int_{y=0}^{z-x} f_X(x)\, f_Y(y)\, dy\, dx = \int_{x=0}^{z} f_X(x)\, F_Y(z-x)\, dx$
or in the discrete case:
$F_Z(z) = \sum_{x+y \le z} f_X(x)\, f_Y(y)$
(the "convolution" of the distributions of X and Y)
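A minimal discrete-convolution sketch (two fair dice, an example of ours): the pmf of Z = X + Y is the convolution of the two pmfs:

```python
# Hedged sketch: pmf of Z = X + Y for two independent fair dice via convolution.
import numpy as np

die = np.full(6, 1 / 6)             # f_X(k) = f_Y(k) = 1/6 for k = 1..6
pmf_sum = np.convolve(die, die)     # f_Z over the values 2..12
for total, prob in zip(range(2, 13), pmf_sum):
    print(total, round(prob, 4))    # e.g. P[Z = 7] = 6/36 ~ 0.1667
```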
Generating Functions and Transforms
X, Y, ...: continuous random variables with non-negative real values
A, B, ...: discrete random variables with non-negative integer values
$M_X(s) = \int_0^{\infty} e^{sx} f_X(x)\, dx = E[e^{sX}]$:
moment-generating function of X
$f^*_X(s) = \int_0^{\infty} e^{-sx} f_X(x)\, dx = E[e^{-sX}]$:
Laplace-Stieltjes transform (LST) of X
$G_A(z) = \sum_{i \ge 0} z^i f_A(i) = E[z^A]$:
generating function of A (z transform)
$f^*_A(s) = M_A(-s) = G_A(e^{-s})$
Examples:
exponential: $f_X(x) = \lambda e^{-\lambda x}$, $f^*_X(s) = \frac{\lambda}{\lambda + s}$
Erlang-k: $f_X(x) = \frac{\lambda k (\lambda k x)^{k-1}}{(k-1)!}\, e^{-\lambda k x}$, $f^*_X(s) = \left(\frac{\lambda k}{\lambda k + s}\right)^k$
Poisson: $f_A(k) = e^{-\lambda} \frac{\lambda^k}{k!}$, $G_A(z) = e^{\lambda (z-1)}$
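A hedged Monte Carlo sketch (parameters chosen by us) checking the Laplace-Stieltjes transform of the exponential distribution against its closed form:

```python
# Hedged sketch: numerically checking f*_X(s) = E[e^{-sX}] = lambda / (lambda + s)
# for an exponentially distributed X.
import numpy as np

rng = np.random.default_rng(5)
lam, s = 2.0, 1.5
x = rng.exponential(scale=1 / lam, size=1_000_000)

print(np.mean(np.exp(-s * x)))   # Monte Carlo estimate of E[e^{-sX}]
print(lam / (lam + s))           # closed form: 2 / 3.5 ~ 0.5714
```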
Properties of Transforms
$M_X(s) = 1 + s E[X] + \frac{s^2 E[X^2]}{2!} + \frac{s^3 E[X^3]}{3!} + \ldots$
$E[X^n] = \frac{d^n M_X(s)}{ds^n}(0)$
$f_A(n) = \frac{1}{n!} \frac{d^n G_A(z)}{dz^n}(0)$
$E[A] = \frac{d G_A(z)}{dz}(1)$
$f_X(x) = a\, g(x) + b\, h(x) \;\Rightarrow\; f^*(s) = a\, g^*(s) + b\, h^*(s)$
$f_X(x) = g'(x) \;\Rightarrow\; f^*(s) = s\, g^*(s) - g(0)$
$f_X(x) = \int_0^x g(t)\, dt \;\Rightarrow\; f^*(s) = \frac{g^*(s)}{s}$
Convolution of independent random variables:
$F_{X+Y}(z) = \int_0^z f_X(x)\, F_Y(z-x)\, dx$
$f^*_{X+Y}(s) = f^*_X(s) \cdot f^*_Y(s)$
$M_{X+Y}(s) = M_X(s) \cdot M_Y(s)$
$F_{A+B}(k) = \sum_{i=0}^{k} f_A(i)\, F_B(k-i)$
$G_{A+B}(z) = G_A(z) \cdot G_B(z)$
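A small sketch (our own example) of the convolution rule for generating functions: for independent Poisson RVs, $G_{A+B}(z) = G_A(z) \cdot G_B(z)$, so A + B is again Poisson with rate $\lambda_1 + \lambda_2$:

```python
# Hedged sketch: E[z^(A+B)] for independent Poisson A, B equals the product of
# their generating functions, i.e. e^{(lambda1+lambda2)(z-1)}.
import numpy as np

rng = np.random.default_rng(6)
lam1, lam2, z = 1.5, 2.5, 0.7
a = rng.poisson(lam1, size=1_000_000)
b = rng.poisson(lam2, size=1_000_000)

print(np.mean(z ** (a + b)))             # Monte Carlo estimate of E[z^(A+B)]
print(np.exp((lam1 + lam2) * (z - 1)))   # G_A(z) * G_B(z) ~ 0.3012
```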
Inequalities and Tail Bounds
Markov inequality: $P[X \ge t] \le E[X] / t$
for t > 0 and non-neg. RV X
Chebyshev inequality: $P[\,|X - E[X]| \ge t\,] \le Var[X] / t^2$
for t > 0 and any RV X with finite variance
Chernoff-Hoeffding bound: $P[X \ge t] \le \inf\{ e^{-\lambda t} M_X(\lambda) \mid \lambda > 0 \}$
Corollary: $P\left[\,\left|\frac{1}{n}\sum_i X_i - p\right| > t\,\right] \le 2 e^{-2 n t^2}$
for Bernoulli(p) iid RVs
X1, ..., Xn and any t > 0
Mill's inequality: $P[\,|Z| > t\,] \le \sqrt{\frac{2}{\pi}}\, \frac{e^{-t^2/2}}{t}$
for N(0,1) distr. RV Z and t > 0
Cauchy-Schwarz inequality: $E[XY] \le \sqrt{E[X^2]\, E[Y^2]}$
Jensen's inequality: $E[g(X)] \ge g(E[X])$ for convex function g
$E[g(X)] \le g(E[X])$ for concave function g
(g is convex if for all $c \in [0,1]$ and $x_1, x_2$: $g(c x_1 + (1-c) x_2) \le c\, g(x_1) + (1-c)\, g(x_2)$)
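A hedged simulation sketch (parameters chosen by us) comparing the empirical tail probability of the Bernoulli sample mean with the Hoeffding-type bound from the corollary above:

```python
# Hedged sketch: empirically checking P[|mean - p| > t] <= 2 e^{-2 n t^2}
# for the mean of n iid Bernoulli(p) RVs.
import numpy as np

rng = np.random.default_rng(7)
n, p, t, trials = 100, 0.5, 0.1, 100_000

samples = rng.binomial(1, p, size=(trials, n))   # trials x n matrix of Bernoulli(p) values
deviations = np.abs(samples.mean(axis=1) - p)

print((deviations > t).mean())     # empirical tail probability (~0.035, well below the bound)
print(2 * np.exp(-2 * n * t**2))   # Hoeffding bound: 2 e^{-2} ~ 0.271
```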
Convergence of Random Variables
Let X1, X2, ... be a sequence of RVs with cdfs F1, F2, ...,
and let X be another RV with cdf F.
• Xn converges to X in probability, $X_n \to_P X$, if for every $\epsilon > 0$
$P[\,|X_n - X| > \epsilon\,] \to 0$ as $n \to \infty$
• Xn converges to X in distribution, $X_n \to_D X$, if
$\lim_{n \to \infty} F_n(x) = F(x)$ at all x for which F is continuous
• Xn converges to X in quadratic mean, $X_n \to_{qm} X$, if
$E[(X_n - X)^2] \to 0$ as $n \to \infty$
• Xn converges to X almost surely, $X_n \to_{as} X$, if $P[X_n \to X] = 1$
weak law of large numbers (for $\bar{X}_n := \sum_{i=1..n} X_i / n$):
if X1, X2, ..., Xn, ... are iid RVs with mean E[X], then $\bar{X}_n \to_P E[X]$,
that is: $\lim_{n \to \infty} P[\,|\bar{X}_n - E[X]| > \epsilon\,] = 0$
strong law of large numbers:
if X1, X2, ..., Xn, ... are iid RVs with mean E[X], then $\bar{X}_n \to_{as} E[X]$,
that is: $P[\,\lim_{n \to \infty} |\bar{X}_n - E[X]| > \epsilon\,] = 0$
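A minimal sketch (our own choice of distribution) illustrating the weak law of large numbers numerically:

```python
# Hedged sketch: the sample mean of iid exponential RVs (mean 2) concentrates at E[X].
import numpy as np

rng = np.random.default_rng(8)
for n in (10, 1_000, 100_000):
    x_bar = rng.exponential(scale=2.0, size=n).mean()
    print(n, x_bar)    # the sample mean approaches E[X] = 2 as n grows
```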
Poisson Approximates Binomial
Theorem:
Let X be a random variable with binomial distribution with
parameters n and p := $\lambda/n$ with large n and small constant $\lambda \ll 1$.
Then $\lim_{n \to \infty} f_X(k) = e^{-\lambda} \frac{\lambda^k}{k!}$
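A small numerical sketch (example values chosen by us) comparing the binomial pmf with its Poisson approximation:

```python
# Hedged sketch: for large n and p = lambda/n the binomial pmf approaches the Poisson pmf.
from math import comb, exp, factorial

lam, n, k = 3.0, 10_000, 2
p = lam / n

binom_pmf = comb(n, k) * p**k * (1 - p)**(n - k)
poisson_pmf = exp(-lam) * lam**k / factorial(k)
print(binom_pmf, poisson_pmf)    # both ~ 0.224
```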
Central Limit Theorem
Theorem:
Let X1, ..., Xn be independent, identically distributed random variables
with expectation $\mu$ and variance $\sigma^2$.
The distribution function Fn of the random variable Zn := X1 + ... + Xn
converges to a normal distribution $N(n\mu, n\sigma^2)$
with expectation $n\mu$ and variance $n\sigma^2$:
$\lim_{n \to \infty} P\left[ a \le \frac{Z_n - n\mu}{\sigma \sqrt{n}} \le b \right] = \Phi(b) - \Phi(a)$
Corollary:
$\bar{X} := \frac{1}{n} \sum_{i=1}^{n} X_i$
converges to a normal distribution $N(\mu, \sigma^2/n)$
with expectation $\mu$ and variance $\sigma^2/n$.
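A hedged simulation sketch (uniform summands, our own choice) of the central limit theorem: the standardized sum behaves approximately like N(0,1):

```python
# Hedged sketch: standardized sums of iid Uniform[0,1] RVs look approximately N(0, 1).
import numpy as np

rng = np.random.default_rng(9)
n, trials = 50, 100_000
mu, sigma = 0.5, np.sqrt(1 / 12)             # mean and std of Uniform[0, 1]

z = rng.uniform(size=(trials, n)).sum(axis=1)
z_std = (z - n * mu) / (sigma * np.sqrt(n))  # (Z_n - n*mu) / (sigma * sqrt(n))

print(z_std.mean(), z_std.std())   # ~ 0 and ~ 1
print((z_std <= 1.96).mean())      # ~ Phi(1.96) ~ 0.975
```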
Elementary Information Theory
Let f(x) be the probability (or relative frequency) of the x-th symbol
in some text d. The entropy of the text
(or of the underlying prob. distribution f) is:
$H(d) = \sum_x f(x) \log_2 \frac{1}{f(x)}$
H(d) is a lower bound for the bits per symbol
needed with optimal coding (compression).
For two prob. distributions f(x) and g(x) the
relative entropy (Kullback-Leibler divergence) of f to g is:
$D(f \| g) := \sum_x f(x) \log \frac{f(x)}{g(x)}$
Relative entropy is a measure for the (dis-)similarity of
two probability or frequency distributions.
It corresponds to the average number of additional bits
needed for coding information (events) with distribution f
when using an optimal code for distribution g.
The cross entropy of f(x) to g(x) is:
$H(f, g) := H(f) + D(f \| g) = -\sum_x f(x) \log g(x)$
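A minimal Python sketch (two made-up distributions) computing entropy, relative entropy, and cross entropy, and checking that H(f, g) = H(f) + D(f || g):

```python
# Hedged sketch: entropy, KL divergence, and cross entropy for two small distributions.
import numpy as np

f = np.array([0.5, 0.25, 0.25])
g = np.array([0.25, 0.25, 0.5])

H_f = np.sum(f * np.log2(1 / f))     # entropy H(f) = 1.5 bits
D_fg = np.sum(f * np.log2(f / g))    # relative entropy D(f || g) = 0.25 bits
H_fg = -np.sum(f * np.log2(g))       # cross entropy H(f, g)
print(H_f, D_fg, H_fg, H_f + D_fg)   # cross entropy = H(f) + D(f||g) = 1.75 bits
```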
Compression
• Text is sequence of symbols (with specific frequencies)
• Symbols can be
• letters or other characters from some alphabet
• strings of fixed length (e.g. trigrams)
• or words, bits, syllables, phrases, etc.
Limits of compression:
Let pi be the probability (or relative frequency)
of the i-th symbol in text d
Then the entropy of the text, $H(d) = \sum_i p_i \log_2 \frac{1}{p_i}$,
is a lower bound for the average number of bits per symbol
in any compression (e.g. Huffman codes)
Note:
compression schemes such as Ziv-Lempel (used in zip)
are better because they consider context beyond single symbols;
with appropriately generalized notions of entropy
the lower-bound theorem still holds
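As an illustrative sketch (the example string is our own), the entropy of a text's symbol frequencies gives the per-symbol lower bound discussed above:

```python
# Hedged sketch: empirical entropy of a text's symbol frequencies, i.e. the lower
# bound in bits per symbol for any symbol-by-symbol compression scheme.
from collections import Counter
from math import log2

text = "abracadabra"
counts = Counter(text)
n = len(text)

H = sum((c / n) * log2(n / c) for c in counts.values())
print(H)        # ~ 2.04 bits per symbol
print(H * n)    # ~ 22.4 bits: lower bound for coding the whole text symbol-wise
```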