Discrete random variables Probability mass function Given a Download

Transcript
Discrete random variables
Probability mass function
Given a discrete random variable X taking values in X = {v1 , . . . , vm }, its probability mass function P : X →
[0, 1] is defined as:
P (vi ) = Pr[X = vi ]
and satisfies the following conditions:
• P (x) ≥ 0
P
•
x∈X P (x) = 1
Discrete random variables
Expected value
• The expected value, mean or average of a random variable x is:
E[x] = µ =
X
xP (x) =
vi P (vi )
i=1
x∈X
• The expectation operator is linear:
m
X
E[λx + λ0 y] = λE[x] + λ0 E[y]
Variance
• The variance of a random variable is the moment of inertia of its probability mass function:
X
Var[x] = σ 2 = E[(x − µ)2 ] =
(x − µ)2 P (x)
x∈X
• The standard deviation σ indicates the typical amount of deviation from the mean one should expect for a
randomly drawn value for x.
Properties of mean and variance
second moment
E[x2 ] =
X
x2 P (x)
x∈X
variance in terms of expectation
Var[x] = E[x2 ] − E[x]2
variance and scalar multiplication
Var[λx] = λ2 Var[x]
variance of uncorrelated variables
Var[x + y] = Var[x] + Var[y]
1
Probability distributions
Bernoulli distribution
• Two possible values (outcomes): 1 (success), 0 (failure).
• Parameters: p probability of success.
• Probability mass function:
P (x; p) =
p
if x = 1
1 − p if x = 0
• E[x] = p
• Var[x] = p(1 − p)
Example: tossing a coin
• Head (success) and tail (failure) possible outcomes
• p is probability of head
Bernoulli distribution
Proof of mean
E[x]
=
X
xP (x)
x∈X
X
=
xP (x)
x∈{0,1}
=
0 · (1 − p) + 1 · p = p
Bernoulli distribution
Proof of variance
Var[x]
=
X
(x − µ)2 P (x)
x∈X
=
X
(x − p)2 P (x)
x∈{0,1}
=
(0 − p)2 · (1 − p) + (1 − p)2 · p
= p2 · (1 − p) + (1 − p) · (1 − p) · p
=
(1 − p) · (p2 + p − p2 )
=
(1 − p) · p
2
Probability distributions
Binomial distribution
• Probability of a certain number of successes in n independent Bernoulli trials
• Parameters: p probability of success, n number of trials.
• Probability mass function:
n
P (x; p, n) =
px (1 − p)n−x
x
• E[x] = np
• Var[x] = np(1 − p)
Example: tossing a coin
• n number of coin tosses
• probability of obtaining x heads
Pairs of discrete random variables
Probability mass function
Given a pair of discrete random variables X and Y taking values X = {v1 , . . . , vm } Y = {w1 , . . . , wn }, the joint
probability mass function is defined as:
P (vi , wj ) = Pr[X = vi , Y = wj ]
with properties:
• P (x, y) ≥ 0
P
P
•
x∈X
y∈Y P (x, y) = 1
Properties
• Expected value
XX
µx = E[x] =
xP (x, y)
x∈X y∈Y
XX
µy = E[y] =
yP (x, y)
x∈X y∈Y
• Variance
XX
σx2 = Var[(x − µx )2 ] =
(x − µx )2 P (x, y)
x∈X y∈Y
σy2
XX
2
= Var[(y − µy ) ] =
(y − µy )2 P (x, y)
x∈X y∈Y
• Covariance
XX
σxy = E[(x − µx )(y − µy )] =
x∈X y∈Y
• Correlation coefficient
ρ=
3
σxy
σx σy
(x − µx )(y − µy )P (x, y)
Probability distributions
Multinomial distribution (one sample)
• Models the probability of a certain outcome for an event with m possible outcomes.
• Parameters: p1 , . . . , pm probability of each outcome
• Probability mass function:
P (x1 , . . . , xm ; p1 , . . . , pm ) =
m
Y
pxi i
i=1
• where x1 , . . . , xm is a vector with xi = 1 for outcome i and xj = 0 for all j 6= i.
• E[xi ] = pi
• Var[xi ] = pi (1 − pi )
• Cov[xi , xj ] = −pi pj
Probability distributions
Multinomial distribution: example
• Tossing a dice with six faces:
– m is the number of faces
– pi is probability of obtaining face i
Probability distributions
Multinomial distribution (general case)
• Given n samples of an event with m possible outcomes, models the probability of a certain distribution of
outcomes.
• Parameters: p1 , . . . , pm probability of each outcome, n number of samples.
Pm
• Probability mass function (assumes i=1 xi = n):
n!
P (x1 , . . . , xm ; p1 , . . . , pm , n) = Qm
i=1 xi !
• E[xi ] = npi
• Var[xi ] = npi (1 − pi )
• Cov[xi , xj ] = −npi pj
4
m
Y
i=1
pxi i
Probability distributions
Multinomial distribution: example
• Tossing a dice
– n number of times a dice is tossed
– xi number of times face i is obtained
– pi probability of obtaining face i
Conditional probabilities
conditional probability probability of x once y is observed
P (x|y) =
P (x, y)
P (y)
statistical independence variables X and Y are statistical independent iff
P (x, y) = P (x)P (y)
implying:
P (x|y) = P (x)
P (y|x) = P (y)
Basic rules
law of total probability The marginal distribution of a variable is obtained from a joint distribution summing over
all possible values of the other variable (sum rule)
P (x) =
X
P (x, y)
X
P (y) =
y∈Y
x∈X
product rule conditional probability definition implies that
P (x, y) = P (x|y)P (y) = P (y|x)P (x)
Bayes’ rule
P (y|x) =
P (x|y)P (y)
P (x)
Bayes’ rule
Significance
P (y|x) =
P (x|y)P (y)
P (x)
• allows to “invert” statistical connections between effect (x) and cause (y):
posterior =
likelihood × prior
evidence
• evidence can be obtained using the sum rule from likelihood and prior:
X
X
P (x) =
P (x, y) =
P (x|y)P (y)
y
y
5
P (x, y)
Playing with probabilities
Use rules!
• Basic rules allow to model a certain probability (e.g. cause given effect) given knowledge of some related ones
(e.g. likelihood, prior)
• All our manipulations will be applications of the three basic rules
• Basic rules apply to any number of varables:
P (y)
XX
=
x
XX
=
x
P (x, y, z) (sum rule)
z
P (y|x, z)P (x, z) (product rule)
z
X X P (x|y, z)P (y|z)P (x, z)
=
x
P (x|z)
z
(Bayes rule)
Playing with probabilities
Example
P (y|x, z)
=
=
=
=
=
=
P (x, z|y)P (y)
(Bayes rule)
P (x, z)
P (x, z|y)P (y)
(product rule)
P (x|z)P (z)
P (x|z, y)P (z|y)P (y)
(product rule)
P (x|z)P (z)
P (x|z, y)P (z, y)
(product rule)
P (x|z)P (z)
P (x|z, y)P (y|z)P (z)
(product rule)
P (x|z)P (z)
P (x|z, y)P (y|z)
P (x|z)
Continuous random variables
Cumulative distribution function
• How to generalize probability mass function to continuous domains?
• Consider probability of intervals, e.g.
W = (a < X ≤ b) A = (X ≤ a) B = (X ≤ b)
• W and A are mutually exclusive, thus:
P (W ) = P (B) − P (A)
P (B) = P (A) + P (W )
• We call F (q) = P (X ≤ q) the cumulative distribution function (cdf) of X (monotonic function)
• The probability of an interval is the difference of two cdf:
P (a < X ≤ b) = F (b) − F (a)
6
Continuous random variables
Probability density function
• The derivative of the cdf is called probability density function (pdf):
p(x) =
d
F (x)
dx
• The probability of an interval can be computing integrating the pdf:
Z
P (a < X ≤ b) =
b
p(x)dx
a
• Properties:
– p(x) ≥ 0
R∞
– −∞ p(x)dx = 1
Continuous random variables
Pointwise probability
• The probability of a specific value x0 is given by:
1
P (X = x0 ) = lim P (x0 < X ≤ x0 + )
→0 Note
• The pdf of a value x can be greater than one, provided the integral is one.
• E.g. let p(x) be a uniform distribution over [a, b]:
p(x) = U nif (x; a, b) =
1
(a ≤ x ≤ b)
b−a
• For a = 0 and b = 1/2, p(x) = 2 for all x ∈ [0, 1/2] (but the integral is one)
Properties
expected value
Z
∞
E[x] = µ =
xp(x)dx
−∞
variance
2
Z
∞
Var[x] = σ =
(x − µ)2 p(x)dx
−∞
Note
Definitions and formulas for discrete random variables carry over to continuous random variables with sums replaced by integrals
7
Probability distributions
Gaussian (or normal) distribution
• Bell-shaped curve.
• Parameters: µ mean, σ 2 variance.
• Probability density function:
p(x; µ, σ) = √
1
(x − µ)2
exp −
2σ 2
2πσ
• E[x] = µ
• Var[x] = σ 2
• Standard normal distribution: N (0, 1)
• Standardization of a normal distribution N (µ, σ 2 )
z=
x−µ
σ
Probability distributions
Beta distribution
• Defined in the interval [0, 1]
• Parameters: α, β
• Probability density function:
p(x; α, β) =
Γ(α + β) α−1
x
(1 − x)β−1
Γ(α)Γ(β)
8
• E[x] =
α
α+β
• Var[x] =
Γ(x + 1) = xΓ(x), Γ(1) = 1
αβ
(α+β)2 (α+β+1)
Note
It models the posterior distribution of parameter p of a binomial distribution after observing α − 1 independent
events with probability p and β − 1 with probability 1 − p.
Probability distributions
Multivariate normal distribution
• normal distribution for d-dimensional vectorial data.
• Parameters: µ mean vector, Σ covariance matrix.
• Probability density function:
p(x; µ, Σ) =
1
1
exp − (x − µ)T Σ−1 (x − µ)
2
(2π)d/2 |Σ|1/2
• E[x] = µ
• Var[x] = Σ
9
• squared Mahalanobis distance from x to µ is standard measure of distance to mean:
r2 = (x − µ)T Σ−1 (x − µ)
Probability distributions
Dirichlet distribution
Pm
• Defined: x ∈ [0, 1]m , i=1 xi = 1
• Parameters: α = α1 , . . . , αm
• Probability density function:
m
Γ(α0 ) Y αi −1
xi
p(x1 , . . . , xm ; α) = Qm
i=1 Γ(αi ) i=1
10
• E[xi ] =
αi
α0
• Var[xi ] =
where α0 =
αi (α0 −αi )
α20 (α0 +1)
Pm
Cov[xi , xj ] =
j=1
αj
−αi αj
α20 (α0 +1)
Note
It models the posterior distribution of parameters p of a multinomial distribution after observing αi − 1 times each
mutually exclusive event
Probability laws
Expectation of an average
Consider a sample of X1 , . . . , Xn i.i.d instances drawn from a distribution with mean µ and variance σ 2 .
• Consider the random variable X̄n measuring the sample average:
X̄n =
X1 + · · · + Xn
n
11
• Its expectation is computed as (E[a(X + Y )] = a(E[X] + E[Y ])):
E[X̄n ] =
1
(E[X1 ] + · · · + E[Xn ]) = µ
n
• i.e. the expectation of an average is the true mean of the distribution
Probability laws
variance of an average
• Consider the random variable X̄n measuring the sample average:
X̄n =
X1 + · · · + Xn
n
• Its variance is computed as (Var[a(X + Y )] = a2 (Var[X] + Var[Y ]) for X and Y independent):
Var[X̄n ] =
σ2
1
(Var[X1 ] + · · · + Var[Xn ]) =
2
n
n
• i.e. the variance of the average decreases with the number of observations (the more examples you see, the more
likely you are to estimate the correct average)
Probability laws
Chebyshev’s inequality
Consider a random variable X with mean µ and variance σ 2 .
• Chebyshev’s inequality states that for all a > 0:
Pr[|X − µ| ≥ a] ≤
σ2
a2
• Replacing a = kσ for k > 0 we obtain:
Pr[|X − µ| ≥ kσ] ≤
1
k2
Note
Chebyshev’s inequality shows that most of the probability mass of a random variable stays within few standard
deviations from its mean
Probability laws
The law of large numbers
Consider a sample of X1 , . . . , Xn i.i.d instances drawn from a distribution with mean µ and variance σ 2 .
• For any > 0, its sample average X̄n obeys:
lim Pr[|X̄n − µ| > ] = 0
n→∞
• It can be shown using Chebyshev’s inequality and the facts that E[X̄n ] = µ, Var[X̄n ] = σ 2 /n:
Pr[|X̄n − E[X̄n ]| ≥ ] ≤
σ2
n2
Interpretation
• The accuracy of an empirical statistic increases with the number of samples
12
Probability laws
Central Limit theorem
Consider a sample of X1 , . . . , Xn i.i.d instances drawn from a distribution with mean µ and variance σ 2 .
1. Regardless of the distribution of Xi , for n → ∞, the distribution of the sample average X̄n approaches a Normal
distribution
2. Its mean approaches µ and its variance approaches σ 2 /n
3. Thus the normalized sample average:
z=
X̄n − µ
√σ
n
approaches a standard Normal distribution N (0, 1).
Central Limit theorem
Interpretation
• The sum of a sufficiently large sample of i.i.d. random measurements is approximately normally distributed
• We don’t need to know the form of their distribution (it can be arbitrary)
• Justifies the importance of Normal distribution in real world applications
Information theory
Entropy
• Consider a discrete set of symbols V = {v1 , . . . , vn } with mutually exclusive probabilities P (vi ).
• We aim a designing a binary code for each symbol, minimizing the average length of messages
• Shannon and Weaver (1949) proved that the optimal code assigns to each symbol vi a number of bits equal to
− log P (vi )
• The entropy of the set of symbols is the expected length of a message encoding a symbol assuming such optimal
coding:
n
X
H[V] = E[− log P (v)] = −
P (vi ) log P (vi )
i=1
Information theory
Cross entropy
• Consider two distributions P and Q over variable X
• The cross entropy between P and Q measures the expected number of bits needed to code a symbol sampled
from P using Q instead
H(P ; Q) = EP [− log Q(v)] = −
n
X
P (vi ) log Q(vi )
i=1
Note
It is often used as a loss for binary classification, with P (empirical) true distribution and Q (empirical) predicted
distribution.
13
Information theory
Relative entropy
• Consider two distributions P and Q over variable X
• The relative entropy or Kullback-Leibler (KL) divergence measures the expected length difference when coding
instances sampled from P using Q instead:
DKL (p||q) = H(P ; Q) − H(P )
n
n
X
X
=−
P (vi ) log Q(vi ) +
P (vi ) log P (vi )
i=1
=
n
X
i=1
P (vi ) log
i=1
P (vi )
Q(vi )
Note
The KL-divergence is not a distance (metric) as it is not necessarily symmetric
Information theory
Conditional entropy
• Consider two variables V, W with (possibly different) distributions P
• The conditional entropy is the entropy remaining for variable W once V is known:
X
H(W |V ) =
P (v)H(W |V = v)
v
=−
X
P (v)
v
X
P (w|v) log P (w|v)
w
Information theory
Mutual information
• Consider two variables V, W with (possibly different) distributions P
• The mutual information (or information gain) is the reduction in entropy for W once V is known:
I(W ; V ) = H(W ) − H(W |V )
X
X
X
=−
p(w) log p(w) +
P (v)
P (w|v) log P (w|v)
w
v
w
Note
It is used e.g. in selecting the best attribute to use in building a decision tree, where V is the attribute and W is the
label.
14