I529: Machine Learning in Bioinformatics (Spring 2013)

Probabilistic models
Yuzhen Ye
School of Informatics and Computing
Indiana University, Bloomington
Spring 2013

Definitions
Probabilistic models
– A model means a system that simulates the object under consideration
– A probabilistic model is one that produces different outcomes with different probabilities (BSA)
Why probabilistic models
The biological system being analyzed is
– stochastic,
– or noisy,
– or completely deterministic, but because a number of hidden variables affecting its behavior are unknown, the observed data might be best explained with a probabilistic model

Probability
Experiment: a procedure involving chance that leads to different results
Outcome: the result of a single trial of an experiment
Event: one or more outcomes of an experiment
Probability: the measure of how likely an event is
– Between 0 (will not occur) and 1 (will occur)
Example: a fair 6-sided dice
Outcome: the possible outcomes of this experiment are 1, 2, 3, 4, 5 and 6
Events: 1; 6; even
Probability: outcomes are equally likely to occur
– P(A) = (the number of ways event A can occur) / (the total number of possible outcomes)
– P(1) = P(6) = 1/6; P(even) = 3/6 = 1/2

Random variable
Random variables are functions that assign a unique number to each possible outcome of an experiment
An example
– Experiment: tossing a coin
– Outcome space: {heads, tails}
– X = 1 if heads, 0 if tails
– More exactly, X is a discrete random variable
– P(X=1) = 1/2, P(X=0) = 1/2
Probability distribution
Probability distribution: the assignment of a probability P(x) to each outcome x.
A fair dice: outcomes are equally likely to occur; the probability distribution over all six outcomes is P(x) = 1/6, x = 1, 2, 3, 4, 5 or 6.
A loaded dice: outcomes are unequally likely to occur; the probability distribution over all six outcomes is P(x) = f(x), x = 1, 2, 3, 4, 5 or 6, but ∑f(x) = 1.

Probability mass function (pmf)
A probability mass function is a function that gives the probability that a discrete random variable is exactly equal to some value; it is often the primary means of defining a discrete probability distribution.
Example (tossing a coin):
X = \begin{cases} 1 & \text{if heads} \\ 0 & \text{if tails} \end{cases}
\qquad
P(X) = \begin{cases} 1/2 & \text{heads} \\ 1/2 & \text{tails} \\ 0 & \text{others} \end{cases}
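As a small illustration of the pmfs above, here is a minimal Python sketch contrasting a fair and a loaded dice (the loaded-die probabilities are the ones used in the loaded-dice example later in these slides):

```python
# Fair vs loaded dice pmfs (the loaded probabilities match the later example slides).
fair_pmf = {x: 1 / 6 for x in range(1, 7)}
loaded_pmf = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

# Both are valid pmfs: non-negative values summing to 1.
for pmf in (fair_pmf, loaded_pmf):
    assert abs(sum(pmf.values()) - 1.0) < 1e-12
print(fair_pmf[6], loaded_pmf[6])   # 0.1666..., 0.5
```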
The probability of a DNA sequence
Event: observing a DNA sequence S = s1 s2 … sn, si ∈ {A, C, G, T}
Random sequence model (or independent and identically distributed, i.i.d., model): si occurs at random with the probability P(si), independent of all other residues in the sequence:
P(S) = \prod_{i=1}^{n} P(s_i)
This model will be used as a background model (also called a null hypothesis).
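A minimal Python sketch of the background model; the uniform base frequencies below are an assumption for illustration, not part of the slide:

```python
# Probability of a DNA sequence under the i.i.d. background model, P(S) = prod_i P(s_i).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # assumed base frequencies

def seq_probability(seq, p=background):
    """P(S) = product over residues of P(s_i) (i.i.d. background model)."""
    prob = 1.0
    for s in seq:
        prob *= p[s]
    return prob

print(seq_probability("ACGT"))    # 0.25**4 = 0.00390625
```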
Joint probability
Two experiments (random variables) X and Y
– P(X,Y): joint probability (distribution) of X and Y
– P(X,Y) = P(X|Y)P(Y) = P(Y|X)P(X)
– P(X|Y) = P(X) if X and Y are independent
Example: experiment 1 (selecting a dice), experiment 2 (rolling the selected dice)
– P(y): y = D1 or D2
– P(i, D1) = P(i|D1)P(D1)
– P(i|D1) = P(i|D2), independent events
Probability density function (pdf)
Probability density functions (pdf) are for continuous rather than discrete random variables: f(x)
A pdf must be integrated over an interval to yield a probability, since P(X = a) = 0 for any single value a:
P(a \le X \le b) = \int_{a}^{b} f(x)\,dx

Cumulative distribution function (cdf)
P(X \le x) = \int_{-\infty}^{x} f(t)\,dt
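A quick numerical check of the pdf/cdf relation, using a standard normal from scipy as an illustrative continuous distribution (not part of the slide):

```python
# P(a <= X <= b) is the integral of the pdf over [a, b], and equals the cdf difference.
from scipy.stats import norm
from scipy.integrate import quad

a, b = -1.0, 1.0
area, _ = quad(norm.pdf, a, b)            # integrate the pdf over [a, b]
print(area)                               # ~0.6827
print(norm.cdf(b) - norm.cdf(a))          # same value via the cdf
print(norm.pdf(0.0), "is a density, not a probability; P(X = 0) = 0")
```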
Marginal probability
The distribution of the marginal variables (the marginal distribution) is obtained by marginalizing over the distribution of the variables being discarded (so the discarded variables are marginalized out)
P(X) = \sum_{Y} P(X|Y)P(Y)
P(x) = \int P(x, y)\,dy
Example: experiment 1 (selecting a dice), experiment 2 (rolling the selected dice)
– P(y): y = D1 or D2
– P(i) = P(i|D1)P(D1) + P(i|D2)P(D2)
– P(i|D1) = P(i|D2), independent events
– P(i) = P(i|D1)(P(D1) + P(D2)) = P(i|D1)
Conditional probability
Conditioning the joint distribution on a particular observation
Conditional probability P(X|Y): the measure of how likely an event X happens under the condition Y
P(x|y) \equiv \frac{P(x, y)}{P(y)} = \frac{P(x, y)}{\int P(x, y)\,dy}
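The two-dice example above can be worked through numerically; the die-selection probabilities (0.5 each) and the loaded-die pmf are assumptions for illustration, not given on the slide:

```python
# Joint / marginal / conditional relations for the two-dice example.
P_die = {"D1": 0.5, "D2": 0.5}                       # experiment 1: select a die (assumed)
P_roll = {                                           # experiment 2: P(i | die)
    "D1": {i: 1 / 6 for i in range(1, 7)},           # fair die
    "D2": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},  # loaded die (assumed)
}

def joint(i, d):                  # P(i, d) = P(i|d) P(d)
    return P_roll[d][i] * P_die[d]

def marginal(i):                  # P(i) = sum_d P(i|d) P(d)
    return sum(joint(i, d) for d in P_die)

def conditional(d, i):            # P(d|i) = P(i, d) / P(i)
    return joint(i, d) / marginal(i)

print(marginal(6))                # 0.5*(1/6) + 0.5*0.5 = 0.3333...
print(conditional("D2", 6))       # 0.75: observing a six makes the loaded die more likely
```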
Probability models
A system that produces different outcomes with different probabilities.
It can simulate a class of objects (events), assigning each an associated probability.
– Example: two dice D1, D2
Simple objects (processes) → probability distributions
• P(i|D1): probability of picking i using dice D1
• P(i|D2): probability of picking i using dice D2
Typical probability distributions
– Binomial distribution
– Gaussian distribution
– Multinomial distribution
– Poisson distribution
– Dirichlet distribution

Binomial distribution
An experiment with binary outcomes: 0 or 1
Probability distribution of a single experiment: P('1') = p and P('0') = 1 - p
Probability distribution of N tries of the same experiment:
Bi(k \text{ '1's out of } N \text{ tries}) \sim \binom{N}{k} p^{k} (1-p)^{N-k}
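A short sketch of the binomial pmf, plus the tail probability from the handout formula further below; the example numbers are arbitrary:

```python
from math import comb

def binom_pmf(k, N, p):
    """Bi(k; N, p) = C(N, k) p^k (1-p)^(N-k)."""
    return comb(N, k) * p**k * (1 - p) ** (N - k)

def binom_tail(n, N, p):
    """Probability of more than n successes out of N trials (handout formula)."""
    return sum(binom_pmf(k, N, p) for k in range(n + 1, N + 1))

print(binom_pmf(5, 10, 0.5))   # ~0.246: exactly 5 heads in 10 fair flips
print(binom_tail(7, 10, 0.5))  # ~0.0547: more than 7 heads in 10 flips
```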
Gaussian distribution
When N → ∞, Bi → Gaussian distribution
The Gaussian (or normal) distribution is a continuous probability distribution with a bell-shaped probability density function:
f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}
where \mu is the mean (expectation) and \sigma^2 is the variance (\sigma is the standard deviation).
If we define a new variable u = (x - \mu)/\sigma, we have
f(x) \sim \frac{1}{\sqrt{2\pi}}\, e^{-u^{2}/2}
the standard normal distribution, when \mu = 0 and \sigma^2 = 1.
(Figure from Wikipedia)
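A small sketch evaluating the Gaussian density and the standardization u = (x − μ)/σ; the μ and σ values below are arbitrary:

```python
from math import sqrt, pi, exp

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """f(x; mu, sigma^2) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x-mu)/sigma)**2)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

mu, sigma = 10.0, 2.0        # assumed example parameters
x = 12.0
u = (x - mu) / sigma
print(gaussian_pdf(x, mu, sigma))       # density at x
print(gaussian_pdf(u) / sigma)          # same value via the standard normal form
```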
• Binomial Distribution.
The binomial distribution gives the discrete probability distribution of obtaining exactly n successes out of N Bernoulli trials (where the result of each Bernoulli trial is true with probability p and false with probability q = 1 - p):
P(n|N) = \binom{N}{n} p^{n} q^{N-n}
The probability of obtaining more successes than the n observed in a binomial distribution is:
P = \sum_{k=n+1}^{N} \binom{N}{k} p^{k} q^{N-k}
• Poisson Distribution.
The Poisson distribution applies when events occur with a known average rate (\lambda) and independently of the time since the last event (i.e., it is a Poisson process):
p(n) = \frac{\lambda^{n} e^{-\lambda}}{n!}

Multinomial distribution
An experiment with K independent outcomes, with probabilities \theta_i, i = 1, …, K, \sum_i \theta_i = 1.
Probability distribution of N tries of the same experiment, getting n_i occurrences of outcome i, \sum_i n_i = N (n = \{n_i\}):
P(n|\theta) = M^{-1}(n) \prod_{i=1}^{K} \theta_i^{n_i}
M(n) = \frac{\prod_i n_i!}{(\sum_k n_k)!} = \frac{n_1! n_2! \cdots n_K!}{(\sum_k n_k)!}

Example: a fair dice
Probability: outcomes (1, 2, …, 6) are equally likely to occur
Probability of rolling one dozen times (12) and getting each outcome twice:
\frac{12!}{2^{6}} \left(\frac{1}{6}\right)^{12} \approx 3.4 \times 10^{-3}
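A sketch verifying the worked example above (and, using the loaded-die probabilities from the next slide, the second example as well):

```python
from math import factorial

def multinomial_prob(counts, probs):
    """P(n|theta) = (sum n_i)! / prod(n_i!) * prod(theta_i ** n_i)."""
    N = sum(counts)
    coeff = factorial(N)
    for n in counts:
        coeff //= factorial(n)
    p = coeff
    for n, theta in zip(counts, probs):
        p *= theta ** n
    return p

print(multinomial_prob([2] * 6, [1 / 6] * 6))        # ~3.4e-3 (fair dice)
print(multinomial_prob([2] * 6, [0.1] * 5 + [0.5]))  # ~1.87e-4 (loaded dice, next slide)
```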
Example: a loaded dice
Probability: outcomes (1, 2, …, 6) are unequally likely to occur: P(6) = 0.5, P(1) = P(2) = … = P(5) = 0.1
Probability of rolling one dozen times (12) and getting each outcome twice:
\frac{12!}{2^{6}} (0.5)^{2} (0.1)^{10} \approx 1.87 \times 10^{-4}

Poisson distribution
Poisson gives the probability of seeing n events over some interval, when there is a probability p of an individual event occurring in that period.
Example: I receive 10 emails every day on average; what's my chance of receiving no email today?
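A sketch answering the email question with the Poisson pmf (λ = 10, as stated above):

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    """p(n) = lambda^n e^{-lambda} / n!"""
    return lam ** n * exp(-lam) / factorial(n)

print(poisson_pmf(0, 10))    # e^-10 ~ 4.5e-5: chance of receiving no email today
```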
Poisson distribution for sequencing coverage modeling
Assuming uniform distribution of reads:
– Length of genomic segment: L
– Number of reads: n
– Length of each read: l
– Coverage: λ = n l / L
How much coverage is enough (or what is sufficient oversampling)?
Lander-Waterman model: P(x) = λ^x e^{-λ} / x!
P(x=0) = e^{-λ}, where λ is the coverage
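A sketch of the Lander-Waterman calculation; the genome and read numbers are assumed toy values, not from the slide:

```python
from math import exp

# Assumed toy numbers: a 5 Mb segment, 1,000,000 reads of 100 bp each.
L, n, l = 5_000_000, 1_000_000, 100
coverage = n * l / L                     # lambda = n*l/L = 20x
print(coverage)                          # 20.0
print(exp(-coverage))                    # P(x=0): expected uncovered fraction, ~2e-9
```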
• Gamma Distribution.
The gamma distribution has been used to model the rate of evolution at different sites in DNA sequences. The gamma distribution is conjugate to the Poisson, which gives the probability of seeing n events over some interval, when there is a probability p of an individual event occurring in that interval. Since the number of events in an interval is a rate, the gamma distribution is appropriate for modeling probabilities of rates.
g(x; \alpha, \beta) = \frac{\beta^{\alpha} e^{-\beta x} x^{\alpha-1}}{\Gamma(\alpha)} \quad \text{for } 0 < x, \alpha, \beta < \infty

• Dirichlet Distribution.
In Bayesian statistics, we need distributions over probability parameters to use as prior distributions. A natural choice for a density over probabilities is the Dirichlet distribution:
D(\theta|\alpha) = Z^{-1}(\alpha) \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}\, \delta\!\left(\sum_{i=1}^{K} \theta_i - 1\right)
where \alpha = \alpha_1, \alpha_2, \cdots, \alpha_K, with \alpha_i > 0, are constants specifying the Dirichlet distribution, and the \theta_i satisfy 0 \le \theta_i \le 1 and sum to 1.
The normalizing factor Z for the Dirichlet can be expressed in terms of the gamma function:
Z(\alpha) = \int \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}\, \delta\!\left(\sum_i \theta_i - 1\right) d\theta = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma(\sum_i \alpha_i)}

Dirichlet distribution
Outcomes: \theta = (\theta_1, \theta_2, …, \theta_K)
Density: D(\theta|\alpha), as above
A different \alpha = (\alpha_1, \alpha_2, …, \alpha_K) (the constants above) gives a different probability distribution over \theta.
K = 2: the Beta distribution
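A sketch showing how different α vectors give different distributions over θ, using numpy's Dirichlet sampler (the α values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in ([1, 1, 1, 1, 1, 1],        # uniform over the simplex
              [10, 10, 10, 10, 10, 10],  # concentrated near the fair die
              [0.5] * 6):                # favors sparse, loaded-looking dice
    theta = rng.dirichlet(alpha)
    print(alpha, np.round(theta, 3), theta.sum())   # each sample sums to 1
```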
Example: dice factories
Dice factories produce all kinds of dice: θ = (θ(1), θ(2), …, θ(6))
A dice factory distinguishes itself from the others by parameters α = (α1, α2, α3, α4, α5, α6)
The probability of producing a dice θ in the factory α is determined by D(θ|α)
Probabilistic model
Selecting a model
– A model can be anything from a simple distribution to a complex stochastic grammar with many implicit probability distributions
– Probabilistic distributions (Gaussian, binomial, etc.)
– Probabilistic graphical models
  • Markov models
  • Hidden Markov models (HMM)
  • Bayesian models
  • Stochastic grammars
Estimating the model parameters (learning): from large sets of trusted examples

Parameter estimation: MLE and MAP
MLE
Given a set of data D (training set), find a model with parameters θ with the maximal likelihood P(D|θ)
Data → model (learning)
– The parameters of the model have to be inferred from the data
– MLE (maximum likelihood estimation) & MAP (maximum a posteriori probability)
Model → data (inference/sampling)
\hat{\theta}_{MLE} = \arg\max_{\theta} P(D|\theta)

MAP
\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta|D) = \arg\max_{\theta} \frac{P(D|\theta)P(\theta)}{P(D)} = \arg\max_{\theta} P(D|\theta)P(\theta)
Example: a loaded dice
Loaded dice: to estimate parameters θ1, θ2, …, θ6, based on N observations D = d1, d2, …, dN
MLE: θi = ni / N, where ni is the number of occurrences of outcome i (the observed frequencies), is the maximum likelihood solution (BSA 11.5)
We can show that P(n|\theta^{MLE}) > P(n|\theta) for any \theta \ne \theta^{MLE}
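A minimal MLE sketch for the loaded-dice example; the observed rolls below are made-up data for illustration:

```python
from collections import Counter

rolls = [6, 6, 2, 6, 3, 6, 1, 6, 6, 5, 6, 6]     # N = 12 made-up observations
N = len(rolls)
counts = Counter(rolls)
theta_mle = {i: counts[i] / N for i in range(1, 7)}
print(theta_mle)    # theta_6 = 8/12; the unseen outcome 4 gets probability 0
```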
When to use MLE
A drawback of MLE is that it can give poor estimations when the data are scarce
– E.g., if you flip a coin twice, you may only get heads; then P(tail) = 0
It may be wiser to apply prior knowledge (e.g., we assume P(tail) is close to 0.5)
– Use MAP instead
Here the prior is used (as in Bayesian statistics):
P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)} = \frac{P(D|\theta)P(\theta)}{\sum_{\theta} P(D|\theta)P(\theta)}
Bayesian statistics
P(θ): prior probability
P(θ|D): posterior probability
P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)} = \frac{P(D|\theta)P(\theta)}{\sum_{\theta} P(D|\theta)P(\theta)}

Example: two dice
Prior probabilities: fair dice 0.99; loaded dice 0.01
Loaded dice: P(6) = 0.5, P(1) = … = P(5) = 0.1
Data: 3 consecutive '6'es
– P(loaded|3 '6's) = P(loaded) · [P(3 '6's|loaded) / P(3 '6's)] = 0.01 · (0.5^3 / C)
– P(fair|3 '6's) = P(fair) · [P(3 '6's|fair) / P(3 '6's)] = 0.99 · ((1/6)^3 / C)
– Model comparison by using the likelihood ratio: P(loaded|3 '6's) / P(fair|3 '6's) < 1
– So the fair dice is more likely to have generated the observation.
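The comparison above, computed explicitly (the shared normalizing constant C cancels in the ratio):

```python
# Ratio P(loaded | 3 '6's) / P(fair | 3 '6's); C = P(3 '6's) cancels.
p_fair, p_loaded = 0.99, 0.01            # prior probabilities
lik_fair = (1 / 6) ** 3                  # P(3 '6's | fair)
lik_loaded = 0.5 ** 3                    # P(3 '6's | loaded)

ratio = (p_loaded * lik_loaded) / (p_fair * lik_fair)
print(ratio)                             # ~0.27 < 1, so the fair dice is favored
```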
Learning from counts: including prior
Use prior knowledge when the data is scarce.
For the multinomial distribution, we can use the Dirichlet distribution with parameters α as the prior.
– Posterior: the posterior for the multinomial distribution with observations n is
P(\theta|n) = \frac{P(n|\theta)\, D(\theta|\alpha)}{P(n)}
Inserting the multinomial distribution and the expression for D(θ|α) yields
P(\theta|n) = \frac{1}{P(n)\, Z(\alpha)\, M(n)} \prod_{k} \theta_k^{n_k + \alpha_k - 1}
We see the posterior is itself like a Dirichlet distribution like the prior, but with different parameters. We don't have to get involved with gamma functions in order to finish the calculation, because we know that both P(θ|n) and D(θ|n+α) are properly normalized probability distributions over θ. This means that all the prefactors must cancel, and
P(\theta|n) = D(\theta|n + \alpha)
Now we have
P(n) = \frac{Z(n + \alpha)}{Z(\alpha)\, M(n)}
– Posterior mean estimator: we then perform an integral to find the posterior mean estimator,
\theta_i^{PME} = \int \theta_i\, D(\theta|n + \alpha)\, d\theta = Z^{-1}(n + \alpha) \int \theta_i \prod_{k} \theta_k^{n_k + \alpha_k - 1}\, d\theta
\theta_i^{PME} = \frac{Z(n + \alpha + \delta_i)}{Z(n + \alpha)}
\theta_i^{PME} = \frac{n_i + \alpha_i}{N + A} \quad (A = \sum_i \alpha_i)
– Equivalent to adding αi as pseudo-counts to the observation ni (BSA 11.5)
– We can forget about statistics and use pseudo-counts in the parameter estimation!
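A sketch of the posterior mean (pseudo-count) estimator; the symmetric prior αi = 1 and the toy rolls are assumptions for illustration:

```python
from collections import Counter

rolls = [6, 6, 2, 6, 3, 6, 1, 6, 6, 5, 6, 6]    # same toy data as the MLE sketch
alpha = {i: 1.0 for i in range(1, 7)}            # assumed symmetric pseudo-counts
counts = Counter(rolls)
N, A = len(rolls), sum(alpha.values())

theta_pme = {i: (counts[i] + alpha[i]) / (N + A) for i in range(1, 7)}
print(theta_pme)   # the unseen outcome 4 now gets (0 + 1)/(12 + 6) instead of 0
```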
Sampling
Probabilistic model with parameter θ → P(x|θ) for event x
Sampling: generate a large set of events xi with probability P(xi|θ)
Random number generator: the function rand() picks a number randomly from the interval [0,1) with the uniform density
Sampling from a probabilistic model → transforming P(xi|θ) to a uniform distribution
– For a finite set X (xi ∈ X), find i such that P(x1) + … + P(xi-1) < rand(0,1) < P(x1) + … + P(xi-1) + P(xi)
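A minimal sketch of the sampling rule described above (the loaded-die pmf is the one from the earlier example):

```python
import random

def sample(pmf):
    """pmf: list of (outcome, probability) pairs summing to 1."""
    u = random.random()                 # rand() in [0, 1)
    cumulative = 0.0
    for outcome, p in pmf:
        cumulative += p
        if u < cumulative:
            return outcome
    return pmf[-1][0]                   # guard against floating-point round-off

loaded = [(1, 0.1), (2, 0.1), (3, 0.1), (4, 0.1), (5, 0.1), (6, 0.5)]
draws = [sample(loaded) for _ in range(10000)]
print(draws.count(6) / len(draws))      # ~0.5
```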
Entropy
Probability distribution P(xi) over K events
H(X) = -∑i P(xi) log P(xi)
– Maximized for the uniform distribution P(xi) = 1/K
– A measure of average uncertainty

Mutual information
Measure of independence of two random variables X and Y
– P(X|Y) = P(X) when X and Y are independent, i.e., P(X,Y)/[P(X)P(Y)] = 1
M(X;Y) = ∑x,y P(x,y) log [P(x,y) / (P(x)P(y))]
– M(X;Y) = 0 when X and Y are independent
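A small sketch computing entropy and mutual information in base 2; the joint table is made up for illustration:

```python
from math import log2

def entropy(p):
    """H = -sum_i p_i log2 p_i."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def mutual_information(joint):
    """joint[x][y] = P(x, y); returns sum_xy P(x,y) log2 P(x,y)/(P(x)P(y))."""
    px = {x: sum(joint[x].values()) for x in joint}
    py = {}
    for x in joint:
        for y, p in joint[x].items():
            py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for x in joint for y, p in joint[x].items() if p > 0)

print(entropy([1 / 6] * 6))                          # maximal: log2(6) ~ 2.585
print(mutual_information({0: {0: 0.25, 1: 0.25},
                          1: {0: 0.25, 1: 0.25}}))   # 0.0: X and Y independent
```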