I529: Machine Learning in Bioinformatics (Spring 2013)
Probabilistic models
Yuzhen Ye
School of Informatics and Computing
Indiana University, Bloomington
Spring 2013

Definitions
  Probabilistic models
–  A model is a system that simulates the object under consideration
–  A probabilistic model is one that produces different outcomes with different probabilities (BSA)
Why probabilistic models
  The biological system being analyzed is stochastic
  Or noisy
  Or completely deterministic, but because a number of hidden variables affecting its behavior are unknown, the observed data might be best explained with a probabilistic model

Probability
  Experiment: a procedure involving chance that leads to different results
  Outcome: the result of a single trial of an experiment
  Event: one or more outcomes of an experiment
  Probability: the measure of how likely an event is
–  Between 0 (will not occur) and 1 (will occur)
Example: a fair 6-sided dice
  Outcome: the possible outcomes of this experiment are 1, 2, 3, 4, 5 and 6
  Events: 1; 6; even
  Probability: outcomes are equally likely to occur
–  P(A) = (the number of ways event A can occur) / (the total number of possible outcomes)
–  P(1)=P(6)=1/6; P(even)=3/6=1/2

Random variable
  Random variables are functions that assign a unique number to each possible outcome of an experiment
  An example
–  Experiment: tossing a coin
–  Outcome space: {heads, tails}
–  X = 1 if heads, 0 if tails
–  More exactly, X is a discrete random variable
–  P(X=1)=1/2, P(X=0)=1/2
Probability distribution
  Probability distribution: the assignment of a probability P(x) to each outcome x.
  A fair dice: outcomes are equally likely to occur → the probability distribution over all six outcomes is P(x)=1/6, x=1,2,3,4,5 or 6.
  A loaded dice: outcomes are unequally likely to occur → the probability distribution over all six outcomes is P(x)=f(x), x=1,2,3,4,5 or 6, but ∑f(x)=1.

Probability mass function (pmf)
  A probability mass function is a function that gives the probability that a discrete random variable is exactly equal to some value; it is often the primary means of defining a discrete probability distribution
  An example
–  X = 1 if heads, 0 if tails
–  P(X) = 1/2 if heads, 1/2 if tails, 0 otherwise
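To make the fair versus loaded contrast concrete, here is a minimal Python sketch (not part of the original slides); the loaded probabilities are the ones used in the loaded-dice example later in the lecture.

    # Sketch: fair vs. loaded dice pmfs
    fair = {x: 1/6 for x in range(1, 7)}
    loaded = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}  # assumed loading

    for name, pmf in [("fair", fair), ("loaded", loaded)]:
        total = sum(pmf.values())                                # a valid pmf sums to 1
        p_even = sum(p for x, p in pmf.items() if x % 2 == 0)    # P(even) as an event probability
        print(f"{name}: sum={total:.3f}, P(even)={p_even:.3f}")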
The probability of a DNA sequence
  Event: observing a DNA sequence S=s1s2…sn, si ∈ {A,C,G,T};
  Random sequence model (or independent and identically distributed, i.i.d., model): si occurs at random with the probability P(si), independent of all other residues in the sequence;
  P(S) = ∏ i=1..n P(si)
  This model will be used as a background model (also called a null hypothesis).
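A minimal Python sketch of the i.i.d. background model (the nucleotide frequencies and the example sequence are assumptions; log probabilities are summed to avoid underflow on long sequences):

    import math

    # Assumed background nucleotide frequencies (must sum to 1)
    background = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}

    def iid_log_prob(seq, p=background):
        """log P(S) = sum_i log P(s_i) under the i.i.d. (random sequence) model."""
        return sum(math.log(p[s]) for s in seq)

    seq = "ACGTACGA"                              # hypothetical example sequence
    print("P(S) =", math.exp(iid_log_prob(seq)))  # (1/4)^8 for a uniform background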
Joint probability
  Two experiments (random variables) X and Y
–  P(X,Y): joint probability (distribution) of X and Y
–  P(X,Y) = P(X|Y)P(Y) = P(Y|X)P(X)
–  P(X|Y) = P(X) if X and Y are independent
  Example: experiment 1 (selecting a dice), experiment 2 (rolling the selected dice)
–  P(y): y=D1 or D2
–  P(i, D1) = P(i|D1)P(D1)
–  P(i|D1) = P(i|D2), independent events

Probability density function (pdf)
  Probability density functions (pdf) are for continuous rather than discrete random variables: f(x)
  A pdf must be integrated over an interval to yield a probability, since P(X = x) = 0:
–  P(a ≤ X ≤ b) = ∫a..b f(x) dx
  Cumulative distribution function (cdf)
–  P(X ≤ x) = ∫-∞..x f(t) dt
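As a rough illustration of the "integrate to get a probability" point, the sketch below (not from the slides) approximates P(a ≤ X ≤ b) for an assumed exponential density f(x) = λe^(-λx) with a midpoint Riemann sum and checks it against the closed-form value:

    import math

    lam = 2.0                                    # assumed rate parameter
    f = lambda x: lam * math.exp(-lam * x)       # exponential pdf

    def prob_interval(a, b, n=100000):
        """Approximate P(a <= X <= b) = integral of f over [a, b] (midpoint rule)."""
        h = (b - a) / n
        return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

    a, b = 0.5, 1.5
    exact = math.exp(-lam * a) - math.exp(-lam * b)   # closed-form CDF difference
    print(prob_interval(a, b), exact)                 # the two values agree closely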
Marginal probability
  The distribution of the marginal variables (the marginal distribution) is obtained by marginalizing over the distribution of the variables being discarded (so the discarded variables are marginalized out)
  P(X) = ∑Y P(X|Y)P(Y)   (for continuous variables, P(x) = ∫ P(x,y) dy)
  Example: experiment 1 (selecting a dice), experiment 2 (rolling the selected dice)
–  P(y): y=D1 or D2
–  P(i) = P(i|D1)P(D1) + P(i|D2)P(D2)
–  P(i|D1) = P(i|D2), independent events
–  P(i) = P(i|D1)(P(D1)+P(D2)) = P(i|D1)

Conditional probability
  Conditioning the joint distribution on a particular observation
  Conditional probability P(X|Y): the measure of how likely an event X happens under the condition Y
  P(x|y) ≡ P(x,y) / P(y) = P(x,y) / ∫ P(x,y) dy
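A minimal Python sketch (with assumed numbers) of the two-dice example, building the joint P(x,y) = P(x|y)P(y) and then recovering the marginal P(x) = ∑y P(x,y) and the conditional P(x|y) = P(x,y)/P(y):

    # Assumed setup: pick dice D1 or D2, then roll it
    p_y = {'D1': 0.5, 'D2': 0.5}                         # P(y): which dice is selected
    p_x_given_y = {
        'D1': {x: 1/6 for x in range(1, 7)},             # fair dice
        'D2': {**{x: 0.1 for x in range(1, 6)}, 6: 0.5}  # loaded dice (assumed)
    }

    # Joint: P(x, y) = P(x|y) P(y)
    joint = {(x, y): p_x_given_y[y][x] * p_y[y] for y in p_y for x in range(1, 7)}

    # Marginal: P(x) = sum_y P(x, y)
    marginal = {x: sum(joint[(x, y)] for y in p_y) for x in range(1, 7)}

    # Conditional recovered from the joint: P(x|y) = P(x, y) / P(y)
    cond = {(x, y): joint[(x, y)] / p_y[y] for (x, y) in joint}

    print(marginal[6])       # P(6) = 0.5*(1/6) + 0.5*0.5 = 0.333...
    print(cond[(6, 'D2')])   # back to 0.5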
Probability models
  A system that produces different outcomes with different probabilities.
  It can simulate a class of objects (events), assigning each an associated probability.
–  Example: two dice D1, D2
  Simple objects (processes) → probability distributions
•  P(i|D1): probability of picking i using dice D1
•  P(i|D2): probability of picking i using dice D2
Typical probability distributions
  Binomial distribution
  Gaussian distribution
  Multinomial distribution
  Poisson distribution
  Dirichlet distribution

Binomial distribution
  An experiment with binary outcomes: 0 or 1;
  Probability distribution of a single experiment: P('1') = p and P('0') = 1-p;
  Probability distribution of N tries of the same experiment:
–  Bi(k '1's out of N tries) = C(N,k) p^k (1-p)^(N-k)
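A minimal Python sketch of the binomial pmf above, using math.comb for C(N,k); the values of N, p and k are illustrative assumptions:

    from math import comb

    def binomial_pmf(k, N, p):
        """Probability of exactly k '1's in N independent trials with P('1') = p."""
        return comb(N, k) * p**k * (1 - p)**(N - k)

    N, p = 10, 0.3                                            # assumed values
    print(binomial_pmf(3, N, p))                              # P(exactly 3 successes)
    print(sum(binomial_pmf(k, N, p) for k in range(N + 1)))   # sums to 1.0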
Gaussian distribution
  When N → ∞, Bi → Gaussian distribution
  The Gaussian (or normal) distribution is a continuous probability distribution with a bell-shaped probability density function defined as:
–  f(x; µ, σ²) = 1/(σ√(2π)) · e^(-(1/2)((x-µ)/σ)²)
  µ is the mean (expectation) and σ² is the variance (σ is the standard deviation).
  If we define a new variable u = (x-µ)/σ, we have:
–  f(x) ∼ 1/√(2π) · e^(-u²/2)
  Standard normal distribution when µ = 0 and σ² = 1
  (Figure from Wikipedia)
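A minimal Python sketch of the Gaussian density and the standardized variable u (the µ and σ values are assumptions):

    import math

    def gaussian_pdf(x, mu, sigma):
        """f(x; mu, sigma^2) = 1/(sigma*sqrt(2*pi)) * exp(-0.5*((x-mu)/sigma)**2)."""
        u = (x - mu) / sigma                  # the standardized variable u
        return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2 * math.pi))

    print(gaussian_pdf(0.0, 0.0, 1.0))        # standard normal peak: ~0.3989
    print(gaussian_pdf(12.0, 10.0, 2.0))      # assumed mu=10, sigma=2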
•  Binomial Distribution. The binomial distribution gives the discrete probability distribution of obtaining exactly n successes out of N Bernoulli trials (where the result of each Bernoulli trial is true with probability p and false with probability q = 1-p):
     P(n|N) = C(N,n) p^n q^(N-n)
   The probability of obtaining more successes than the n observed in a binomial distribution is:
     P = ∑ k=n+1..N C(N,k) p^k q^(N-k)
•  Poisson Distribution. Events occur with a known average rate (λ) and independently of the time since the last event (i.e., it is a Poisson process):
     p(n) = λ^n e^(-λ) / n!
Multinomial distribution
  An experiment with K independent outcomes with probabilities θi, i = 1,…,K, ∑θi = 1.
  Probability distribution of N tries of the same experiment, getting ni occurrences of outcome i, ∑ni = N (n={ni}).

•  Multinomial Distribution. An experiment with K independent outcomes with probabilities θi for i = 1,…,K, ∑θi = 1. The probability of getting ni occurrences of outcome i is given by (n = {ni}):
     P(n|θ) = M^-1(n) ∏ i=1..K θi^ni
   where
     M(n) = ∏i ni! / (∑k nk)! = n1!n2!···nK! / (∑k nk)!

Example: a fair dice
  Probability: outcomes (1,2,…,6) are equally likely to occur
  Probability of rolling 1 dozen times (12) and getting each outcome twice:
–  12!/(2!)^6 · (1/6)^12 ≈ 3.4×10^-3
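A minimal Python sketch that evaluates the multinomial formula for the fair-dice example above (each outcome twice in 12 rolls), reproducing the ≈3.4×10^-3 figure:

    from math import factorial, prod

    def multinomial_prob(counts, thetas):
        """P(n | theta) for counts n_i of each outcome with probabilities theta_i."""
        N = sum(counts)
        coeff = factorial(N) // prod(factorial(n) for n in counts)   # N! / prod(n_i!)
        return coeff * prod(t**n for t, n in zip(thetas, counts))

    counts = [2] * 6                               # each face twice in 12 rolls
    print(multinomial_prob(counts, [1/6] * 6))     # ~3.4e-3 for a fair dice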
Poisson distribution
  Poisson gives the probability of seeing n events over some interval, when there is a probability p of an individual event occurring in that period.
  Example 1: I receive 10 emails every day on average; what's my chance of receiving no email today?

Example: a loaded dice
  Probability: outcomes (1,2,…,6) are unequally likely to occur: P(6)=0.5, P(1)=P(2)=…=P(5)=0.1
  Probability of rolling 1 dozen times (12) and getting each outcome twice:
–  12!/(2!)^6 · (0.5)^2 × (0.1)^10 ≈ 1.87×10^-4

Logistic Linear Regression
Fitting
  Like other regressions, logistic regression makes use of one or more predictor variables that may be either continuous or categorical data. Also, like other linear regression models, the expected value (average value) of the response variable is fit to the predictors; the expected value of a Bernoulli distribution is simply the probability of a case. But, unlike ordinary linear regression, logistic regression is used for predicting binary outcomes (Bernoulli trials) instead of continuous values. Given this difference, it is necessary that logistic regression take the natural logarithm of the odds (referred to as the logit or log-odds) to create a …

Poisson distribution for sequencing coverage modeling
  Assuming uniform distribution of reads:
–  Length of genomic segment: L
–  Number of reads: n
–  Length of each read: l
–  Coverage λ = n·l / L
  How much coverage is enough (or what is sufficient oversampling)?
  Lander-Waterman model: P(x) = (λ^x · e^-λ) / x!
  P(x=0) = e^-λ, where λ is coverage
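A minimal Python sketch applying the Poisson formula to the email example (λ = 10, n = 0) and to Lander-Waterman coverage, where P(x=0) = e^-λ is the chance that a position is not covered; the read counts and lengths are assumed numbers:

    from math import exp, factorial

    def poisson_pmf(n, lam):
        """p(n) = lam^n * e^(-lam) / n!"""
        return lam**n * exp(-lam) / factorial(n)

    # Email example: on average 10 emails/day; chance of no email today
    print(poisson_pmf(0, 10))                        # e^-10 ~ 4.5e-5

    # Sequencing coverage (assumed numbers): coverage lambda = n*l/L
    n_reads, read_len, genome_len = 500_000, 100, 5_000_000
    lam = n_reads * read_len / genome_len            # 10x coverage
    print(poisson_pmf(0, lam))                       # fraction of positions with zero coverage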
•  Gamma Distribution. The gamma distribution has been used to model the rate of evolution at different sites in DNA sequences. The gamma distribution is conjugate to the Poisson, which gives the probability of seeing n events over some interval, when there is a probability p of an individual event occurring in that interval. Since the number of events in an interval is a rate, the gamma distribution is appropriate for modeling probabilities of rates.
     g(x; α, λ) = λ^α x^(α-1) e^(-λx) / Γ(α),   for 0 < x, α, λ < ∞

•  Dirichlet Distribution. In Bayesian statistics, we need distributions over probability parameters to use as prior distributions. A natural choice for a density over probabilities is the Dirichlet distribution:
     D(θ|α) = Z^-1(α) ∏ i=1..K θi^(αi-1) δ(∑ i=1..K θi - 1)
   where α = α1, α2, ···, αK, with αi > 0, are constants specifying the Dirichlet distribution, and the θi satisfy 0 ≤ θi ≤ 1 and sum to 1. The normalizing factor Z for the Dirichlet can be expressed in terms of the gamma function:
     Z(α) = ∫ ∏ i=1..K θi^(αi-1) δ(∑ i=1..K θi - 1) dθ = ∏i Γ(αi) / Γ(∑i αi)

Dirichlet distribution
  Outcomes: θ = (θ1, θ2,…, θK)
  Density: D(θ|α) = Z^-1(α) ∏ i=1..K θi^(αi-1) δ(∑ i=1..K θi - 1)
  Different α gives different probability distribution over θ.
  K=2 → Beta distribution
Example: dice factories
  Dice factories produce all kinds of dice: θ = (θ(1), θ(2),…, θ(6))
  A dice factory distinguishes itself from the others by parameters α = (α1, α2, α3, α4, α5, α6)
  The probability of producing a dice θ in the factory α is determined by D(θ|α)
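A minimal sketch of the dice-factory idea using NumPy's Dirichlet sampler; the α vectors are assumptions, chosen so that one factory produces near-fair dice and the other produces widely varying dice:

    import numpy as np

    rng = np.random.default_rng(0)

    alpha_fair_factory = [10, 10, 10, 10, 10, 10]   # assumed: tends to produce near-fair dice
    alpha_sloppy_factory = [1, 1, 1, 1, 1, 1]       # assumed: produces widely varying dice

    for alpha in (alpha_fair_factory, alpha_sloppy_factory):
        theta = rng.dirichlet(alpha)                 # one dice theta ~ D(theta | alpha)
        print(np.round(theta, 3), theta.sum())       # the six probabilities sum to 1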
Parameter estimation: MLE and MAP

Probabilistic model
  Selecting a model
  Estimating the model parameters (learning): from large sets of trusted examples
–  A model can be anything from a simple distribution to a complex stochastic grammar with many implicit probability distributions
–  Probabilistic distributions (Gaussian, binomial, etc.)
–  Probabilistic graphical models
•  Markov models
•  Hidden Markov models (HMM)
•  Bayesian models
•  Stochastic grammars
MLE
  Given a set of data D (training set), find a model with parameters θ with the maximal likelihood P(D|θ)
  Data → model (learning)
–  The parameters of the model have to be inferred from the data
–  MLE (maximum likelihood estimation) & MAP (maximum a posteriori probability)
  Model → data (inference/sampling)
     θ̂_MLE = arg max_θ P(D|θ)

MAP
     θ̂_MAP = arg max_θ P(θ|D) = arg max_θ P(D|θ)P(θ)/P(D) = arg max_θ P(D|θ)P(θ)
Example: a loaded dice
  Loaded dice: to estimate parameters θ1, θ2,…, θ6, based on N observations D = d1,d2,…,dN
  θi = ni / N, where ni is the occurrence of outcome i (the observed frequencies), is the maximum likelihood solution (BSA 11.5)
–  We can show that P(n|θ_MLE) > P(n|θ) for any θ ≠ θ_MLE
  Learning from counts
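A minimal Python sketch of learning from counts: simulate N rolls of an assumed loaded dice and estimate θi = ni/N from the observed frequencies:

    from collections import Counter
    import random

    random.seed(1)
    true_theta = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}   # assumed loaded dice

    # Simulate N observations, then estimate theta_i = n_i / N
    D = random.choices(list(true_theta), weights=list(true_theta.values()), k=1000)
    counts = Counter(D)
    theta_mle = {i: counts[i] / len(D) for i in range(1, 7)}
    print(theta_mle)    # observed frequencies approximate the true parameters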
When to use MLE
  A drawback of MLE is that it can give poor estimations when the data are scarce
–  E.g., if you flip a coin twice, you may only get heads, then P(tail) = 0
  It may be wiser to apply prior knowledge (e.g., we assume P(tail) is close to 0.5)
–  Use MAP instead
  Here the prior is used (as in Bayesian statistics):
     P(θ|D) = P(D|θ)P(θ) / P(D) = P(D|θ)P(θ) / ∑θ P(D|θ)P(θ)
Example: two dice
  Bayesian statistics
–  P(θ) → prior probability
–  P(θ|D) → posterior probability
  MAP
–  Prior probabilities: fair dice 0.99; loaded dice: 0.01
–  Loaded dice: P(6)=0.5, P(1)=…=P(5)=0.1
–  Data: 3 consecutive '6'es:
–  P(loaded|3 '6's) = P(loaded)·[P(3 '6's|loaded)/P(3 '6's)] = 0.01·(0.5^3 / C)
–  P(fair|3 '6's) = P(fair)·[P(3 '6's|fair)/P(3 '6's)] = 0.99·((1/6)^3 / C)
–  Model comparison by using the likelihood ratio: P(loaded|3 '6's) / P(fair|3 '6's) < 1
–  So the fair dice is more likely to have generated the observation.
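A minimal Python sketch of the comparison above, using the numbers from the slide; the shared normalizer C cancels in the ratio:

    priors = {"fair": 0.99, "loaded": 0.01}
    likelihood = {"fair": (1/6)**3, "loaded": 0.5**3}    # P(three 6's | model)

    unnorm = {m: priors[m] * likelihood[m] for m in priors}          # prior * likelihood
    ratio = unnorm["loaded"] / unnorm["fair"]
    posterior = {m: v / sum(unnorm.values()) for m, v in unnorm.items()}

    print(ratio)        # ~0.27 < 1, so the fair dice better explains the data
    print(posterior)    # P(fair | data) ~ 0.79, P(loaded | data) ~ 0.21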
Learning from counts: including prior
  Use prior knowledge when the data is scarce
  For the multinomial distribution, we can use the Dirichlet distribution with parameters α as the prior.
–  Posterior:
     P(θ|n) = P(n|θ)D(θ|α) / P(n)
  Inserting the multinomial distribution and the expression for D(θ|α) yields,
     P(θ|n) = ∏k θk^(nk+αk-1) / (P(n) Z(α) M(n))
  We don't have to get involved with gamma functions in order to finish the calculation, because we know that both P(θ|n) and D(θ|n+α) are properly normalized probability distributions over θ. This means that all the pre-factors must cancel and,
     P(θ|n) = D(θ|n+α)
  We see the posterior is itself a Dirichlet distribution like the prior, but with different parameters. Now we have,
     P(n) = Z(n+α) / (Z(α) M(n))
–  Posterior mean estimator
  We then perform an integral to find the posterior mean estimator,
     θi^PME = ∫ θi D(θ|n+α) dθ = Z^-1(n+α) ∫ θi ∏k θk^(nk+αk-1) dθ = Z(n+α+δi) / Z(n+α)
  (where δi adds 1 to the i-th count), which gives
     θi^PME = (ni + αi) / (N + A),   with N = ∑k nk and A = ∑k αk
–  Equivalent to adding αi as pseudo-counts to the observation ni (BSA 11.5)
–  We can forget about statistics and use pseudo-counts in the parameter estimation!
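A minimal Python sketch of the posterior mean estimator θi^PME = (ni + αi)/(N + A); the observed counts and the pseudo-counts α are assumptions, chosen to show how the prior keeps estimates away from zero when data are scarce:

    counts = {1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 3}    # assumed: only 5 observed rolls
    alpha  = {i: 1.0 for i in range(1, 7)}            # assumed uniform pseudo-counts

    N = sum(counts.values())
    A = sum(alpha.values())

    theta_mle = {i: counts[i] / N for i in counts}                      # MLE: zeros for unseen faces
    theta_pme = {i: (counts[i] + alpha[i]) / (N + A) for i in counts}   # posterior mean estimator

    print(theta_mle)   # P(1) = 0 under MLE
    print(theta_pme)   # P(1) = 1/11 with pseudo-counts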
Sampling
  Probabilistic model with parameter θ → P(x|θ) for event x;
  Sampling: generate a large set of events xi with probability P(xi|θ);
  Random number generator: function rand() picks a number randomly from the interval [0,1) with the uniform density;
  Sampling from a probabilistic model → transforming P(xi|θ) to a uniform distribution
–  For a finite set X (xi ∈ X), find i s.t. P(x1)+…+P(xi-1) < rand(0,1) < P(x1)+…+P(xi-1)+P(xi)
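A minimal Python sketch of the cumulative-sum rule on the slide: draw rand() in [0,1) and return the first outcome whose cumulative probability exceeds it (the pmf is an assumed loaded dice):

    import random
    from collections import Counter

    def sample(pmf, rng=random.random):
        """Return x_i such that the cumulative P up to x_{i-1} <= rand() < cumulative P up to x_i."""
        r = rng()
        cum = 0.0
        for x, p in pmf.items():
            cum += p
            if r < cum:
                return x
        return x   # guard against floating-point round-off

    loaded = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}   # assumed model P(x|theta)
    draws = Counter(sample(loaded) for _ in range(10000))
    print({x: draws[x] / 10000 for x in sorted(draws)})         # empirical frequencies close to the pmf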
Entropy
  Probability distributions P(xi) over K events
  H(X) = -∑ P(xi) log P(xi)
–  Maximized for uniform distribution P(xi)=1/K
–  A measure of average uncertainty

Mutual information
  Measure of independence of two random variables X and Y
  P(X|Y)=P(X), X and Y are independent → P(X,Y)/[P(X)P(Y)] = 1
  M(X;Y) = ∑x,y P(x,y) log[P(x,y)/(P(x)P(y))]
–  0 → independent
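A minimal Python sketch of both formulas, evaluated on small assumed distributions (logs are base 2):

    from math import log2

    def entropy(p):
        """H(X) = -sum_i P(x_i) log2 P(x_i)."""
        return -sum(pi * log2(pi) for pi in p.values() if pi > 0)

    def mutual_information(joint):
        """M(X;Y) = sum_{x,y} P(x,y) log2[P(x,y) / (P(x)P(y))]; joint maps (x, y) -> P(x, y)."""
        px, py = {}, {}
        for (x, y), p in joint.items():
            px[x] = px.get(x, 0) + p
            py[y] = py.get(y, 0) + p
        return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

    uniform = {x: 1/4 for x in range(4)}
    print(entropy(uniform))                      # 2 bits: maximal for K=4 outcomes

    independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}   # assumed independent pair
    correlated  = {(0, 0): 0.5, (1, 1): 0.5}                        # assumed fully dependent pair
    print(mutual_information(independent))       # 0: independent
    print(mutual_information(correlated))        # 1 bit: fully dependent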