Machine learning: a probabilistic approach
CS340: Machine Learning
Review for midterm
Kevin Murphy

• We want to make models of data so we can find patterns and predict the future.
• We will use probability theory to represent our uncertainty about what the right model is; this will induce uncertainty in our predictions.
• We will update our beliefs about the model as we acquire data (this is called "learning"); as we get more data, the uncertainty in our predictions will decrease.
Outline
• Probability basics
• Probability density functions
• Parameter estimation
• Prediction
• Multivariate probability models
• Conditional probability models
• Information theory

Probability basics
• Chain (product) rule
  p(A, B) = p(A)p(B|A)   (1)
• Marginalization (sum) rule
  p(B) = ∑_a p(A = a, B)   (2)
• Hence we derive Bayes rule
  p(A|B) = p(A, B) / p(B)   (3)
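These three rules are easy to sanity-check numerically. A minimal Python sketch, with made-up values for the prior p(A) and the likelihood p(B|A):

```python
# Toy discrete example (all numbers are made up for illustration).
p_A = 0.01                      # prior: p(A = 1)
p_B_given_A = {1: 0.9, 0: 0.1}  # likelihood: p(B = 1 | A = a)

# Sum rule (2): p(B = 1) = sum_a p(A = a) p(B = 1 | A = a)
p_B = sum((p_A if a == 1 else 1 - p_A) * p_B_given_A[a] for a in (0, 1))

# Chain rule (1) + Bayes rule (3): p(A = 1 | B = 1)
p_A_given_B = p_A * p_B_given_A[1] / p_B
print(p_B, p_A_given_B)  # 0.108  0.0833...
```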
PMF
• A probability mass function (PMF) is defined on a discrete random variable X ∈ {1, . . . , K} and satisfies
  1 = ∑_{x=1}^K P(X = x)   (4)
  0 ≤ P(X = x) ≤ 1   (5)
• This is just a histogram!

PDF
• A probability density function (PDF) is defined on a continuous random variable X ∈ IR or X ∈ IR+ or X ∈ [0, 1] etc. and satisfies
  1 = ∫ p(X = x) dx   (6)
  0 ≤ p(X = x)   (7)
• Note that it is possible that p(X = x) > 1, since the density gets multiplied by dx:
  P(x ≤ X ≤ x + dx) ≈ p(x) dx
• Statistics books often write fX(x) for pdfs and use P() for pmfs.
Moments of a distribution
• Mean (first moment)
  µ = E[X] = ∑_x x p(x)   (8)
• Variance (second central moment)
  σ² = Var[X] = ∑_x (x − µ)² p(x) = E[X²] − µ²   (10)

CDFs
The cumulative distribution function is defined as
  F(x) = P(X ≤ x) = ∫_{−∞}^x f(x) dx   (9)
[Figure: the standard normal pdf p(x) and the corresponding Gaussian cdf, plotted over x ∈ [−3, 3].]
Quantiles of a distribution
  P(a ≤ X ≤ b) = FX(b) − FX(a)   (11)
e.g. if Z ∼ N(0, 1), then FZ(z) = Φ(z), so
  P(−1.96 ≤ Z ≤ 1.96) = Φ(1.96) − Φ(−1.96)   (12)
                      = 0.975 − 0.025 = 0.95   (13)
Hence if X ∼ N(µ, σ),
  P(−1.96σ < X − µ < 1.96σ) = 1 − 2 × 0.025 = 0.95   (14)
Hence the interval µ ± 1.96σ contains 0.95 of the probability mass.
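These quantile computations are easy to verify numerically; a small sketch using scipy.stats.norm (the values of µ and σ here are arbitrary):

```python
from scipy.stats import norm

# Equations (12)-(13): Phi(1.96) - Phi(-1.96) ~ 0.95
print(norm.cdf(1.96) - norm.cdf(-1.96))   # ~0.950
print(norm.ppf(0.975))                    # inverse cdf recovers ~1.960

# Equation (14): for X ~ N(mu, sigma), mu +/- 1.96 sigma contains ~0.95 mass
mu, sigma = 5.0, 2.0                      # arbitrary choice
print(norm.cdf(mu + 1.96 * sigma, mu, sigma)
      - norm.cdf(mu - 1.96 * sigma, mu, sigma))
```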
Outline
• Probability basics √
• Probability density functions
• Parameter estimation
• Prediction
• Multivariate probability models
• Conditional probability models
• Information theory
Bernoulli distribution
• X ∈ {0, 1}, p(X = 1) = θ, p(X = 0) = 1 − θ
• X ∼ Be(θ)
• E[X] = θ

Multinomial distribution
• X ∈ {1, . . . , K}, p(X = k) = θk, X ∼ Mu(θ)
• 0 ≤ θk ≤ 1
• ∑_{k=1}^K θk = 1
• E[Xi] = θi
(Univariate) Gaussian distribution
• X ∈ IR, X ∼ N(µ, σ), µ ∈ IR, σ ∈ IR+
  N(x|µ, σ) def= (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))   (15)
• E[X] = µ
[Figure: pdf of the standard normal, plotted over x ∈ [−3, 3].]

Beta distribution
• X ∈ [0, 1], X ∼ Be(α1, α0), 0 ≤ α0, α1
• E[X] = α1/(α1 + α0)
  Be(X|α1, α0) ∝ X^(α1−1) (1 − X)^(α0−1)
[Figure: Beta pdfs for (a, b) = (0.10, 0.10), (1.00, 1.00), (2.00, 3.00), (8.00, 4.00).]
Dirichlet distribution
• X ∈ [0, 1]^K, X ∼ Dir(α1, . . . , αK), 0 ≤ αk
• E[Xi] = αi/α, where α = ∑_k αk.
  Dir(X|α1, . . . , αK) ∝ ∏_k Xk^(αk−1)
Outline
• Probability basics √
• Probability density functions √
• Parameter estimation
• Prediction
• Multivariate probability models
• Conditional probability models
• Information theory
Parameter estimation: MLE
• Likelihood of iid data D = {xn}_{n=1}^N
  L(θ) = p(D|θ) = ∏_n p(xn|θ)   (16)
• Log likelihood
  ℓ(θ) = log p(D|θ) = ∑_n log p(xn|θ)   (17)
• MLE = Maximum likelihood estimate
  θ̂ = arg max_θ L(θ) = arg max_θ ℓ(θ)   (18)
  found by solving ∂L/∂θi = 0   (19)

MLE for Bernoullis
  p(Xn|θ) = θ^Xn (1 − θ)^(1−Xn)   (20)
  ℓ(θ) = ∑_n Xn log θ + ∑_n (1 − Xn) log(1 − θ)   (21)
       = N1 log θ + N0 log(1 − θ)   (22)
where
  N1 = ∑_n I(Xn = 1)   (23)
hence
  dℓ/dθ = N1/θ − N0/(1 − θ) = 0   (24)
  θ̂ = N1/(N1 + N0)   (25)

MLE for Multinomials
  p(Xn|θ) = ∏_k θk^I(Xn=k)   (26)
  ℓ(θ) = ∑_n ∑_k I(xn = k) log θk   (27)
       = ∑_k Nk log θk   (28)
Adding a Lagrange multiplier to enforce ∑_k θk = 1,
  ℓ̃ = ∑_k Nk log θk + λ(1 − ∑_k θk)   (29)
  ∂ℓ̃/∂θk = Nk/θk − λ = 0   (30)
  θ̂k = Nk/N   (31)

Parameter estimation: Bayesian
  p(θ|D) ∝ p(D|θ)p(θ)   (32)
         = [∏_n p(xn|θ)] p(θ)   (33)
We say p(θ) is conjugate to p(D|θ) if p(θ|D) is in the same family as p(θ). This allows recursive (sequential) updating.
MLE is a point estimate. Bayes computes the full posterior, a natural model of uncertainty (e.g., posterior variance, entropy, credible intervals, etc.).
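A small numeric sketch of these counting formulas on toy data (the data values are made up):

```python
import numpy as np

# Bernoulli MLE, equations (23)-(25): theta = N1 / (N1 + N0)
X = np.array([1, 0, 1, 1, 0, 1, 1, 0])   # toy coin flips
N1, N0 = X.sum(), len(X) - X.sum()
print(N1 / (N1 + N0))                     # 0.625

# Multinomial MLE, equation (31): theta_k = N_k / N
Y = np.array([0, 2, 1, 2, 2, 0])          # toy data with K = 3 states
Nk = np.bincount(Y, minlength=3)
print(Nk / Nk.sum())                      # [0.333 0.167 0.5]
```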
Parameter estimation: MAP
For many models, as |D| → ∞,
  p(θ|D) → δ(θ − θ̂MAP) → δ(θ − θ̂MLE)   (34)
where
  θ̂MAP = arg max_θ p(D|θ)p(θ)   (35)
The MAP (maximum a posteriori) estimate is a posterior mode. It is also called a penalized likelihood estimate, since
  θ̂MAP = arg max_θ log p(D|θ) − λ c(θ)   (36)
where c(θ) = − log p(θ) is a penalty or regularizer, and λ is the strength of the regularizer.

Bayes estimate for Bernoulli
  Xn ∼ Be(θ)   (37)
  θ ∼ Beta(α1, α0)   (38)
  p(θ|D) ∝ [∏_n θ^Xn (1 − θ)^(1−Xn)] θ^(α1−1) (1 − θ)^(α0−1)   (39)
         = θ^N1 (1 − θ)^N0 θ^(α1−1) (1 − θ)^(α0−1)   (40)
         ∝ Beta(α1 + N1, α0 + N0)   (41)
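A minimal sketch of this conjugate Beta update and the resulting point estimates (the prior hyper-parameters and counts are made up):

```python
# Beta-Bernoulli conjugate update, equations (37)-(41)
a1, a0 = 2.0, 2.0          # made-up Beta prior hyper-parameters
N1, N0 = 5, 3              # made-up counts of 1s and 0s

a1_post, a0_post = a1 + N1, a0 + N0        # posterior is Beta(a1+N1, a0+N0)

theta_mle = N1 / (N1 + N0)                 # 0.625
# Posterior mode (MAP, eq (35)); valid when both parameters exceed 1
theta_map = (a1_post - 1) / (a1_post + a0_post - 2)   # 0.6
theta_mean = a1_post / (a1_post + a0_post)            # ~0.583
print(theta_mle, theta_map, theta_mean)
```

Note how the prior shrinks both the mode and the mean away from the raw frequency 0.625 toward 0.5.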
Bayes estimate for Multinomial
  Xn ∼ Mu(θ)   (42)
  θ ∼ Dir(α1, . . . , αK)   (43)
  p(θ|D) ∝ [∏_n ∏_k θk^I(Xn=k)] ∏_k θk^(αk−1)   (44)
         = ∏_k θk^Nk θk^(αk−1)   (45)
         ∝ Dir(α1 + N1, . . . , αK + NK)   (46)

Outline
• Probability basics √
• Probability density functions √
• Parameter estimation √
• Prediction
• Multivariate probability models
• Conditional probability models
• Information theory
Predicting the future
The posterior predictive density is obtained by Bayesian model averaging
  p(X|D) = ∫ p(X|θ) p(θ|D) dθ   (47)
Plug-in principle: if p(θ|D) ≈ δ(θ − θ̂), then
  p(X|D) ≈ p(X|θ̂)   (48)

Predictive for Bernoullis
  p(θ|D) = Beta(α1 + N1, α0 + N0)   (49)
         def= Beta(α1′, α0′)   (50)
  p(X = 1|D) = ∫ p(X = 1|θ) p(θ|D) dθ   (51)
             = ∫ θ p(θ|D) dθ   (52)
             = E[θ|D]   (53)
             = α1′/(α1′ + α0′)   (54)
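In code, the Bayesian predictive is just a smoothed frequency; a sketch with a uniform Beta(1, 1) prior and made-up counts:

```python
# Posterior predictive for a Bernoulli, equation (54)
a1, a0 = 1.0, 1.0      # uniform Beta(1,1) prior (Laplace's rule of succession)
N1, N0 = 3, 0          # made-up data: three 1s, no 0s

p_pred = (a1 + N1) / (a1 + a0 + N1 + N0)   # 0.8: hedges toward 0.5
p_plugin = N1 / (N1 + N0)                  # 1.0: plug-in MLE is overconfident
print(p_pred, p_plugin)
```

Note how model averaging never assigns zero probability to an unseen outcome, unlike the plug-in MLE.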
Cross validation
We can use cross validation (CV) to find the model with the best predictive performance.
Outline
• Probability basics √
• Probability density functions √
• Parameter estimation √
• Prediction √
• Multivariate probability models
• Conditional probability models
• Information theory
Multivariate probability models
• Each data case Xn is now a vector of variables Xni, i = 1 : p, n = 1 : N
• e.g., Xni ∈ {1, . . . , K} is the i'th word in the n'th sentence
• e.g., Xni ∈ IR is the i'th attribute in the n'th feature vector
• We need to define p(Xn|θ) for feature vectors of potentially variable size.

Independent features (bag of words model)
  p(Xn|θ) = ∏_i p(Xni|θi)   (55)
  p(D|θ) = ∏_n ∏_i p(Xni|θi)   (56)
e.g., Xni ∈ {0, 1}, product of Bernoullis:
  p(Xn|θ) = ∏_i θi^Xni (1 − θi)^(1−Xni)   (57)
e.g., Xni ∈ {1, . . . , K}, product of multinomials:
  p(Xn|θ) = ∏_i ∏_k θik^I(Xni=k)   (58)

MLE for bag of words model
We can estimate each θi separately, e.g. for a product of multinomials
  p(D|θ) = ∏_n ∏_i ∏_k θik^I(Xni=k)   (59)
         = ∏_i ∏_k θik^Nik   (60)
  Nik def= ∑_n I(Xni = k)   (61)
  ℓ(θi) = ∑_k Nik log θik   (62)
  θ̂ik = Nik/Ni, where Ni = ∑_k Nik   (63)

Bayes estimate for bag of words model
  p(θ|D) ∝ ∏_i p(θi) [∏_n ∏_i p(Xni|θi)]   (64)
We can update each θi separately, e.g. for a product of multinomials
  p(θ|D) = ∏_i Dir(θi|αi1, . . . , αiK) [∏_k θik^Nik]   (65)
         ∝ ∏_i Dir(θi|αi1 + Ni1, . . . , αiK + NiK)   (66)
Factored prior × factored likelihood = factored posterior.
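A sketch of the count matrix Nik and the resulting MLE and posterior-mean estimates for a product-of-multinomials model (toy data; here Ni = ∑_k Nik = N since every case has every feature):

```python
import numpy as np

# Toy bag-of-words data: N=4 cases, p=3 features, each Xni in {0, ..., K-1}
X = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 1],
              [0, 1, 1]])
N, p, K = X.shape[0], X.shape[1], 2

# N_ik = sum_n I(X_ni = k), equation (61); one count vector per feature i
Nik = np.stack([np.bincount(X[:, i], minlength=K) for i in range(p)])

theta_mle = Nik / N                                   # equation (63)
alpha = np.ones((p, K))                               # Dir(1, ..., 1) prior per feature
post = alpha + Nik                                    # equation (66)
theta_post_mean = post / post.sum(axis=1, keepdims=True)
print(theta_mle)
print(theta_post_mean)
```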
(Discrete time) Markov model
  p(Xn|θ) = p(Xn1|π) ∏_{t=2}^p p(Xnt|Xn,t−1, A)   (67)
For a discrete state space, Xnt ∈ {1, . . . , K},
  πi = p(X1 = i)   (68)
  Aij = p(Xt = j|Xt−1 = i)   (69)
so
  p(Xn|θ) = ∏_i πi^I(Xn1=i) ∏_t ∏_i ∏_j Aij^I(Xnt=j, Xn,t−1=i)   (70)
          = ∏_i πi^Ni1 ∏_i ∏_j Aij^Nij   (71)
where
  Ni1 def= I(Xn1 = i)   (72)
  Nij def= ∑_t I(Xn,t = j, Xn,t−1 = i)   (73)

MLE for Markov model
  p(D|θ) = ∏_n ∏_i πi^I(Xn1=i) ∏_t ∏_i ∏_j Aij^I(Xnt=j, Xn,t−1=i)   (74)
         = ∏_i πi^Ni1 ∏_i ∏_j Aij^Nij   (75)
where now the counts are pooled across sequences,
  Ni1 def= ∑_n I(Xn1 = i)   (76)
  Nij def= ∑_n ∑_t I(Xn,t = j, Xn,t−1 = i)   (77)
Maximizing the likelihood gives
  π̂i = Ni1/N   (78)
  Âij = Nij / ∑_{j′} Nij′   (79)

Stationary distribution for a Markov chain
πi is the fraction of time we spend in state i, given by the principal eigenvector of
  Aπ = π   (80)
For a two-state chain with transition probabilities α = A12 and β = A21, balance the flow across the cut set:
  π1 α = π2 β   (81)
Since π1 + π2 = 1, we have
  π1 = β/(α + β),  π2 = α/(α + β)   (82)

Outline
• Probability basics √
• Probability density functions √
• Parameter estimation √
• Prediction √
• Multivariate probability models √
• Conditional probability models
• Information theory
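A sketch of the Markov-model MLE on one toy sequence, plus the stationary distribution. (With the row-stochastic convention Aij = p(Xt = j|Xt−1 = i) used above, the stationary vector in equation (80) is a left eigenvector, i.e., an eigenvector of Aᵀ.)

```python
import numpy as np

x = np.array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1])  # toy sequence, K = 2 states
K = 2

# Transition counts N_ij, equation (77) (one sequence, so no sum over n)
Nij = np.zeros((K, K))
for t in range(1, len(x)):
    Nij[x[t - 1], x[t]] += 1
A_hat = Nij / Nij.sum(axis=1, keepdims=True)   # equation (79)

# Stationary distribution: eigenvector of A^T with eigenvalue 1, eq (80)
evals, evecs = np.linalg.eig(A_hat.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()                                  # normalize to a distribution
print(A_hat)                                    # [[0.4 0.6], [0.5 0.5]]
print(pi)                                       # ~[0.455 0.545]
```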
Conditional density estimation
• Goal: learn p(y|x) where x is the input and y is the output.
• Generative model
  p(y|x) ∝ p(x|y)p(y)   (83)
• Discriminative model, e.g., logistic regression
  p(y|x) = σ(w^T x)   (84)

Naive Bayes = conditional bag of words model
Generative model
  p(y|x) ∝ p(x|y)p(y)   (85)
in which the features of x are conditionally independent given y:
  p(y|x) ∝ p(y) [∏_{i=1}^p p(xi|y)]   (86)
e.g., for multinomials
  p(y|x) ∝ ∏_c πc^I(y=c) [∏_i ∏_k θcik^I(xi=k, y=c)]   (87)

MLE for Naive Bayes
  p(D|θ, π) = [∏_c πc^Nc] [∏_c ∏_i ∏_k θcik^Ncik]   (88)
  Ncik = ∑_n I(xni = k, yn = c)   (89)
  θ̂cik = Ncik / ∑_{k′} Ncik′   (90)
  π̂c = Nc/N   (91)
We estimate the parameters for each feature separately, partitioning the data based on the label y.

Conditional Markov model
  p(x|y) = p(x1|y) ∏_t p(xt|xt−1, y)   (92)
Fit a separate Markov model for every class of y. See homework 5.
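A compact sketch of the naive Bayes counting estimates and a prediction via equation (86), on made-up labeled data:

```python
import numpy as np

# Toy data: N=6 cases, p=2 discrete features with K=2 values, C=2 classes
X = np.array([[0, 1], [0, 0], [1, 1], [1, 1], [0, 1], [1, 0]])
y = np.array([0, 0, 0, 1, 1, 1])
C, p, K = 2, X.shape[1], 2

pi_hat = np.bincount(y, minlength=C) / len(y)        # equation (91)
theta = np.zeros((C, p, K))
for c in range(C):
    Xc = X[y == c]
    for i in range(p):                               # equations (89)-(90)
        theta[c, i] = np.bincount(Xc[:, i], minlength=K) / len(Xc)

# p(y = c | x) proportional to pi_c * prod_i p(x_i | y = c), equation (86)
x_new = np.array([0, 1])
scores = pi_hat * np.prod(theta[:, np.arange(p), x_new], axis=1)
print(scores / scores.sum())                         # [0.667 0.333]
```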
Summary
• Probability basics: Bayes rule, pdfs, pmfs
• Probability density functions: Bernoulli, Multinomial; Gaussian,
Beta, Dirichlet
• Parameter estimation: MLE, Bayesian
• Prediction: Plug-in, Bayesian, cross validation
• Multivariate probability models: Bag of words, Markov models
• Conditional probability models: Conditional BOW (Naive Bayes),
Conditional Markov
• Information theory
Entropy
• Measure of uncertainty.
• Definition for a PMF pk, k = 1 : K:
  H(p) = −∑_k pk log2 pk   (93)
• e.g. the binary entropy function, p1 = θ, p0 = 1 − θ:
  H(θ) = −[p(X = 1) log2 p(X = 1) + p(X = 0) log2 p(X = 0)]   (94)
       = −[θ log2 θ + (1 − θ) log2(1 − θ)]   (95)

Joint entropy
  H(X, Y) = −∑_{x,y} p(x, y) log p(x, y)   (96)
If X ⊥ Y, then H(X, Y) = H(X) + H(Y).
Pf:
  H(X, Y) = −∑_{x,y} p(x)p(y) log p(x)p(y)   (97)
          = −∑_{x,y} p(x)p(y) log p(x) − ∑_{x,y} p(x)p(y) log p(y)   (98)
          = H(X) + H(Y)
In general,
  H(X, Y) ≤ H(X) + H(Y)   (99)
Pf:
  I(X, Y) = H(X) + H(Y) − H(X, Y) ≥ 0   (100)
Conditional entropy
  H(Y|X) def= ∑_x p(x) H(Y|X = x)   (101)
          = H(X, Y) − H(X)   (102)
In hw 5, you showed that H(Y|X) ≤ H(Y).
Pf:
  H(Y|X) = H(X, Y) − H(X) ≤ H(X) + H(Y) − H(X) = H(Y)   (103)
But ∃y. H(X|y) > H(X): conditioning on a particular value can increase uncertainty.

Mutual information
  I(X; Y) = ∑_{y∈Y} ∑_{x∈X} p(x, y) log [p(x, y) / (p(x)p(y))]   (104)
In hw 5, you showed that
  I(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)   (105)
Substituting H(Y|X) = H(X, Y) − H(X) yields
  I(X, Y) = H(X) + H(Y) − H(X, Y)   (106)

Mutual information ≥ 0
I(X, Y) = 0 if X ⊥ Y. Pf:
  I(X; Y) = ∑_{y∈Y} ∑_{x∈X} p(x)p(y) log [p(x)p(y) / (p(x)p(y))] = ∑_y ∑_x p(x)p(y) log 1 = 0   (107)
I(X, Y) ≥ 0. Pf:
  I(X, Y) = KL(p(x, y)||p(x)p(y)) ≥ 0   (108)

Relative entropy
  D(p||q) def= ∑_k pk log (pk/qk)   (109)
           = ∑_k pk log pk − ∑_k pk log qk   (110)
           = −∑_k pk log qk − H(p)   (111)
Hence D(p||q) measures the extra number of bits needed to encode data if we use q instead of p.
D(p||q) = 0 if p = q. Pf: trivial.
D(p||q) ≥ 0. Pf: use Jensen's inequality.
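These information-theoretic quantities are one-liners in numpy; a sketch that also checks identity (106) numerically (the joint distribution is made up):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_k p_k log2 p_k, equation (93); treats 0 log 0 as 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    """D(p||q) = sum_k p_k log2(p_k / q_k), equation (109)."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

pxy = np.array([[0.3, 0.2],
                [0.1, 0.4]])                  # made-up joint p(x, y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# I(X, Y) = KL(p(x,y) || p(x)p(y)), equation (108)
mi = kl(pxy.ravel(), np.outer(px, py).ravel())
print(entropy(np.array([0.5, 0.5])))          # 1.0 bit
print(mi)
# Check I(X, Y) = H(X) + H(Y) - H(X, Y), equation (106)
assert np.isclose(mi, entropy(px) + entropy(py) - entropy(pxy.ravel()))
```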
MDL principle
Minimum description length principle.
To encode an event X = x which occurs with probability p(x) takes − log p(x) bits. The total cost of the data D and the model H is
  L(D, H) = − log p(D|H) − log p(H)   (113)
  (#bits total = #bits for data + #bits for model)
The MDL principle says: pick the best model
  H∗ = arg min_H L(D, H)   (114)
This is equivalent to the MAP hypothesis
  H∗ = arg max_H log p(D|H) + log p(H)   (115)