Knowledge Discovery and Data Mining 1 (VO) (707.003)
Probabilistic Latent Semantic Analysis
Denis Helic
KTI, TU Graz
Jan 16, 2014
Big picture: KDDM
Figure: Overview of the knowledge discovery process: the process steps (Preprocessing, Transformation, Data Mining) rest on mathematical tools (Probability Theory, Linear Algebra, Information Theory, Statistical Inference) and infrastructure (Map-Reduce).
Outline
1. Introduction and Recap
2. Probabilistic Generative Models
3. Topic Models
4. Probabilistic Latent Semantic Analysis
Introduction and Recap
Short recap: SVD and LSA
Singular Value Decomposition
Let M ∈ R^{m×n} be a matrix and let r be the rank of M (the rank of a matrix is the largest number of linearly independent rows or columns).
Then we can find matrices U, V, and Σ with the following properties:
U ∈ R^{m×r} is a column-orthonormal matrix
V ∈ R^{n×r} is a column-orthonormal matrix
Σ ∈ R^{r×r} is a diagonal matrix.
The matrix M can then be written as:
M = U Σ V^T
Introduction and Recap
Short recap: SVD and LSA
Figure: The form of a singular-value decomposition, M = U Σ V^T (Figure 11.5 from "Mining Massive Datasets"). The rows of V^T are orthonormal, and the elements of the diagonal matrix Σ are called the singular values of M. Example 11.8 in the same book uses a rank-2 matrix representing ratings of movies.
Introduction and Recap
Short recap: SVD and LSA
Let M be a utility matrix with people's ratings of movies
The rows of M are people, the columns of M are movies
The rows of U are people, the columns of U are concepts
U connects people to concepts
Then the rows of V^T are concepts, the columns of V^T are movies
V connects movies to concepts
Σ represents the importance of the concepts
Introduction and Recap
Short recap: SVD and LSA
Let M be a term-document matrix with term occurrences in the documents
The rows of M are terms, the columns of M are documents
The rows of U are terms, the columns of U are concepts
U connects terms to concepts
Then the rows of V^T are concepts, the columns of V^T are documents
V connects documents to concepts
Σ represents the importance of the concepts
Introduction and Recap
Short recap: SVD and LSA
Vector Space Model: documents are represented as term vectors
Cosine similarity is used to compute scores
The Vector Space Model cannot cope with two classic problems arising in natural languages
Synonymy: two words having the same meaning
Polysemy: one word having multiple meanings
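A minimal sketch of vector space scoring with cosine similarity, assuming a toy term-document count matrix (all data here is made up for illustration):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (illustrative counts)
M = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1]], dtype=float)

# Query as a term vector over the same vocabulary
q = np.array([1.0, 1.0, 0.0])

# Cosine similarity between the query and every document column
scores = (M.T @ q) / (np.linalg.norm(M, axis=0) * np.linalg.norm(q))
print(scores)   # one score per document; higher means more similar
```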
Introduction and Recap
Short recap: SVD and LSA
In latent semantic analysis (LSA), or latent semantic indexing (LSI), we use SVD to create a low-rank approximation of the term-document matrix
We select the k largest singular values and create a rank-k approximation M_k of the original matrix
We thus map each term/document to a k-dimensional space of "concepts"
These concepts are hidden (latent) in the collection
They represent the semantics of the terms and documents, e.g. the topics of terms and documents
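As a sketch of this rank-k approximation with numpy (the matrix and the choice of k are illustrative):

```python
import numpy as np

# Illustrative term-document matrix (rows = terms, columns = documents)
M = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 3, 1, 1],
              [0, 1, 0, 2]], dtype=float)

# Full SVD: M = U @ diag(s) @ Vt, singular values s in descending order
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep the k largest singular values -> rank-k approximation M_k
k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Columns of Vt[:k, :] give the documents' coordinates in the k "concept" dimensions
print(M_k.round(2))
print(Vt[:k, :].round(2))
```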
Introduction and Recap
Short recap: SVD and LSA
By computing a low-rank approximation of the original term-document matrix, SVD brings together terms with similar co-occurrences
Retrieval quality may actually be improved by the approximation!
As we reduce k, recall improves
A value of k in the low hundreds tends to increase precision as well (this suggests that a suitable k addresses some of the challenges of synonymy)
Retrieval works by folding the query into the low-rank space
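One common way to fold a query into the concept space is q_k = Σ_k^{-1} U_k^T q; a self-contained sketch of that recipe, with the same illustrative matrix as above and a made-up query:

```python
import numpy as np

# Same illustrative term-document matrix as above (rows = terms, columns = documents)
M = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 3, 1, 1],
              [0, 1, 0, 2]], dtype=float)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2

# Fold a query term vector q into the concept space: q_k = Sigma_k^{-1} U_k^T q
q = np.array([1.0, 0.0, 1.0, 0.0])            # illustrative query over the 4 terms
q_k = np.diag(1.0 / s[:k]) @ U[:, :k].T @ q

# Compare the folded query to the documents' concept coordinates (columns of Vt[:k, :])
doc_k = Vt[:k, :]
sims = (doc_k.T @ q_k) / (np.linalg.norm(doc_k, axis=0) * np.linalg.norm(q_k))
print(sims.round(3))   # cosine similarity of the query to each document
```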
Introduction and Recap
Disadvantages of LSA
Statistical foundation is missing
SVD assumes normally distributed data
But, term occurrence is not normally distributed
Still, it often works remarkably well! Why?
Matrix entries are weighted (e.g. tf-idf) and those weighted entries
may be normally distributed
Probabilistic Generative Models
Recap: Model-based methods
Statistical inference is based on fitting a probabilistic model to data
The idea is based on a probabilistic or generative model
Such models assign a probability to observing specific data examples, e.g. observing words in a text document
Generative models are a powerful way to encode specific assumptions about how unknown parameters interact to create data
Probabilistic Generative Models
Recap: Generative models
How does a generative model work?
It defines a conditional probability distribution over data given a hypothesis: P(D|h)
Given h, we generate data from the conditional distribution P(D|h)
Generative models have many advantages
The main disadvantage is that fitting the models can be more complicated than an algorithmic approach
Probabilistic Generative Models
Recap: Inference
(Statistical) inference is the reverse of the generation process
We are given some data D, e.g. a collection of documents
We want to estimate the model, or more precisely the parameters of
the hypothesis h that are most likely to have generated the data
Figure: Generation runs from the hypothesis h to the data D via P(D|h); inference runs in the opposite direction, from the observed data D back to h.
Probabilistic Generative Models
Recap: Naive Bayes document models
We discussed generative models in connection with Naive Bayes
classification
We introduced the multinomial generative model and the Bernoulli generative model
In the multinomial model we assume that the documents are generated from a multinomial distribution, i.e. the vector of term occurrence counts in a document is a multinomial random variable
In the Bernoulli model we assume that the documents are generated
from a multivariate Bernoulli distribution
The distributions were conditioned on the document class
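A minimal sketch of the multinomial document model (the class-conditional word distribution and the document length are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Class-conditional word distribution p(w | class) over a 5-word vocabulary (illustrative)
p_w_given_class = np.array([0.4, 0.3, 0.1, 0.1, 0.1])

# Generating a 20-word document: its term-count vector is one multinomial draw
doc_counts = rng.multinomial(20, p_w_given_class)
print(doc_counts)   # counts per vocabulary word, summing to 20
```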
Topic Models
Topic models
Document class is something that we observe in our data (at least in
the training data)
Other observable entities: documents and words
However, there are some entities which are present but not
observable, i.e. they are hidden
They are latent
E.g. concepts in LSA
Let us call those entities topics
Topic Models
Topic models
A topic model is a probabilistic generative model that we can use to
generate the observable data, i.e. documents and words
In the other direction: inference
When we observe a specific data instance we can infer the model
Probabilistic model: we will have joint probability distributions
Typically we will work with conditional probability distributions
Topic Models
Probabilistic topic models
Each document is a probability distribution over topics
Distribution over topics represents the essence, the body, or the gist
of a given document
Each topic is a probability distribution over words
Topic “Education”: School, Students, Education, University, ...
Topic “Budget”: Million, Finance, Tax, Program, ...
Topic Models
Document generation process
1. For each document d choose a mixture of topics z
2. For every word slot draw a topic from the mixture with probability p(z|d)
3. Then draw a word from the topic with probability p(w|z)
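A sketch of this two-stage sampling process; the topic-word and document-topic distributions below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["school", "student", "tax", "million"]        # illustrative vocabulary
p_w_given_z = np.array([[0.45, 0.45, 0.05, 0.05],      # topic 0, roughly "Education"
                        [0.05, 0.05, 0.45, 0.45]])     # topic 1, roughly "Budget"
p_z_given_d = np.array([0.7, 0.3])                     # topic mixture of one document

doc = []
for _ in range(10):                        # 10 word slots
    z = rng.choice(2, p=p_z_given_d)       # draw a topic with probability p(z|d)
    w = rng.choice(4, p=p_w_given_z[z])    # draw a word with probability p(w|z)
    doc.append(vocab[w])
print(doc)
```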
Topic Models
Document generation process
Figure: Figure from slides by Thomas Hofmann
Probabilistic Latent Semantic Analysis
Document generation process
Figure: Graphical model representations of (b) the mixture of unigrams model and (c) the pLSI/aspect model (Figure 3 from the LDA paper by Blei et al.).
Probabilistic Latent Semantic Analysis
Distributions
We are interested in the joint probability of the observable variables: p(d, w)
However, we have a joint probability of the observable and latent variables: p(d, w, z)
Thus, we have to marginalize over z to obtain p(d, w):

p(d, w) = ∑_z p(d, w, z) = ∑_z p(d, w|z) p(z)
Probabilistic Latent Semantic Analysis
Recap: Conditional independence
Definition
Suppose P(C) > 0. Events A and B are conditionally independent given C if:
P(A ∩ B|C) = P(A|C) P(B|C)
Probabilistic Latent Semantic Analysis
Distributions
We made the same assumption in Naive Bayes classification
Documents and words are conditionally independent given the topic:
p(d, w|z) = p(d|z) p(w|z)

p(d, w) = ∑_z p(d|z) p(w|z) p(z)
Probabilistic Latent Semantic Analysis
Distributions
p(d, w) = ∑_z p(d|z) p(w|z) p(z)

This is the symmetric formulation of pLSA
We select a topic z, then a document d with probability p(d|z), and then words for that document with probability p(w|z)
We repeat the process for all documents
Probabilistic Latent Semantic Analysis
Distributions
We can reformulate the last equation
Let us first rewrite p(d, z) using the product rule:

p(d, z) = p(z) p(d|z) = p(d) p(z|d)
Probabilistic Latent Semantic Analysis
Distributions
We can now substitute this into the symmetric formulation:

p(d, w) = ∑_z p(d|z) p(w|z) p(z)
        = ∑_z p(z|d) p(w|z) p(d)
        = p(d) ∑_z p(z|d) p(w|z)
Probabilistic Latent Semantic Analysis
Distributions
This is the asymmetric formulation
Thus, we first pick a document with probability p(d) and then select all words for that document from p(w|d), given by:

p(d, w) = p(w|d) p(d)  =⇒  p(w|d) = ∑_z p(w|z) p(z|d)
Probabilistic Latent Semantic Analysis
pLSA Decomposition
p(w_i|d_j) = ∑_{k=1}^{K} p(w_i|z_k) p(z_k|d_j)
Figure: Figure from slides by Josef Sivic
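In matrix form this decomposition is simply a product of two stochastic matrices; a small sketch with invented numbers:

```python
import numpy as np

# p(w|z): one column per topic, each column a distribution over 4 words (invented)
P_w_given_z = np.array([[0.5, 0.0],
                        [0.4, 0.1],
                        [0.1, 0.4],
                        [0.0, 0.5]])

# p(z|d): one column per document, each column a distribution over the 2 topics
P_z_given_d = np.array([[0.9, 0.2, 0.5],
                        [0.1, 0.8, 0.5]])

# p(w|d) = sum_k p(w|z_k) p(z_k|d), i.e. a matrix product
P_w_given_d = P_w_given_z @ P_z_given_d
print(P_w_given_d)
print(P_w_given_d.sum(axis=0))   # each column sums to 1: a distribution over words
```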
Probabilistic Latent Semantic Analysis
pLSA comparison with SVD

p(d, w) = ∑_z p(w|z) p(z) p(d|z)

Figure: The form of a singular-value decomposition, M = U Σ V^T (Figure 11.5 from "Mining Massive Datasets").
Probabilistic Latent Semantic Analysis
pLSA comparison with SVD
Word probabilities given topics p(w|z): matrix U
Document probabilities given topics p(d|z): matrix V
Topic probabilities p(z): matrix Σ
Difference: values in all matrices are normalized and non-negative
They are probabilities
Probabilistic Latent Semantic Analysis
Parameter inference
We will infer the parameters using the Maximum Likelihood Estimator (MLE)
First, we need to write down the likelihood function
Let n(w_i, d_j) be the number of occurrences of word w_i in document d_j
p(w_i, d_j) is the probability of observing a single occurrence of word w_i in document d_j
Then, the probability of observing n(w_i, d_j) occurrences of word w_i in document d_j is given by:

p(w_i, d_j)^{n(w_i, d_j)}
Probabilistic Latent Semantic Analysis
Parameter inference
The probability of observing the complete document collection is then given by the product of the probabilities of observing every single word in every document, with the corresponding number of occurrences
That is the likelihood:

L = ∏_{i=1}^{m} ∏_{j=1}^{n} p(w_i, d_j)^{n(w_i, d_j)}

log L = ∑_{i=1}^{m} ∑_{j=1}^{n} n(w_i, d_j) log p(w_i, d_j)
      = ∑_{i=1}^{m} ∑_{j=1}^{n} n(w_i, d_j) log(∑_{l=1}^{K} p(w_i|z_l) p(z_l) p(d_j|z_l))
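A sketch of evaluating this log-likelihood for given parameter matrices (shapes follow the notation above; the helper name is mine, and zero counts are masked so that 0 · log 0 terms are skipped):

```python
import numpy as np

def plsa_log_likelihood(n_wd, p_w_given_z, p_z, p_d_given_z):
    """log L = sum_ij n(w_i, d_j) * log( sum_l p(w_i|z_l) p(z_l) p(d_j|z_l) ).

    n_wd:        (W, D) word-document count matrix
    p_w_given_z: (W, K), p_z: (K,), p_d_given_z: (D, K)
    """
    # p(w_i, d_j) = sum_l p(w_i|z_l) p(z_l) p(d_j|z_l)
    p_wd = (p_w_given_z * p_z) @ p_d_given_z.T     # (W, D)
    mask = n_wd > 0                                # skip zero counts
    return np.sum(n_wd[mask] * np.log(p_wd[mask]))
```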
Probabilistic Latent Semantic Analysis
EM algorithm
We cannot maximize the likelihood analytically because of the logarithm of the sum
A standard procedure is to use an algorithm called Expectation-Maximization (EM)
This is an iterative method to estimate the parameters of models with latent variables
Each iteration consists of two steps: expectation step (E) and
maximization step (M)
Probabilistic Latent Semantic Analysis
EM algorithm
In the E step we create a function of the expectation of the
log-likelihood using the current parameter estimates
In the M step we compute parameters which maximize the
expectation of the log-likelihood
These parameter estimates are used to determine the distribution of
the latent variables in the next E step
Let us illustrate the EM algorithm in a general case
Probabilistic Latent Semantic Analysis
EM algorithm
We observe some data D generated by a probabilistic model with
parameters θ and some latent variables z
We are interested in the likelihood of that data D given the
parameters θ: p(D|θ)
However, there exists a joint probability distribution of the data D and the latent variables z: p(D, z|θ)
Thus, to obtain p(D|θ) we have to marginalize out z:

p(D|θ) = ∑_z p(D|z, θ) p(z|θ)
Probabilistic Latent Semantic Analysis
EM algorithm
We are now interested in maximizing this likelihood, which is equivalent to maximizing the log-likelihood:

log p(D|θ) = log(∑_z p(D|z, θ) p(z|θ))

Jensen's inequality for concave functions such as log gives us:

E[f(x)] ≤ f(E[x])
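A quick numerical check of this inequality for the concave log (the sample is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.1, 5.0, size=100_000)   # arbitrary positive sample

print(np.mean(np.log(x)))   # E[log x]
print(np.log(np.mean(x)))   # log(E[x]); Jensen: E[log x] <= log(E[x])
```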
Probabilistic Latent Semantic Analysis
EM algorithm
Let q(z) be an arbitrary distribution over the latent variables. Then:

log(∑_z p(D|z, θ) p(z|θ)) = log(∑_z q(z) · p(D|z, θ) p(z|θ) / q(z))
                          = log(E[p(D|z, θ) p(z|θ) / q(z)])

where the expectation is taken with respect to q(z). By Jensen's inequality this is greater than or equal to:

log(E[p(D|z, θ) p(z|θ) / q(z)]) ≥ E[log(p(D|z, θ) p(z|θ) / q(z))]
Probabilistic Latent Semantic Analysis
EM algorithm
E[log(p(D|z, θ) p(z|θ) / q(z))] is then a lower bound on the log-likelihood
Thus, we can maximize this lower bound
The EM algorithm maximizes exactly this lower bound
Probabilistic Latent Semantic Analysis
EM algorithm
E[log(p(D|z, θ) p(z|θ) / q(z))]
  = ∑_z q(z) log(p(D|z, θ) p(z|θ) / q(z))
  = ∑_z q(z) log(p(z|D, θ) p(D|θ) / q(z))
  = ∑_z q(z) log p(D|θ) + ∑_z q(z) log(p(z|D, θ) / q(z))
  = log p(D|θ) − ∑_z q(z) log(q(z) / p(z|D, θ))

This will have its maximum when ∑_z q(z) log(q(z) / p(z|D, θ)) = 0
This is the case when q(z) = p(z|D, θ)
Probabilistic Latent Semantic Analysis
EM algorithm
p(z|D, θ) is the posterior of z
∑_z q(z) log(q(z) / p(z|D, θ)) is the Kullback-Leibler (KL) divergence between q(z) and this posterior
Thus, in the E step we use the current values of the parameters to calculate the posterior of z
The M step is then problem dependent
Probabilistic Latent Semantic Analysis
EM algorithm for pLSA
E step: compute the posterior of the latent topics

p(z|w, d) = p(z, w, d) / p(w, d)
          = p(d) p(z|d) p(w|z) / ∑_z p(d) p(z|d) p(w|z)
          = p(z|d) p(w|z) / ∑_z p(z|d) p(w|z)

M step: re-estimate the parameters

p(w|z) ∝ ∑_d n(d, w) p(z|d, w)
p(z|d) ∝ ∑_w n(d, w) p(z|d, w)
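A compact numpy sketch of these updates (dense and naive: the full K × W × D posterior is materialized, and niceties like tempering, smoothing, or convergence checks are left out; the function name and defaults are mine):

```python
import numpy as np

def plsa_em(n_wd, K, n_iter=100, seed=0):
    """pLSA via EM (asymmetric formulation): returns p(w|z) and p(z|d).

    n_wd: (W, D) word-document count matrix, K: number of topics.
    """
    rng = np.random.default_rng(seed)
    W, D = n_wd.shape

    # Random initialization, normalized so that columns are probability distributions
    p_w_given_z = rng.random((W, K))
    p_w_given_z /= p_w_given_z.sum(axis=0)
    p_z_given_d = rng.random((K, D))
    p_z_given_d /= p_z_given_d.sum(axis=0)

    for _ in range(n_iter):
        # E step: p(z|d, w) proportional to p(z|d) p(w|z), shape (K, W, D)
        post = p_w_given_z.T[:, :, None] * p_z_given_d[:, None, :]
        post /= post.sum(axis=0, keepdims=True) + 1e-12

        # M step: p(w|z) from sum over documents, p(z|d) from sum over words
        weighted = post * n_wd[None, :, :]           # n(d, w) * p(z|d, w)
        p_w_given_z = weighted.sum(axis=2).T         # (W, K)
        p_w_given_z /= p_w_given_z.sum(axis=0) + 1e-12
        p_z_given_d = weighted.sum(axis=1)           # (K, D)
        p_z_given_d /= p_z_given_d.sum(axis=0) + 1e-12

    return p_w_given_z, p_z_given_d
```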
Probabilistic Latent Semantic Analysis
Example
IPython Notebook examples
Slightly modified code from: https://github.com/hitalex/PLSA
http://kti.tugraz.at/staff/denis/courses/kddm1/plsa.ipynb
Command Line
ipython notebook --pylab=inline plsa.ipynb
Probabilistic Latent Semantic Analysis
Example
Ratings matrix (rows = users, columns = movies):

User     Matrix  Alien  Star Wars  Casablanca  Titanic
Joe         1      1        1           0         0
Jim         3      3        3           0         0
John        4      4        4           0         0
Jack        5      5        5           0         0
Jill        0      0        0           4         4
Jenny       0      0        0           5         5
Jane        0      0        0           2         2
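For illustration, the plsa_em sketch from the earlier slide could be run on exactly this matrix, treating movies as "words" and users as "documents" (choosing two topics here is an assumption):

```python
import numpy as np

# Rows = movies (Matrix, Alien, Star Wars, Casablanca, Titanic),
# columns = users (Joe, Jim, John, Jack, Jill, Jenny, Jane)
n_wd = np.array([[1, 3, 4, 5, 0, 0, 0],
                 [1, 3, 4, 5, 0, 0, 0],
                 [1, 3, 4, 5, 0, 0, 0],
                 [0, 0, 0, 0, 4, 5, 2],
                 [0, 0, 0, 0, 4, 5, 2]], dtype=float)

p_w_given_z, p_z_given_d = plsa_em(n_wd, K=2, n_iter=200)
print(p_w_given_z.round(2))   # movies grouped into the two latent "genres"
print(p_z_given_d.round(2))   # each user's mixture over the two "genres"
```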
Probabilistic Latent Semantic Analysis
Example
Slightly perturbed ratings matrix (Jill and Jane now also rate Alien):

User     Matrix  Alien  Star Wars  Casablanca  Titanic
Joe         1      1        1           0         0
Jim         3      3        3           0         0
John        4      4        4           0         0
Jack        5      5        5           0         0
Jill        0      2        0           4         4
Jenny       0      0        0           5         5
Jane        0      1        0           2         2
Probabilistic Latent Semantic Analysis
Example
Figure: Eight selected factors from a 128-factor decomposition (from Hofmann, 2000). The displayed word stems are the 10 most probable words in the class-conditional distribution P(w|z), from top to bottom in descending order; the eight factors correspond to "segment 1", "segment 2", "matrix 1", "matrix 2", "line 1", "line 2", "power 1", and "power 2". For the word "segment", Document 1 (about medical image segmentation) has P(z_k | d_1, w_j = "segment") = (0.951, 0.0001, ...) and P(w_j = "segment" | d_1) = 0.06, while Document 2 has P(z_k | d_2, w_j = "segment") = (0.025, 0.867, ...).
Probabilistic Latent Semantic Analysis
Performance
In IR typically superior to both: VSM and LSA
(We skip details here.)

"The performance of a retrieval system based on this model (PLSI) was found superior to that of both the vector space based similarity (cos) and a non-probabilistic latent semantic indexing (LSI) method."

From Th. Hofmann, 2000