Knowledge Discovery and Data Mining 1 (VO) (707.003)
Probabilistic Latent Semantic Analysis
Denis Helic, KTI, TU Graz
Jan 16, 2014

Big picture: KDDM
Mathematical tools (Probability Theory, Linear Algebra, Information Theory, Statistical Inference), infrastructure (Map-Reduce), and the knowledge discovery process (Preprocessing, Transformation, Data Mining).

Outline
1 Introduction and Recap
2 Probabilistic Generative Models
3 Topic Models
4 Probabilistic Latent Semantic Analysis

Introduction and Recap

Short recap: SVD and LSA
Singular Value Decomposition. Let M ∈ R^{m×n} be a matrix and let r be the rank of M (the rank of a matrix is the largest number of linearly independent rows or columns). Then we can find matrices U, V, and Σ with the following properties:
U ∈ R^{m×r} is a column-orthonormal matrix.
V ∈ R^{n×r} is a column-orthonormal matrix; note that we always use V in its transposed form, so it is the rows of V^T that are orthonormal.
Σ ∈ R^{r×r} is a diagonal matrix; its diagonal elements are called the singular values of M.
The matrix M can then be written as:

M = U \Sigma V^T

Figure: The form of a singular-value decomposition (Figure 11.5 from "Mining Massive Datasets").

Let M be a utility matrix with people's ratings of movies.
The rows of M are people, the columns of M are movies.
The rows of U are people, the columns of U are concepts: U connects people to concepts.
The rows of V^T are concepts, the columns of V^T are movies: V connects movies to concepts.
Σ represents the importance of the concepts.

Let M be a term-document matrix with term occurrences in the documents.
The rows of M are terms, the columns of M are documents.
The rows of U are terms, the columns of U are concepts: U connects terms to concepts.
The rows of V^T are concepts, the columns of V^T are documents: V connects documents to concepts.
Σ represents the importance of the concepts.

Vector Space Model: documents are represented as term vectors, and cosine similarity is used to compute scores.
The Vector Space Model cannot cope with two classic problems arising in natural language:
Synonymy: two words having the same meaning.
Polysemy: one word having multiple meanings.

In latent semantic analysis (LSA), also called latent semantic indexing (LSI), we use the SVD to create a low-rank approximation of the term-document matrix.
We select the k largest singular values and create an approximation M_k of the original matrix.
We thus map each term and document to a k-dimensional space of "concepts".
These concepts are hidden (latent) in the collection; they represent the semantics of the terms and documents, e.g. the topics of terms and documents.
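The rank-k approximation described above is easy to compute with off-the-shelf linear algebra. The following minimal sketch (not part of the original slides; the toy term-document matrix and the choice k = 2 are invented for illustration) shows the truncation step with numpy:

import numpy as np

# Toy term-document matrix: rows are terms, columns are documents (made-up counts).
M = np.array([
    [2.0, 3.0, 0.0, 0.0],   # "ship"
    [1.0, 2.0, 0.0, 0.0],   # "boat"
    [0.0, 0.0, 4.0, 3.0],   # "tree"
    [0.0, 0.0, 2.0, 5.0],   # "forest"
])

# Full SVD: M = U diag(s) V^T
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the k largest singular values (the "concepts")
k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", s.round(3))
print("rank-%d approximation:" % k)
print(M_k.round(2))

# Documents (and folded-in queries) can then be compared in the
# k-dimensional concept space given by the rows of diag(s[:k]) @ Vt[:k, :].
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]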
By computing a low-rank approximation of the original term-document matrix, the SVD brings together terms with similar co-occurrences.
Retrieval quality may actually be improved by the approximation!
As we reduce k, recall improves.
A value of k in the low hundreds tends to increase precision as well (this suggests that a suitable k addresses some of the challenges of synonymy).
Retrieval is performed by folding the query into the low-rank space.

Disadvantages of LSA
A statistical foundation is missing: the SVD assumes normally distributed data, but term occurrences are not normally distributed.
Still, it often works remarkably well. Why?
One explanation: the matrix entries are weighted (e.g. with tf-idf), and those weighted entries may be approximately normally distributed.

Probabilistic Generative Models

Recap: Model-based methods
Statistical inference is based on fitting a probabilistic model to the data.
The idea is based on a probabilistic or generative model.
Such models assign a probability to observing specific data examples, e.g. observing the words in a text document.
Generative models are a powerful method to encode specific assumptions about how unknown parameters interact to create data.

Recap: Generative models
How does a generative model work?
It defines a conditional probability distribution over the data given a hypothesis: P(D|h).
Given h, we generate data from the conditional distribution P(D|h).
Generative models have many advantages; the main disadvantage is that fitting the models can be more complicated than an algorithmic approach.

Recap: Inference
(Statistical) inference is the reverse of the generation process.
We are given some data D, e.g. a collection of documents.
We want to estimate the model, or more precisely the parameters of the hypothesis h, that are most likely to have generated the data.
Generation: h → D via P(D|h); inference: D → h.

Recap: Naive Bayes document models
We discussed generative models in connection with Naive Bayes classification.
We introduced the multinomial generative model and the Bernoulli generative model.
In the multinomial model we assume that documents are generated from a multinomial distribution, i.e. the numbers of occurrences of terms in a document form a multinomial random variable.
In the Bernoulli model we assume that documents are generated from a multivariate Bernoulli distribution.
In both cases the distributions are conditioned on the document class.
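As a concrete illustration of the multinomial document model just recapped, here is a minimal sketch (not from the slides; the vocabulary and the class-conditional probabilities are invented for illustration) that samples the term counts of a document given its class:

import numpy as np

rng = np.random.default_rng(0)

vocabulary = ["school", "student", "tax", "million"]

# Hypothetical class-conditional term distributions P(w | class)
p_w_given_class = {
    "education": np.array([0.45, 0.40, 0.10, 0.05]),
    "budget":    np.array([0.05, 0.10, 0.40, 0.45]),
}

def generate_document(doc_class, length):
    """Draw the term counts of one document from the multinomial model."""
    counts = rng.multinomial(length, p_w_given_class[doc_class])
    return dict(zip(vocabulary, counts))

print(generate_document("education", length=20))
print(generate_document("budget", length=20))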
Topic Models

Topic models
The document class is something that we observe in our data (at least in the training data).
Other observable entities: documents and words.
However, there are also entities which are present but not observable, i.e. they are hidden; they are latent. E.g. the concepts in LSA.
Let us call those entities topics.

A topic model is a probabilistic generative model that we can use to generate the observable data, i.e. documents and words.
In the other direction we have inference: when we observe a specific data instance, we can infer the model.
It is a probabilistic model: we will have joint probability distributions, and typically we will work with conditional probability distributions.

Probabilistic topic models
Each document is a probability distribution over topics.
The distribution over topics represents the essence, the body, or the gist of a given document.
Each topic is a probability distribution over words.
Topic "Education": school, students, education, university, ...
Topic "Budget": million, finance, tax, program, ...

Document generation process
1 For each document d choose a mixture of topics z.
2 For every word slot draw a topic from the mixture with probability p(z|d).
3 Then draw a word from the topic with probability p(w|z).

Figure: Document generation process (figure from slides by Thomas Hofmann).

Probabilistic Latent Semantic Analysis

Figure: Graphical model representation of the mixture of unigrams model and the pLSI/aspect model (figure from "Latent Dirichlet Allocation" by Blei et al.).
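The three-step generation process above translates directly into sampling code. The following minimal sketch (not from the slides; vocabulary, topics, and all probabilities are invented for illustration) generates one document by repeatedly drawing a topic from p(z|d) and then a word from p(w|z):

import numpy as np

rng = np.random.default_rng(1)

words  = ["school", "student", "tax", "million"]
topics = ["education", "budget"]

# p(w | z): one row per topic, one column per word (made-up values)
p_w_given_z = np.array([
    [0.50, 0.40, 0.05, 0.05],   # topic "education"
    [0.05, 0.05, 0.40, 0.50],   # topic "budget"
])

def generate_document(p_z_given_d, n_words):
    """For each word slot draw z ~ p(z|d), then w ~ p(w|z)."""
    doc = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=p_z_given_d)
        w = rng.choice(len(words), p=p_w_given_z[z])
        doc.append(words[w])
    return doc

# A document whose topic mixture p(z|d) leans towards "education"
print(generate_document(np.array([0.8, 0.2]), n_words=10))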
Distributions
We are interested in the joint probability of the observable variables: p(d, w).
However, we have a joint probability of the observable and the latent variables: p(d, w, z).
Thus, we have to marginalize over z to obtain p(d, w):

p(d, w) = \sum_z p(d, w, z) = \sum_z p(d, w \mid z)\, p(z)

Recap: Conditional independence
Definition. Suppose P(C) > 0. Events A and B are conditionally independent given C if:

P(A \cap B \mid C) = P(A \mid C)\, P(B \mid C)

Distributions
We made the same assumption in Naive Bayes classification.
Documents and words are conditionally independent given the topic:

p(d, w \mid z) = p(d \mid z)\, p(w \mid z)

p(d, w) = \sum_z p(d \mid z)\, p(w \mid z)\, p(z)

This is the symmetric formulation of pLSA.
We select a topic z, then with probability p(d|z) a document d, and then with probability p(w|z) the words for that document.
We repeat the process for all documents.

We can reformulate the last equation.
Let us first write p(d, z) in two ways using the product rule:

p(d, z) = p(z)\, p(d \mid z) = p(d)\, p(z \mid d)

We can now substitute this into the symmetric equation:

p(d, w) = \sum_z p(d \mid z)\, p(w \mid z)\, p(z) = \sum_z p(z \mid d)\, p(w \mid z)\, p(d) = p(d) \sum_z p(z \mid d)\, p(w \mid z)

This is the asymmetric formulation.
Thus, we first pick a document with p(d) and then select all words for that document from p(w|d), given by:

p(d, w) = p(w \mid d)\, p(d) \implies p(w \mid d) = \sum_z p(w \mid z)\, p(z \mid d)

pLSA decomposition

p(w_i \mid d_j) = \sum_{k=1}^{K} p(w_i \mid z_k)\, p(z_k \mid d_j)

Figure: pLSA as a matrix decomposition (figure from slides by Josef Sivic).

pLSA comparison with SVD

p(d, w) = \sum_z p(w \mid z)\, p(z)\, p(d \mid z)

Figure: The form of a singular-value decomposition (Figure 11.5 from "Mining Massive Datasets"), shown for comparison.
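In matrix form the decomposition above is simply a product of two stochastic matrices, which is what makes the comparison with the SVD natural. A minimal sketch (not from the slides; all probabilities are invented for illustration):

import numpy as np

# p(w | z): rows indexed by words, columns by topics (columns sum to 1)
P_w_given_z = np.array([
    [0.50, 0.05],
    [0.40, 0.05],
    [0.05, 0.40],
    [0.05, 0.50],
])

# p(z | d): rows indexed by topics, columns by documents (columns sum to 1)
P_z_given_d = np.array([
    [0.9, 0.2],
    [0.1, 0.8],
])

# p(w | d) = sum_k p(w | z_k) p(z_k | d) -- a single matrix product
P_w_given_d = P_w_given_z @ P_z_given_d

print(P_w_given_d)           # shape (n_words, n_docs)
print(P_w_given_d.sum(0))    # sanity check: every column sums to 1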
Word probabilities given topics, p(w|z): correspond to the matrix U.
Document probabilities given topics, p(d|z): correspond to the matrix V.
Topic probabilities p(z): correspond to the matrix Σ.
Difference: in pLSA the values in all matrices are normalized and non-negative; they are probabilities.

Parameter inference
We will infer the parameters using the Maximum Likelihood Estimator (MLE).
First, we need to write down the likelihood function.
Let n(w_i, d_j) be the number of occurrences of word w_i in document d_j.
p(w_i, d_j) is the probability of observing a single occurrence of word w_i in document d_j.
Then the probability of observing n(w_i, d_j) occurrences of word w_i in document d_j is p(w_i, d_j)^{n(w_i, d_j)}.

The probability of observing the complete document collection is the product of the probabilities of observing every single word in every document with the corresponding number of occurrences. That is the likelihood:

L = \prod_{i=1}^{m} \prod_{j=1}^{n} p(w_i, d_j)^{n(w_i, d_j)}

Taking the logarithm:

\log L = \sum_{i=1}^{m} \sum_{j=1}^{n} n(w_i, d_j) \log p(w_i, d_j) = \sum_{i=1}^{m} \sum_{j=1}^{n} n(w_i, d_j) \log \sum_{l=1}^{K} p(w_i \mid z_l)\, p(z_l)\, p(d_j \mid z_l)

EM algorithm
We cannot maximize the likelihood analytically because of the logarithm of the sum.
A standard procedure is to use the Expectation-Maximization (EM) algorithm.
This is an iterative method to estimate the parameters of models with latent variables.
Each iteration consists of two steps: an expectation step (E) and a maximization step (M).

In the E step we create a function for the expectation of the log-likelihood using the current parameter estimates.
In the M step we compute the parameters which maximize this expected log-likelihood.
These parameter estimates are then used to determine the distribution of the latent variables in the next E step.
Let us illustrate the EM algorithm in the general case.

We observe some data D generated by a probabilistic model with parameters θ and some latent variables z.
We are interested in the likelihood of the data D given the parameters θ: p(D|θ).
However, the model gives us the joint probability distribution of the data D and the latent variables z: p(D, z|θ).
Thus, to obtain p(D|θ) we have to marginalize out z:

p(D \mid \theta) = \sum_z p(D \mid z, \theta)\, p(z \mid \theta)

We are now interested in maximizing this likelihood, which is equivalent to maximizing the log-likelihood:

\log p(D \mid \theta) = \log \sum_z p(D \mid z, \theta)\, p(z \mid \theta)

Jensen's inequality for a concave function f, such as log, gives us:

E[f(x)] \leq f(E[x])

For any distribution q(z) over the latent variables we can write:

\log \sum_z p(D \mid z, \theta)\, p(z \mid \theta) = \log \sum_z q(z)\, \frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)} = \log E_q\!\left[\frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)}\right]

By Jensen's inequality this is greater than or equal to:

\log E_q\!\left[\frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)}\right] \geq E_q\!\left[\log \frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)}\right]
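A quick numerical check of this bound (not from the slides; a toy model with a single observation D and two latent states, all numbers invented for illustration) confirms that the right-hand side is a lower bound on log p(D|θ) for an arbitrary q, and that the bound is tight when q equals the posterior, which is exactly the conclusion drawn next:

import numpy as np

p_z   = np.array([0.6, 0.4])      # p(z | θ) for z in {0, 1}
p_D_z = np.array([0.2, 0.7])      # p(D | z, θ)

log_lik = np.log(np.sum(p_D_z * p_z))              # log p(D | θ)

def lower_bound(q):
    """E_q[ log( p(D|z,θ) p(z|θ) / q(z) ) ] for a distribution q over z."""
    return np.sum(q * np.log(p_D_z * p_z / q))

posterior = p_D_z * p_z / np.sum(p_D_z * p_z)      # p(z | D, θ)

print(log_lik)                                     # ≈ -0.916
print(lower_bound(np.array([0.5, 0.5])))           # smaller: ≈ -1.003
print(lower_bound(posterior))                      # equal to log_lik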
E_q[\log(p(D \mid z, \theta)\, p(z \mid \theta) / q(z))] is therefore a lower bound on the log-likelihood.
Thus, we can maximize this lower bound.
The EM algorithm maximizes exactly this lower bound.

Expanding the lower bound:

E_q\!\left[\log \frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)}\right]
= \sum_z q(z) \log \frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)}
= \sum_z q(z) \log \frac{p(z \mid D, \theta)\, p(D \mid \theta)}{q(z)}
= \sum_z q(z) \log p(D \mid \theta) + \sum_z q(z) \log \frac{p(z \mid D, \theta)}{q(z)}
= \log p(D \mid \theta) - \sum_z q(z) \log \frac{q(z)}{p(z \mid D, \theta)}

This has its maximum when \sum_z q(z) \log \frac{q(z)}{p(z \mid D, \theta)} = 0, which is the case when q(z) = p(z \mid D, \theta).

p(z|D, θ) is the posterior of z.
\sum_z q(z) \log \frac{q(z)}{p(z \mid D, \theta)} is the Kullback-Leibler (KL) divergence between q(z) and the posterior.
Thus, in the E step we use the current values of the parameters to calculate the posterior of z.
The M step is then problem dependent.

EM algorithm for pLSA
E step (posterior of the latent topic):

p(z \mid w, d) = \frac{p(z, w, d)}{p(w, d)} = \frac{p(d)\, p(z \mid d)\, p(w \mid z)}{\sum_{z'} p(d)\, p(z' \mid d)\, p(w \mid z')} = \frac{p(z \mid d)\, p(w \mid z)}{\sum_{z'} p(z' \mid d)\, p(w \mid z')}

M step (re-estimation of the parameters):

p(w \mid z) \propto \sum_d n(d, w)\, p(z \mid d, w)

p(z \mid d) \propto \sum_w n(d, w)\, p(z \mid d, w)

Example
IPython Notebook examples, with slightly modified code from: https://github.com/hitalex/PLSA
http://kti.tugraz.at/staff/denis/courses/kddm1/plsa.ipynb
Command line: ipython notebook --pylab=inline plsa.ipynb
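Along the lines of the linked notebook (which is not reproduced here), the E and M steps above can be implemented in a few lines of numpy. The following is a minimal sketch under the assumption of a dense document-word count matrix and a fixed number of iterations; all variable names are my own:

import numpy as np

def plsa(n_dw, n_topics, n_iter=100, seed=0):
    """Fit pLSA to a document-word count matrix n_dw with EM.

    Returns p(z|d) with shape (n_docs, n_topics) and
    p(w|z) with shape (n_topics, n_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = n_dw.shape

    # Random initialization, normalized to proper distributions
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E step: p(z|d,w) ∝ p(z|d) p(w|z), shape (n_docs, n_words, n_topics)
        p_z_dw = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        p_z_dw /= p_z_dw.sum(axis=2, keepdims=True) + 1e-12

        # M step: p(w|z) ∝ sum_d n(d,w) p(z|d,w)
        p_w_z = np.einsum('dw,dwz->zw', n_dw, p_z_dw)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12

        # M step: p(z|d) ∝ sum_w n(d,w) p(z|d,w)
        p_z_d = np.einsum('dw,dwz->dz', n_dw, p_z_dw)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12

    return p_z_d, p_w_z

Applied, for instance, to the small rating matrix below with n_topics=2, such a sketch should roughly recover the two groups of movies and users via p(w|z) and p(z|d).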
Example
Movie ratings, rows are users and columns are movies:

User    Matrix  Alien  Star Wars  Casablanca  Titanic
Joe        1      1        1          0          0
Jim        3      3        3          0          0
John       4      4        4          0          0
Jack       5      5        5          0          0
Jill       0      0        0          4          4
Jenny      0      0        0          5          5
Jane       0      0        0          2          2

The same matrix with two additional ratings of Alien:

User    Matrix  Alien  Star Wars  Casablanca  Titanic
Joe        1      1        1          0          0
Jim        3      3        3          0          0
John       4      4        4          0          0
Jack       5      5        5          0          0
Jill       0      2        0          4          4
Jenny      0      0        0          5          5
Jane       0      1        0          2          2

Figure (from Hofmann, 2000): Eight selected factors from a 128-factor decomposition. The displayed word stems are the 10 most probable words in the class-conditional distribution P(w|z), from top to bottom in descending order.

"segment 1"  "segment 2"   "matrix 1"    "matrix 2"    "line 1"    "line 2"   "power 1"  "power 2"
imag         speaker       robust        manufactur    constraint  alpha      POWER      load
SEGMENT      speech        MATRIX        cell          LINE        redshift   spectrum   memori
texture      recogni       eigenvalu     part          match       LINE       omega      vlsi
color        signal        uncertainti   MATRIX        locat       galaxi     mpc        POWER
tissue       train         plane         cellular      imag        quasar     hsup       systolic
brain        hmm           linear        famili        geometr     absorp     larg       input
slice        source        condition     design        impos       high       redshift   complex
cluster      speakerind.   perturb       machinepart   segment     ssup       galaxi     arrai
mri          SEGMENT       root          format        fundament   densiti    standard   present
volume       sound         suci          group         recogn      veloc      model      implement

Document 1: P{z_k | d_1, w_j = "segment"} = (0.951, 0.0001, ...), P{w_j = "segment" | d_1} = 0.06.
Excerpt (occurrences of "segment" in upper case): SEGMENT medic imag challeng problem field imag analysi diagnost base proper SEGMENT digit imag SEGMENT medic imag need applic involv estim boundari object classif tissu abnorm shape analysi contour detec textur SEGMENT despit exist techniqu SEGMENT specif medic imag remain crucial problem [...]

Document 2: P{z_k | d_2, w_j = "segment"} = (0.025, 0.867, ...)

Performance
In IR, retrieval based on this model is typically superior to both the Vector Space Model (cosine similarity) and LSA: "The performance of a retrieval system based on this model (PLSI) was found superior to that of both the vector space based similarity and a non-probabilistic latent semantic indexing (LSI) method." (We skip the details here.)
Figure: Retrieval performance results from Th. Hofmann, 2000.