CS246: Latent Dirichlet Analysis

LSI
* LSI uses SVD to find the best rank-K approximation.
* The result is difficult to interpret, especially with negative numbers.
* Q: Can we develop a more interpretable method?

Theory of LDA (Model-Based Approach)
* Develop a simplified model of how users write a document based on topics.
* Fit the model to the existing corpus and "reverse engineer" the topics used in a document.
* Q: How do we write a document?
* A: (1) Pick the topic(s). (2) Start writing on the topic(s) with related terms.

Two Probability Vectors
* For every document d, we assume that the user first picks the topics to write about.
  - P(z|d): the probability of picking topic z when the user writes each word in document d.
  - \sum_{i=1}^{T} P(z_i | d) = 1   (the document-topic vector of d)
* We also assume that every topic is associated with each term with a certain probability.
  - P(w|z): the probability of picking the term w when the user writes on the topic z.
  - \sum_{j=1}^{W} P(w_j | z) = 1   (the topic-term vector of z)

Probabilistic Topic Model
* There exist T topics.
* The topic-term vector of each topic is set before any document is written: P(wj|zi) is set for every zi and wj.
* Then, for every document d:
  - The user decides the topics to write on, i.e., P(zi|d).
  - For each word in d:
    - The user selects a topic zi with probability P(zi|d).
    - The user selects a word wj with probability P(wj|zi).

Probabilistic Document Model
[Figure: two topic-term vectors P(w|z) (Topic 1: bank, loan, money; Topic 2: river, stream, bank) and three document-topic vectors P(z|d) generate DOC 1 (Topic 1 only, P(z1|d) = 1.0), DOC 2 (a 0.5/0.5 mixture), and DOC 3 (Topic 2 only, P(z2|d) = 1.0); the superscript on each word marks the topic it was drawn from.]

Example: Calculating Probability
* z1 = {w1: 0.8, w2: 0.1, w3: 0.1}
* z2 = {w1: 0.1, w2: 0.2, w3: 0.7}
* d's topics are {z1: 0.9, z2: 0.1}.
* d has three terms {w3^2, w1^1, w2^1}, where the superscript marks the topic each term was drawn from.
* Q: What is the probability that a user will write such a document?
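The probability in the example above is the product, over the three words, of P(z|d) times P(w|z). A small sketch, using exactly the slide's numbers (the function name `doc_probability` is ours, not the lecture's):

```python
# Topic-term vectors P(w|z) and document-topic vector P(z|d) from the slide.
P_w_given_z = {
    "z1": {"w1": 0.8, "w2": 0.1, "w3": 0.1},
    "z2": {"w1": 0.1, "w2": 0.2, "w3": 0.7},
}
P_z_given_d = {"z1": 0.9, "z2": 0.1}

def doc_probability(word_topic_pairs):
    """P(d) = product over words of P(z|d) * P(w|z)."""
    p = 1.0
    for w, z in word_topic_pairs:
        p *= P_z_given_d[z] * P_w_given_z[z][w]
    return p

# d = {w3 drawn from z2, w1 from z1, w2 from z1}
p = doc_probability([("w3", "z2"), ("w1", "z1"), ("w2", "z1")])
print(p)  # (0.1*0.7) * (0.9*0.8) * (0.9*0.1) ≈ 0.004536
```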
Corpus Generation Probability
* T: # topics; D: # documents; M: # words per document.
* Probability of generating the corpus C:
  P(C) = \prod_{i=1}^{D} \prod_{j=1}^{M} P(w_{i,j} | z_{i,j}) P(z_{i,j} | d_i)

Generative Model vs. Inference (1)
[Figure: the generative direction — the document-model picture again, with known P(w|z) (Topic 1: bank, loan, money; Topic 2: river, stream, bank) and known P(z|d) generating DOC 1, DOC 2, and DOC 3.]

Generative Model vs. Inference (2)
[Figure: the inference direction — the same picture with every probability and every topic assignment replaced by "?"; only the documents themselves are observed.]

Probabilistic Latent Semantic Indexing (pLSI)
* Basic idea: pick the P(zj|di), P(wk|zj), and zij values that maximize the corpus generation probability.
  - Maximum-likelihood estimation (MLE)
* More discussion later on how to compute the P(zj|di), P(wk|zj), and zij values that maximize the probability.

Problem of pLSI
* Q: 1M documents, 1000 topics, 1M words, 1000 words/doc. How much input data? How many variables do we have to estimate?
* Q: Too much freedom. How can we avoid the overfitting problem?
* A: Add constraints to reduce the degrees of freedom.

Latent Dirichlet Analysis (LDA)
* When term probabilities are selected for each topic, the topic-term probability vector (P(w1|zj), ..., P(wW|zj)) is sampled randomly from a Dirichlet distribution:
  Dir(p_1, ..., p_W; \beta_1, ..., \beta_W) = \frac{\Gamma(\sum_j (\beta_j + 1))}{\prod_{j=1}^{W} \Gamma(\beta_j + 1)} \prod_{j=1}^{W} p_j^{\beta_j}
* When users select topics for a document, the document-topic probability vector (P(z1|d), ..., P(zT|d)) is sampled randomly from a Dirichlet distribution:
  Dir(p_1, ..., p_T; \alpha_1, ..., \alpha_T) = \frac{\Gamma(\sum_j (\alpha_j + 1))}{\prod_{j=1}^{T} \Gamma(\alpha_j + 1)} \prod_{j=1}^{T} p_j^{\alpha_j}

What Is the Dirichlet Distribution?
* Multinomial distribution: given the probability p_i of each event e_i, what is the probability that each event e_i occurs \alpha_i times after n trials?
  f(\alpha_1, ..., \alpha_k; n, p_1, ..., p_k) = \frac{\Gamma(\sum_i \alpha_i + 1)}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} p_1^{\alpha_1} \cdots p_k^{\alpha_k}
  We assume the p_i's; the distribution assigns the \alpha_i's probability.
* Dirichlet distribution: the "inverse" of the multinomial distribution. We assume the \alpha_i's; the distribution assigns the p_i's probability.
  f(p_1, ..., p_k; \alpha_1, ..., \alpha_k) = \frac{\Gamma(\sum_i (\alpha_i + 1))}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} p_1^{\alpha_1} \cdots p_k^{\alpha_k}

Dirichlet Distribution
* f(p_1, ..., p_k; \alpha_1, ..., \alpha_k) = \frac{\Gamma(\sum_i (\alpha_i + 1))}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} p_1^{\alpha_1} \cdots p_k^{\alpha_k}
* Q: Given \alpha_1, \alpha_2, ..., \alpha_k, what are the most likely p_1, p_2, ..., p_k values?

Normalized Probability Vector and Simplex
* Remember that \sum_{i=1}^{T} P(z_i|d) = 1 and \sum_{j=1}^{W} P(w_j|z) = 1.
* When (p_1, ..., p_n) satisfies p_1 + ... + p_n = 1, the points lie on a "simplex plane."
[Figure: the points (p1, p2, p3) and their 2-simplex plane.]

Effect of \alpha Values
[Figures: Dirichlet densities over the 2-simplex, comparing (\alpha_1, \alpha_2, \alpha_3) = (1,1,1) vs. (1,2,1), then (1,2,1) vs. (1,2,3), then (1,1,1) vs. (5,5,5).]

Minor Correction
* f(p_1, ..., p_k; \alpha_1, ..., \alpha_k) = \frac{\Gamma(\sum_i (\alpha_i + 1))}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} p_1^{\alpha_1} \cdots p_k^{\alpha_k} is not the "standard" Dirichlet distribution.
* The "standard" Dirichlet distribution formula:
  f(p_1, ..., p_k; \alpha_1, ..., \alpha_k) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} p_1^{\alpha_1 - 1} \cdots p_k^{\alpha_k - 1}
* We used the non-standard form to make the connection to the multinomial distribution clear. From now on, we use the standard formula.

Back to the LDA Document Generation Model
* For each topic z: pick the word probability vector, the P(wj|z)'s, by taking a random sample from Dir(\beta_1, ..., \beta_W).
* For every document d: the user decides its topic vector, the P(zi|d)'s, by taking a random sample from Dir(\alpha_1, ..., \alpha_T).
  - For each word in d: the user selects a topic z with probability P(z|d), then selects a word w with probability P(w|z).
* Once all is said and done, we have:
  - P(wj|z): the topic-term vector of each topic
  - P(zi|d): the document-topic vector of each document
  - a topic assignment for every word in each document

Symmetric Dirichlet Distribution
* In principle, we need to assume two vectors, (\alpha_1, ..., \alpha_T) and (\beta_1, ..., \beta_W), as input parameters.
* In practice, we often assume all \alpha_i's are equal to \alpha and all \beta_i's are equal to \beta: two scalar values \alpha and \beta, not two vectors. This is the symmetric Dirichlet distribution.
* Q: What is the implication of this assumption?
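The document generation model above can be sketched directly in code. This is a minimal illustration under the deck's assumptions (symmetric Dirichlet priors, one document at a time); the vocabulary and function names are made up for the example, and the Dirichlet draw uses the standard trick of normalizing independent Gamma samples:

```python
import random

def dirichlet(alphas):
    """Sample from Dir(alphas) by normalizing independent Gamma(a, 1) draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def pick(probs, items):
    """Draw one item according to the given probability vector."""
    r, acc = random.random(), 0.0
    for p, item in zip(probs, items):
        acc += p
        if r <= acc:
            return item
    return items[-1]

def generate_document(vocab, T, alpha, beta, doc_len):
    W = len(vocab)
    # 1. One topic-term vector P(w|z) per topic, sampled from Dir(beta, ..., beta).
    phi = [dirichlet([beta] * W) for _ in range(T)]
    # 2. One document-topic vector P(z|d), sampled from Dir(alpha, ..., alpha).
    theta = dirichlet([alpha] * T)
    # 3. For each word slot: pick a topic z by P(z|d), then a word w by P(w|z).
    doc = []
    for _ in range(doc_len):
        z = pick(theta, list(range(T)))
        doc.append(pick(phi[z], vocab))
    return doc

random.seed(0)
print(generate_document(["river", "stream", "bank", "money", "loan"],
                        T=2, alpha=1.0, beta=1.0, doc_len=8))
```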
Effect of the \alpha Value on the Symmetric Dirichlet
[Figures: symmetric Dirichlet densities over the 2-simplex for \alpha = 2 and \alpha = 6.]
* Q: What does it mean? How will the sampled document-topic vectors change as \alpha grows?
* Common choice: \alpha = 50/T, \beta = 200/W.

Plate Notation
[Figure: plate diagram for LDA — one P(z|d) per document (plate of size N), a topic z and word w per word position (plate of size M), and one P(w|z) per topic (plate of size T).]

LDA as Topic Inference
* Given a corpus:
  d1: w11, w12, ..., w1m
  ...
  dN: wN1, wN2, ..., wNm
* Find the P(z|d), P(w|z), and zij that are most "consistent" with the given corpus.
* Q: How can we compute such P(z|d), P(w|z), and zij?
* The best method so far is to use the Monte Carlo method together with Gibbs sampling.

Monte Carlo Method (1)
* A class of methods that compute a number through repeated random sampling of certain event(s).
* Q: How can we compute \pi?

Monte Carlo Method (2)
1. Define the domain of possible events.
2. Generate the events randomly from the domain using a certain probability distribution.
3. Perform a deterministic computation using the events.
4. Aggregate the results of the individual computations into the final result.
* Q: How can we take random samples from a particular distribution?

Gibbs Sampling
* Q: How can we take a random sample x from the distribution f(x)?
* Q: How can we take a random sample (x, y) from the distribution f(x, y)?
* Gibbs sampling: given the current sample (x1, ..., xn), pick an axis xi and take a random sample of its value while fixing all other coordinates of (x1, ..., xn).
* In practice, we iterate over the xi's sequentially.

Markov-Chain Monte Carlo Method (MCMC)
* Gibbs sampling is in the class of Markov-chain sampling: the next sample depends only on the current sample.
* Markov-chain Monte Carlo method: generate random events using Markov-chain sampling and apply the Monte Carlo method to compute the result.

Applying MCMC to LDA
* Let us apply the Monte Carlo method to estimate the LDA parameters.
* Q: How can we map the LDA inference problem to random events?
* We first focus on identifying the topics {zij} for each word {wij}.
  - Event: assignment of the topics {zij} to the wij's. The assignment should be done according to P({zij}|C).
* Q: How can we sample according to P({zij}|C)? Can we use Gibbs sampling?
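The "how can we compute \pi?" question from the Monte Carlo slides has a classic answer that also illustrates the four steps listed above: sample points uniformly from the unit square (the event domain), check deterministically whether each lands inside the quarter circle, and aggregate the hit fraction. A small sketch (the function name is ours):

```python
import random

def estimate_pi(n_samples=1_000_000, seed=42):
    """Monte Carlo estimate of pi: the fraction of uniform random points
    in the unit square that fall inside the quarter unit circle is pi/4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # deterministic computation per event
            inside += 1
    return 4.0 * inside / n_samples  # aggregate into the final result

print(estimate_pi())  # approximately 3.14; the exact value depends on the seed
```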
How Will It Work?
* Q: What is P(zij | {z-ij}, C)?
  P(z_{ij} = t \mid z_{-ij}, C) \propto \frac{n_{w_{ij},t} + \beta}{\sum_{w=1}^{W} (n_{wt} + \beta)} \cdot \frac{n_{d_i,t} + \alpha}{\sum_{k=1}^{T} (n_{d_i,k} + \alpha)}
* nwt: how many times the word w has been assigned to the topic t.
* ndt: how many words in the document d have been assigned to the topic t.
* Q: What is the meaning of each term?

LDA with Gibbs Sampling
* For each word wij:
  - For the prior topic t of wij, decrease nwt and ndt by 1.
  - Assign wij to a new topic t with probability proportional to
    \frac{n_{w_{ij},t} + \beta}{\sum_{w=1}^{W} (n_{wt} + \beta)} \cdot \frac{n_{d_i,t} + \alpha}{\sum_{k=1}^{T} (n_{d_i,k} + \alpha)}
  - For the new topic t of wij, increase nwt and ndt by 1.
* Repeat the process many times (at least hundreds of times).
* Once the process is over, we have zij for every wij, along with nwt and ndt. Then:
  P(w|t) = \frac{n_{wt} + \beta}{\sum_{i=1}^{W} (n_{w_i t} + \beta)}    P(t|d) = \frac{n_{dt} + \alpha}{\sum_{k=1}^{T} (n_{dtk} + \alpha)}

Result of LDA
* TASA corpus: 37,000 text passages from educational materials collected by Touchstone Applied Science Associates.
* Set T = 300 (300 topics).
[Figures: inferred topics and word-topic assignments from the TASA corpus.]

LDA Algorithm: Simulation
* Two topics: River, Money. Five words: "river", "stream", "bank", "money", "loan".
* True topic-term vectors:
           river  stream  bank  money  loan
  River    1/3    1/3     1/3   -      -
  Money    -      -       1/3   1/3    1/3
* Generate 16 documents by randomly mixing the two topics and using the LDA model.

Generated Documents and Initial Topic Assignment Before Inference
* The first 6 and the last 3 documents are purely from one topic; the others are mixtures.
[Figure: white dot = "River", black dot = "Money".]

Topic Assignment After LDA Inference
* The first 6 and the last 3 documents are purely from one topic.
* The others are mixtures.
[Figure: topic assignments after 64 iterations.]

Inferred Topic-Term Matrix
* Model parameters:
           river  stream  bank  money  loan
  River    0.33   0.33    0.33  -      -
  Money    -      -       0.33  0.33   0.33
* Estimated parameters:
           river  stream  bank  money  loan
  River    0.25   0.40    0.35  -      -
  Money    -      -       0.32  0.29   0.39
* Not perfect, but very close, especially given the small data size.

SVD vs. LDA
* Both perform the same decomposition: [doc x term] = [doc x topic] [topic x term].
* SVD views this as matrix approximation.
* LDA views this as probabilistic inference based on a generative model. Each entry corresponds to a "probability": better interpretability.

LDA as Soft Classification
* Soft vs. hard clustering/classification.
* After LDA, every document is assigned to a small number of topics with some weights. Documents are not assigned exclusively to a topic: soft clustering.

Summary
* Probabilistic topic model
  - Generative model of documents
  - pLSI and overfitting
  - LDA, MCMC, and probabilistic interpretation
* Statistical parameter estimation
  - Multinomial distribution and Dirichlet distribution
  - Monte Carlo method
  - Gibbs sampling
  - Markov-chain class of sampling
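The collapsed Gibbs sampling procedure from the "LDA with Gibbs Sampling" slides can be sketched end to end. This is a minimal illustration under the slides' notation (counts nwt and ndt, scalar \alpha and \beta), not the lecture's reference code; the function name, toy corpus, and default parameters are made up for the example:

```python
import random

def lda_gibbs(docs, T, alpha, beta, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. The per-word update samples
    z_ij = t with probability proportional to
    (n_wt + beta) / sum_w (n_wt + beta) * (n_dt + alpha)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    W = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}

    # Counts: n_wt (word-topic), n_dt (document-topic), n_t (per-topic totals).
    n_wt = [[0] * T for _ in range(W)]
    n_dt = [[0] * T for _ in range(len(docs))]
    n_t = [0] * T

    # Random initial topic assignment z_ij for every word occurrence.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(T)
            z_d.append(t)
            n_wt[wid[w]][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                # Remove the current assignment from the counts.
                t_old = z[d][j]
                n_wt[wid[w]][t_old] -= 1
                n_dt[d][t_old] -= 1
                n_t[t_old] -= 1
                # Sample a new topic. The document denominator
                # sum_k (n_dk + alpha) is the same for every t, so it
                # can be dropped from the unnormalized weights.
                weights = [
                    (n_wt[wid[w]][t] + beta) / (n_t[t] + W * beta)
                    * (n_dt[d][t] + alpha)
                    for t in range(T)
                ]
                t_new = rng.choices(range(T), weights=weights)[0]
                z[d][j] = t_new
                n_wt[wid[w]][t_new] += 1
                n_dt[d][t_new] += 1
                n_t[t_new] += 1

    # Point estimates P(w|t) and P(t|d) from the final counts.
    p_w_given_t = [
        {w: (n_wt[wid[w]][t] + beta) / (n_t[t] + W * beta) for w in vocab}
        for t in range(T)
    ]
    p_t_given_d = [
        [(n_dt[d][t] + alpha) / (len(docs[d]) + T * alpha) for t in range(T)]
        for d in range(len(docs))
    ]
    return p_w_given_t, p_t_given_d
```

On a toy river/money corpus like the simulation slides', `lda_gibbs(docs, T=2, alpha=0.5, beta=0.5)` returns the estimated topic-term and document-topic vectors after the burn-in iterations.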