An Introduction To Matrix Decomposition and Graphical Model
Lei Zhang, Lead Researcher, Microsoft Research Asia
2012-04-17

Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval
  – Modeling threaded discussions

What Is Matrix Decomposition?
• We wish to decompose the matrix A by writing it as a product of two or more matrices:
  A_{n×m} = B_{n×k} C_{k×m}
• Suppose A, B, C are column matrices
  – A_{n×m} = (a1, a2, …, am), each ai is an n-dim data sample
  – B_{n×k} = (b1, b2, …, bk), each bj is an n-dim basis vector, and the space B consists of k bases
  – C_{k×m} = (c1, c2, …, cm), each ci is the k-dim coordinate vector of ai projected onto the space B

Why Do We Need Matrix Decomposition?
• Given one data sample: a1 = B_{n×k} c1, i.e. (a11, a12, …, a1n)^T = (b1, b2, …, bk)(c11, c12, …, c1k)^T
• Another data sample: a2 = B_{n×k} c2
• More data samples: am = B_{n×k} cm
• Together (m data samples): (a1, a2, …, am) = B_{n×k}(c1, c2, …, cm), i.e. A_{n×m} = B_{n×k} C_{k×m}

Why Do We Need Matrix Decomposition?
(a1, a2, …, am) = B_{n×k}(c1, c2, …, cm), i.e. A_{n×m} = B_{n×k} C_{k×m}
• We wish to find a set of new bases B to represent the data samples A, and A becomes C in the new space.
• In general, B captures the common features in A, while C carries the specific characteristics of the original samples.
• In PCA: B is the eigenvectors
• In SVD: B is the right (column) eigenvectors
• In LDA: B is the discriminant directions
• In NMF: B is the local features

PRINCIPAL COMPONENT ANALYSIS

Definition – Eigenvalue & Eigenvector
• Given an m×m matrix C, for any λ and w, if Cw = λw, then λ is called an eigenvalue and w the corresponding eigenvector of C.

Definition – Principal Component Analysis
• Principal Component Analysis (PCA), also known as the Karhunen-Loève transformation (KL transformation)
• Let A be an n×m data matrix in which the rows represent data samples
• Each row is a data vector; each column represents a variable
• A is centered: the estimated mean is subtracted from each column, so each column has zero mean
• Covariance matrix C (m×m): C = A^T A

Principal Component Analysis
• C can be decomposed as follows: C = UΛU^T
• Λ is a diagonal matrix diag(λ1, λ2, …, λn); each λi is an eigenvalue
• U is an orthogonal matrix, each column an eigenvector: U^T U = I, U^{-1} = U^T

Maximizing Variance
• The objective of the rotation transformation is to find the direction of maximal variance.
• The projection of the data along w is Aw.
• Variance: σ²_w = (Aw)^T(Aw) = w^T A^T A w = w^T C w, where C = A^T A is the covariance matrix of the data (A is centered!)
• Task: maximize the variance subject to the constraint w^T w = 1.

Optimization Problem
• Maximize J(w) = w^T C w − λ(w^T w − 1), where λ is the Lagrange multiplier.
• Differentiating with respect to w yields 2Cw − 2λw = 0.
• Eigenvalue equation: Cw = λw, where C = A^T A.
• Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found.

Property: Data Decomposition
• PCA can be treated as data decomposition:
  a = UU^T a
    = (u1, u2, …, un)(u1, u2, …, un)^T a
    = (u1, u2, …, un)(⟨u1, a⟩, ⟨u2, a⟩, …, ⟨un, a⟩)^T
    = (u1, u2, …, un)(b1, b2, …, bn)^T
    = Σ_i bi · ui
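As a concrete illustration (not from the slides), here is a minimal numpy sketch of PCA via the covariance eigendecomposition described above; the function and variable names are illustrative.

```python
import numpy as np

def pca(A, k):
    """PCA of a data matrix A (rows = samples, columns = variables).

    Returns the top-k eigenvectors of the covariance matrix (the new basis)
    and the coordinates of the centered samples in that basis.
    """
    A = A - A.mean(axis=0)                 # center each column (zero mean)
    C = A.T @ A                            # covariance matrix (up to a 1/(n-1) factor)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues in decreasing order
    U = eigvecs[:, order[:k]]              # top-k principal directions (the basis B)
    coords = A @ U                         # coordinates of the samples in the new space (C)
    return U, coords

# toy usage: 100 samples in 5 dimensions, reduced to 2
X = np.random.randn(100, 5)
U, Z = pca(X, k=2)
print(U.shape, Z.shape)                    # (5, 2) (100, 2)
```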
Face Recognition – Eigenface
• Turk, M.A.; Pentland, A.P. Face recognition using eigenfaces, CVPR 1991 (Citations: 2654)
• The eigenface approach
  – images are points in a vector space
  – use PCA to reduce dimensionality
  – face space
  – compare projections onto face space to recognize faces

PageRank – Power Iteration
• Column j has nonzero elements in the positions corresponding to the outlinks of page j (Nj in total).
• Row i has nonzero elements in the positions corresponding to the inlinks Ii.

Column-Stochastic & Irreducible
• Column-stochastic: every column of the link matrix sums to one, so the matrix describes a Markov chain over pages.
• Irreducible: every page can be reached from every other page (in PageRank this is ensured by adding a random-jump term).

Iterative PageRank Calculation
• For k = 1, 2, …: r_k = A r_{k−1} (power iteration).
• Equivalently Ar = λr with λ = 1 (A is a Markov chain transition matrix).
• Why can we use power iteration to find the first eigenvector?

Convergence of the Power Iteration
• Expand the initial approximation r0 in terms of the eigenvectors: r0 = c1 u1 + c2 u2 + …. Then A^k r0 = c1 λ1^k u1 + c2 λ2^k u2 + …, and since λ1 = 1 dominates the other eigenvalues, the iterates converge to the direction of the first eigenvector u1.

SINGULAR VALUE DECOMPOSITION

SVD – Definition
• Any m×n matrix A, with m ≥ n, can be factorized as A = UΣV^T, where U and V have orthonormal columns and Σ is diagonal.

Singular Values and Singular Vectors
• The diagonal elements σj of Σ are the singular values of the matrix A.
• The columns of U and V are the left singular vectors and right singular vectors respectively.
• Equivalent form of SVD: A = Σ_j σj uj vj^T.

Matrix Approximation
• Theorem: Let Uk = (u1 u2 … uk), Vk = (v1 v2 … vk) and Σk = diag(σ1, σ2, …, σk), and define Ak = Uk Σk Vk^T.
• Then ‖A − Ak‖ ≤ ‖A − B‖ for any matrix B of rank k.
• This means that the best rank-k approximation of the matrix A is Ak.

SVD and PCA
• We can write: A^T A = VΣ²V^T.
• Remember that in PCA, we treat A as a row matrix.
• V is just the eigenvectors for A
  – Each column in V is an eigenvector of the row matrix A
  – we use V to approximate a row in A
• Equivalently, we can write: AA^T = UΣ²U^T.
• U is just the eigenvectors for A^T
  – Each column in U is an eigenvector of the column matrix A
  – we use U to approximate a column in A

Example – LSI
• Build a term-by-document matrix A.
• Compute the SVD of A: A = UΣV^T.
• Approximate A by Ak = Uk Σk Vk^T = Uk Dk
  – Uk: orthogonal basis that we use to approximate all the documents
  – Dk: column j holds the coordinates of document j in the new basis
  – Dk is the projection of A onto the subspace spanned by Uk

SVD and PCA
• For symmetric A, SVD is closely related to PCA.
• PCA: A = UΛU^T
  – U and Λ are the eigenvectors and eigenvalues.
• SVD: A = UΛV^T
  – U is the left (column) eigenvectors
  – V is the right (row) eigenvectors
  – Λ is the same eigenvalues
• For symmetric A, the column eigenvectors equal the row eigenvectors.
• Note the difference of A in PCA and SVD
  – SVD: A is directly the data, e.g. a term-by-document matrix
  – PCA: A is a covariance matrix, A = X^T X, where each row in X is a sample

Latent Semantic Indexing (LSI)
1. Document file preparation / preprocessing:
   – Indexing: collecting terms
   – Use a stop list: eliminate "meaningless" words
   – Stemming
2. Construction of the term-by-document matrix, sparse matrix storage.
3. Query matching: distance measures.
4. Data compression by low-rank approximation: SVD.
5. Ranking and relevance feedback.

Latent Semantic Indexing
• Assumption: there is some underlying latent semantic structure in the data.
• E.g. car and automobile occur in similar documents, as do cows and sheep.
• This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower-dimensional space using SVD.
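As a concrete illustration (not from the slides), here is a minimal numpy sketch of LSI: a rank-k SVD approximation of a term-by-document matrix and a query projected into the latent space. The toy matrix, query, and names are illustrative.

```python
import numpy as np

def lsi(A, k):
    """Rank-k approximation of a term-by-document matrix A via SVD (LSI)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]                  # k-dim "concept" basis for the terms
    Dk = np.diag(s[:k]) @ Vt[:k]   # coordinates of each document in the new basis
    return Uk, Dk

# toy usage: 6 terms, 4 documents, 2 latent dimensions
A = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
Uk, Dk = lsi(A, k=2)

# project a query (term-space vector) onto the latent space and rank documents
q = np.array([1, 0, 0, 0, 1, 0], dtype=float)
q_k = Uk.T @ q
sims = (Dk.T @ q_k) / (np.linalg.norm(Dk, axis=0) * np.linalg.norm(q_k) + 1e-12)
print(sims)    # cosine similarity of the query to each document
```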
Similarity Measures
• Term to term: AA^T = UΣ²U^T = (UΣ)(UΣ)^T; UΣ are the coordinates of A (rows) projected to space V.
• Document to document: A^T A = VΣ²V^T = (VΣ)(VΣ)^T; VΣ are the coordinates of A (columns) projected to space U.

Similarity Measures
• Term to document: A = UΣV^T = (UΣ^{1/2})(VΣ^{1/2})^T; UΣ^{1/2} are the coordinates of A (rows) projected to space V; VΣ^{1/2} are the coordinates of A (columns) projected to space U.

HITS (Hyperlink-Induced Topic Search)
• Idea: the Web includes two flavors of prominent pages:
  – authorities contain high-quality information,
  – hubs are comprehensive lists of links to authorities.
• A page is a good authority if many hubs point to it.
• A page is a good hub if it points to many authorities.
• Good authorities are pointed to by good hubs, and good hubs point to good authorities.
• (Figure: hubs on one side pointing to authorities on the other.)

Power Iteration
• Each page i has both a hub score hi and an authority score ai.
• HITS successively refines these scores: the authority score of a page is the sum of the hub scores of the pages linking to it, and the hub score is the sum of the authority scores of the pages it links to.
• Define the adjacency matrix L of the directed web graph: Lij = 1 if page i links to page j, and 0 otherwise.
• Now, in matrix form: a ← L^T h, h ← L a.

HITS and SVD
• L: rows are outlinks, columns are inlinks.
• a will be the dominant eigenvector of the authority matrix L^T L.
• h will be the dominant eigenvector of the hub matrix L L^T.
• h and a are in fact the first left and right singular vectors of L!
• We are in fact running SVD on the adjacency matrix.

HITS vs PageRank
• PageRank may be computed once; HITS is computed per query.
• HITS takes the query into account; PageRank doesn't.
• PageRank has no concept of hubs.
• HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot.
• PageRank is more stable, because of its random-jump step.

NMF – NON-NEGATIVE MATRIX FACTORIZATION

Definition
• Given a non-negative matrix V_{n×m}, find non-negative matrix factors W_{n×k} and H_{k×m} such that V_{n×m} ≈ W_{n×k} H_{k×m}.
• V: column matrix, each column is a data sample (n-dimensional).
• W: each of its k columns represents one basis.
• H: coordinates of V projected onto W, i.e. vj ≈ W_{n×k} hj.

Motivation
• Non-negativity is natural in many applications...
• Probability is also non-negative.
• Additive model to capture local structure.

Multiplicative Update Algorithm
• Cost function: Euclidean distance ‖V − WH‖².
• Multiplicative update:
  H_{aμ} ← H_{aμ} (W^T V)_{aμ} / (W^T W H)_{aμ}
  W_{ia} ← W_{ia} (V H^T)_{ia} / (W H H^T)_{ia}

Multiplicative Update Algorithm
• Cost function: divergence D(A‖B)
  – Reduces to the Kullback-Leibler divergence when Σ_{ij} Aij = Σ_{ij} Bij = 1, i.e. A and B can be regarded as normalized probability distributions.
• Multiplicative update:
  H_{aμ} ← H_{aμ} [Σ_i W_{ia} V_{iμ}/(WH)_{iμ}] / Σ_i W_{ia}
  W_{ia} ← W_{ia} [Σ_μ H_{aμ} V_{iμ}/(WH)_{iμ}] / Σ_μ H_{aμ}
• pLSA is NMF with the KL divergence.

NMF vs PCA
• n = 2429 faces, m = 19×19 pixels.
• Positive values are illustrated with black pixels and negative values with red pixels.
• NMF: parts-based representation.
• PCA: holistic representation.
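As a concrete illustration (not from the slides), here is a minimal numpy sketch of the Lee & Seung multiplicative updates for the Euclidean cost shown above; the iteration count and toy data are illustrative.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9):
    """Non-negative factorization V ≈ W H via multiplicative updates (Euclidean cost)."""
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update the coordinates H
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update the basis W
    return W, H

# toy usage: factor a random non-negative 20x30 matrix with 5 bases
V = np.abs(np.random.randn(20, 30))
W, H = nmf(V, k=5)
print(np.linalg.norm(V - W @ H))               # reconstruction error
```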
Reference
• D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS, 2001.
• D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999).

Major Reference
• Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended)

Outline
• Basic concepts
  – Likelihood, i.i.d.
  – ML, MAP and Bayesian inference
  – Expectation-Maximization
  – Mixture-of-Gaussians parameter estimation
• pLSA
  – Motivation
  – Derivation & geometric properties
  – Applications
• LDA
  – Motivation – why add a hyperparameter
  – Dirichlet distribution
  – Variational EM
  – Relations with other topic models
  – Incorporating category information
• Summary

Not Included
• General graphical model theories
• Markov random fields (belief propagation)
• Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning?

Data
• Let x = (x1, x2, …, xD)^T denote a data point, and D = {x^(1), x^(2), …, x^(N)} a data set. D is sometimes associated with desired outputs y1, y2, ….

Predictions
• We are generally interested in predicting something based on the observed data set.
• Given D, what can we say about x^(N+1)?

Model
• To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ.
• Given data D, we learn the model parameters θ, from which we can predict new data points.
• The model can often be expressed as a probability distribution over data points.

Likelihood Function
• Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data.
• Inversely, given the observed data and a model of interest, the likelihood function is defined as L(θ) = f_θ(x) = p(x|θ).
• That is, the likelihood function L(θ) shows that some parameters are more likely to have produced the data.

Maximum Likelihood (ML)
• Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from this model.
• Suppose we are given n data samples (x1, x2, …, xn).
• Maximum likelihood finds θ_ML = argmax_θ L(θ) = argmax_θ p(x1, …, xn | θ).
• Predictive distribution: p(x_new | D) ≈ p(x_new | θ_ML).

I.I.D. – Independent, Identically Distributed
• I.I.D. means p(x1, …, xn | θ) = Π_{i=1}^n p(xi | θ).
• The problem is considerably simplified as: L(θ) = Π_i p(xi | θ).
• Usually, the log likelihood is used: l(θ) = log L(θ) = Σ_i log p(xi | θ).

Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 slides)
• Gregor Heinrich, Parameter estimation for text analysis, Technical Note, 2005-2008

EXPECTATION MAXIMIZATION

Why Do We Need EM?
• The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent-variable models.
• Why do we need latent variables?
  – To describe complex models, e.g. the Gaussian mixture model.
  – To discover the intrinsic structure inside a data set, e.g. topic models such as pLSA and LDA.

More General
• Data set D, latent variables X, parameters θ.
• Likelihood: p(D|θ) = ∫ p(D, X|θ) dX (an integral, or a sum for discrete latent variables).
• Goal: learn maximum likelihood (ML) parameter values.
• The maximum likelihood procedure finds parameters θ such that θ_ML = argmax_θ p(D|θ).
• Because of the integral (or sum) over latent variables, the likelihood can be a very complicated, hard-to-optimize function.

The Expectation Maximization (EM) Algorithm
• The EM algorithm finds a (local) maximum of a latent-variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:
• E step: fill in values of the latent variables according to their posterior given the data.
• M step: maximize the likelihood as if the latent variables were not hidden.
• It decomposes difficult problems into a series of tractable steps.
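As a concrete illustration (not from the slides), here is a minimal numpy sketch of the two EM steps for a two-component 1-D Gaussian mixture; the initialization and the toy data are illustrative.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture: returns means, variances, weights."""
    # arbitrary initial parameter values
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: posterior responsibility of each component for each point
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters as if the responsibilities were observed
        Nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return mu, var, pi

# toy usage: data drawn from two Gaussians
x = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 0.5, 200)])
print(em_gmm_1d(x))
```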
Jensen's Inequality
• For any distribution q and non-negative function g: log Σ_x q(x) g(x) ≥ Σ_x q(x) log g(x), since log is concave.

Lower Bounding the Log Likelihood
• Observed data D = {yn}; latent variables X = {xn}; parameters θ.
• Goal: maximize the log likelihood (i.e. ML learning) wrt θ:
  L(θ) = log p(D|θ) = log ∫ p(D, X|θ) dX.
• Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
  L(θ) = log ∫ q(X) [p(D, X|θ) / q(X)] dX ≥ ∫ q(X) log p(D, X|θ) dX + H[q] =: F(q, θ),
• where H[q] is the entropy of q(X).

The E and M Steps of EM
• The lower bound on the log likelihood is given by F(q, θ) = ∫ q(X) log p(D, X|θ) dX + H[q].
• EM alternates between:
• E step: optimize F(q, θ) wrt the distribution over hidden variables, holding the parameters fixed: q^(k)(X) = argmax_q F(q, θ^(k−1)).
• M step: maximize F(q, θ) wrt the parameters, holding the hidden distribution fixed: θ^(k) = argmax_θ F(q^(k), θ).

The E Step
• E step: for fixed θ, F(q, θ) = log p(D|θ) − KL(q(X) ‖ p(X|D, θ)).
• The second term is the Kullback-Leibler divergence.
• This means that, for fixed θ, F is bounded above by L, and achieves that bound when KL(q(X) ‖ p(X|D, θ)) = 0.
• So the E step simply sets q^(k)(X) = p(X|D, θ^(k−1)), which makes the bound tight.

The M Step
• M step: maximize F(q, θ) wrt the parameters, holding the hidden distribution q fixed:
  θ^(k) = argmax_θ ∫ q^(k)(X) log p(D, X|θ) dX.
• The entropy of q(X) does not depend directly on θ, so it drops out of the maximization.
• The specific form of the M step depends on the model. Often the maximum wrt θ can be found analytically.

EM Never Decreases the Likelihood
• The E and M steps together never decrease the log likelihood:
  L(θ^(k−1)) = F(q^(k), θ^(k−1)) ≤ F(q^(k), θ^(k)) ≤ L(θ^(k)).
• The E step brings F(q, θ) up to the likelihood L(θ).
• The M step maximizes F(q, θ) wrt θ.
• F(q, θ) ≤ L(θ) by Jensen – or, equivalently, from the non-negativity of KL.

Reference
• Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised Learning, Lecture 5 slides)
• Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.

WHY DO WE NEED GRAPHICAL MODELS?

Why Do We Need Graphical Models?
• Cons
  – Graphical models can become quite complex, even with only a few nodes...
  – We have to make many assumptions.
• Pros
  – We do need probability to explain our world, but joint probabilities are hard to compute.
  – Graphical models can help us analyze and understand our problems.
  – Graphs are an intuitive way of representing and visualizing the relationships between many variables.
  – With a graphical model, we can decompose a joint probability into conditional probabilities, which are usually easier to work with.

Directed Acyclic Graphical Models (Bayesian Networks)
• A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution:
  p(A, B, C, D, E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D)
• In general: p(X1, …, Xn) = Π_i p(Xi | pa(i)), where pa(i) are the parents of node i.

Directed Graphs for Statistical Models: Plate Notation
• Example: a data set of N points generated from a Gaussian; the repeated nodes x1, …, xN are drawn once inside a plate labeled N.

PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review
• For natural-language queries, simple term matching does not work effectively
  – Ambiguous terms
  – The same queries vary due to personal styles
• Latent semantic indexing
  – Creates a 'latent semantic space' (hidden meaning)
• LSI puts documents together even if they don't have common words, as long as the docs share frequently co-occurring terms
• Disadvantages:
  – A statistical foundation is missing

pLSA – Probabilistic Latent Semantic Analysis
• Automated document indexing and information retrieval
• Identification of latent classes using an Expectation-Maximization (EM) algorithm
• Shown to solve
  – Polysemy
    • Java could mean "coffee" and also the programming language Java
    • Cricket is a "game" and also an "insect"
  – Synonymy
    • "computer", "pc", "desktop" all could mean the same
• Has a better statistical foundation than LSA

pLSA
• (Plate diagram: for each of M documents d, each of the Nd word positions has a latent topic z and an observed word w.)
• z1, …, zN are latent variables, zi ∈ [1, K], where K is the number of latent topics.

pLSA
• (Figure: the model unrolled over documents d1, …, dM, each with its own latent topics and words.)
• The topic-word distributions p(w|z=1), p(w|z=2), …, p(w|z=K) are shared by all documents.

Likelihood: Joint Probability vs Likelihood
• Joint probability: p(d, w) = p(d) Σ_z p(w|z) p(z|d)
• Likelihood (only for observed variables): L = Σ_d Σ_w n(d, w) log p(d, w)
• p(d) is assumed to be uniform.

Document Decomposition
• Each document can be decomposed as: p(w|d) = Σ_z p(w|z) p(z|d).
• This is similar to matrix decomposition, if we consider each discrete distribution as a vector: p(w|d) = Z_{V×K} p(z|d).
• With many documents, we hope to find the latent topics as a common basis.

pLSA – Objective Function
• pLSA tries to maximize the log likelihood: L = Σ_d Σ_w n(d, w) log [Σ_z p(w|z) p(z|d)] (up to a constant from p(d)).
• Due to the summation over z inside the log, we have to resort to EM.

EM Steps
• E-Step
  – The expectation of the likelihood function is calculated with the current parameter values, i.e. the posterior p(z|d, w) is computed.
• M-Step
  – Update the parameters with the calculated posterior probabilities.
  – Find the parameters that maximize the likelihood function.

Lower Bounding the Log Likelihood
• As in general EM, a distribution over the latent topic z gives a lower bound on the log likelihood, which is tight when that distribution equals the posterior p(z|d, w).

EM Steps
• The E-Step: p(z|d, w) = p(w|z) p(z|d) / Σ_{z'} p(w|z') p(z'|d)
• The M-Step: p(w|z) ∝ Σ_d n(d, w) p(z|d, w); p(z|d) ∝ Σ_w n(d, w) p(z|d, w)
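As a concrete illustration (not from the slides), here is a minimal numpy sketch of the pLSA EM iterations above, run on a small count matrix; the data and names are illustrative and the loop is not optimized.

```python
import numpy as np

def plsa(N, K, n_iter=100, eps=1e-12):
    """pLSA via EM. N is a (documents x words) count matrix, K the number of topics."""
    D, W = N.shape
    rng = np.random.default_rng(0)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # p(w|z)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)   # p(z|d)
    for _ in range(n_iter):
        # E-step: posterior p(z|d,w) for every (d, w) pair
        post = p_z_d[:, :, None] * p_w_z[None, :, :]          # shape (D, K, W)
        post /= post.sum(axis=1, keepdims=True) + eps
        # M-step: re-estimate p(w|z) and p(z|d) from expected counts n(d,w) p(z|d,w)
        expected = N[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_w_z, p_z_d

# toy usage: 8 documents, 12 words, 2 topics
N = np.random.poisson(1.0, size=(8, 12)).astype(float)
p_w_z, p_z_d = plsa(N, K=2)
print(p_w_z.shape, p_z_d.shape)   # (2, 12) (8, 2)
```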
Latent Subspace

pLSA vs LSA
• LSA and pLSA both perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In pLSA, by having K aspects
• Comparison to SVD
  – The U matrix is related to p(z|d) (doc to aspect)
  – The V matrix is related to p(w|z) (aspect to term)
  – The Σ matrix is related to p(z) (aspect strength)

pLSA vs LSA
• The main difference is the way the approximation is done.
• pLSA generates a model (the aspect model) and maximizes its predictive power.
• Selecting the proper value of K is heuristic in LSA.
• Model selection in statistics can determine the optimal K in pLSA.

Applications
• Text mining / topic discovery
• Scene classification

Text Mining (example figure)

Scene Classification (example figure)

Classification Result (example figure)

Reference
• Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999.
• Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).
• Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA
• pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion.
• The number of parameters in the model grows linearly with M (the number of documents in the training set).

Problems in pLSA
• There is no constraint on the distributions p(z|di): each document d1, …, dm has its own free topic mixture p(z|di).
• This easily leads to serious problems with over-fitting.

Dirichlet Distribution
• In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution.
• Requirements for such a distribution:
  – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one; that is, the samples are multinomial parameter vectors.
  – Easy to optimize.
• The Dirichlet distribution is one such distribution.
• The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution
• Definition:
  p(x1, x2, …, xk | α1, α2, …, αk) = [Γ(Σ_{i=1}^k αi) / Π_{i=1}^k Γ(αi)] Π_{i=1}^k xi^{αi − 1},
  s.t. xi ≥ 0, Σ_{i=1}^k xi = 1.
• The density is zero outside this open (K − 1)-dimensional simplex.

Example Dirichlet Distributions (K=3)
• Various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6).

Example Dirichlet Distributions (K=3)
• Equal αi, different α0 = Σ_{i=1}^k αi: α0 = 0.1, α0 = 1, α0 = 10.

The LDA Model
• (Figure: several documents, each with words w1…w4 generated from topics z1…z4; the topic-word parameters β are shared.)
• For each document:
  – Choose θ ~ Dirichlet(α).
  – For each of the N words wn:
    • Choose a topic zn ~ Multinomial(θ).
    • Choose a word wn from p(wn|zn, β), a multinomial probability conditioned on the topic zn.
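As a concrete illustration (not from the slides), here is a minimal numpy sketch that samples a toy corpus from the LDA generative process just described; the corpus size and hyperparameter values are illustrative.

```python
import numpy as np

def generate_lda_corpus(M, N, K, V, alpha=0.5, eta=0.1, seed=0):
    """Sample M documents of N words each from the LDA generative process.

    K topics, vocabulary size V; alpha and eta are symmetric Dirichlet parameters.
    """
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)      # topic-word distributions p(w|z)
    docs = []
    for _ in range(M):
        theta = rng.dirichlet(np.full(K, alpha))       # theta ~ Dirichlet(alpha)
        words = []
        for _ in range(N):
            z = rng.choice(K, p=theta)                 # z_n ~ Multinomial(theta)
            w = rng.choice(V, p=beta[z])               # w_n ~ p(w | z_n, beta)
            words.append(w)
        docs.append(words)
    return docs, beta

docs, beta = generate_lda_corpus(M=5, N=20, K=3, V=50)
print(docs[0])    # word indices of the first sampled document
```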
Joint Probability
• Given parameters α and β:
  p(θ, z, w | α, β) = p(θ|α) Π_{n=1}^N p(zn|θ) p(wn|zn, β)

Likelihood
• Joint probability: p(θ, z, w | α, β) as above.
• Marginal distribution of a document: p(w|α, β) = ∫ p(θ|α) [Π_{n=1}^N Σ_{zn} p(zn|θ) p(wn|zn, β)] dθ.
• Likelihood over all the documents: p(D|α, β) = Π_{d=1}^M p(w_d | α, β).

Inference
• The likelihood can be computed by summing over each document.
• Jensen's inequality is applied as in EM.

Inference
• In the E-step, we need to compute the posterior distribution of the hidden variables, p(θ, z | w, α, β).
• Unfortunately, this distribution is intractable to compute in general.
• We have to resort to a variational approach.

Variational Inference
• In variational inference, we consider a simplified graphical model with variational parameters γ, φ and minimize the KL divergence between the variational and posterior distributions.

Variational Inference
• The difference between the lower bound and the likelihood is the KL divergence.
• Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs EM
• They differ only in the E-step.
• In standard EM, q(X) is set directly to p(X|D, θ), making KL = 0.
• In VBEM, it is intractable to compute p(X|D, θ). Instead, p(X|D, θ) is approximated by a variational distribution q(X) by minimizing KL(q(X) ‖ p(X|D, θ)).
• This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation
• Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.
• Strategy (variational EM):
  – Lower-bound log p(w|α, β) by a function L(γ, φ; α, β).
  – Repeat until convergence:
    • E: maximize L(γ, φ; α, β) with respect to the variational parameters γ, φ.
    • M: maximize the bound with respect to the model parameters α and β.

Parameter Estimation
• E-Step: variational inference – repeat until convergence:
  φ: φ_{ni} ∝ β_{i, wn} exp(Ψ(γi))
  γ: γi = αi + Σ_{n=1}^N φ_{ni}
• M-Step: parameter estimation
  β: β_{ij} ∝ Σ_d Σ_n φ_{dni} w_{dn}^j
  α: can be estimated using the Newton-Raphson method.

Topic Examples in a 100-Topic LDA Model
• 16,000 documents from a subset of the TREC AP corpus.

Classification (50-topic LDA + SVM)
• Reuters-21578 dataset: contains 8,000 documents and 15,818 words.
• (a) EARN vs. NOT EARN. (b) GRAIN vs. NOT GRAIN.

Problems in LDA
• The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.
• (Figure: the same per-document topic/word structure as in the LDA model diagram, with shared β.)

A Bayesian Hierarchical Model for Learning Natural Scene Categories
• Incorporating category information.
• (Plate diagram: variables θ, π, z, x, β over Nd patches and M images.)

Codebook
• 174 local image patches.
• Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector.
• Representation: normalized 11×11 gray values, 128-dim SIFT.

Topic Distribution in Different Categories (figure)

Topic Hierarchical Clustering (figure)

More Topic Models
• Dynamic topic models, ICML 2006
• Correlated topic model, NIPS 2005
• Hierarchical Dirichlet process, Journal of the American Statistical Association 2003
• Nonparametric Bayes pachinko allocation, UAI 2007
• Supervised LDA, NIPS 2007
• MedLDA – Maximum Margin Discriminant LDA, ICML 2009
• …

Are You Really into Graphical Models?
• Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

Reference
• David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003.
• Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University of Cambridge, 1998.
• L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR, 2005.
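As a concrete illustration (not from the slides), here is a minimal sketch of fitting an LDA topic model with scikit-learn's variational-Bayes implementation, assuming scikit-learn is installed; the toy corpus and parameter values are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the stock market fell today",
    "investors worry about the market",
]

# document-term count matrix (documents are rows here)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# fit a 2-topic LDA model by variational Bayes
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)           # per-document topic proportions
print(doc_topic.round(2))

# top words per topic
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```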
Outline
• Matrix Decomposition
  – PCA, SVD, NMF
  – LDA, ICA, Sparse Coding, etc.
• Graphical Model
  – Basic concepts in probabilistic machine learning
  – EM
  – pLSA
  – LDA
• Two Applications
  – Document decomposition for "long query" retrieval, ICCV 2009
  – Modeling threaded discussions, SIGIR 2009

Large-Scale Indexing for "Long Query" Retrieval (Similarity Search)
Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum
ICCV 2009

The Long Query Problem
• If a query contains 1000 keywords:
  – We need to access 1000 inverted lists.
  – The intersection of 1000 inverted lists may be empty.
  – The union of 1000 inverted lists may be the whole corpus.
• Dimension reduction?
• (Example figure: an image's term vector (Term1, Term2, …, TermN; dim = 1 million) is projected to a topic feature vector (f1, f2, …, fM; dim = 200).)

Key Idea: Dimension Reduction + Residual Error Preservation
  p ≈ Xw + ε
• (Figure: an image = a low-dimensional projection + a few words (about 10 words).)
• p: original TF-IDF vector in vocabulary space
• X: projection matrix for dimension reduction
• w: low-dimensional feature vector
• ε: residual error

Orthogonal Decomposition
• In components: (p1, …, pW)^T ≈ (X1, X2, X3, …, Xk)(w1, …, wk)^T + (ε1, …, εW)^T, where X1, …, Xk are the base vectors (columns of X), w is the low-dimensional representation, and ε is the residual.
• (Figure: an image = a low-dimensional projection + a few words (about 10 words).)
• Because the decomposition is orthogonal, the similarity of two vectors p and q is preserved: p^T q ≈ w_p^T w_q + ε_p^T ε_q.
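As a concrete illustration (not taken from the paper), here is a minimal numpy sketch of the general idea above: project a vocabulary-space vector onto a fixed basis and keep only the largest residual entries ("a few words"). The basis, data, and the least-squares projection used here are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def decompose(p, X, n_residual_words=10):
    """Split a vocabulary-space vector p into X @ w plus a sparse residual."""
    # least-squares coordinates of p in the basis X (columns = base vectors)
    w, *_ = np.linalg.lstsq(X, p, rcond=None)
    residual = p - X @ w
    # keep only the largest-magnitude residual entries ("a few words")
    keep = np.argsort(np.abs(residual))[::-1][:n_residual_words]
    sparse_residual = np.zeros_like(residual)
    sparse_residual[keep] = residual[keep]
    return w, sparse_residual

# toy usage: vocabulary of 5000 terms, 200-dim projection basis
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 200))
p = rng.random(5000)
w, r = decompose(p, X)
print(w.shape, np.count_nonzero(r))   # (200,) 10
```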
A Probabilistic Implementation
• C. Chemudugunta, et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006.
• x is a switch variable. It controls whether a word is generated from:
  – a topic-specific distribution,
  – a document-specific distribution, or
  – a background distribution.
• p(w|d) = p(x=0|d) Σ_{k=1}^K p(w|z=k) p(z=k|d) + p(x=1|d) p(w|special, d) + p(x=2|d) p(w|background)

Search (Online)
• (Figure: a query = a low-dimensional projection + a few words. The low-dimensional part is matched against document metadata through an LSH index; the residual words are matched through small per-document lists (DS1, DS2, …); candidate documents (e.g. Doc 300, Doc 401) are then re-ranked.)
• Index: 10M images, 4.6 GB.
• Search speed: < 100 ms.

Search Example
• (Figures: two query images and their retrieved results.)

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications
Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-jing Wang, Wei Wang, Lei Zhang
SIGIR 2009

Semantic & Structure
• Semantic: topics.
• Structure: who replies to whom.
• Optimize them together: model the semantics and the structure jointly.

Reply Reconstruction
• Document similarity
• Topic similarity
• Structure similarity

Baselines
• NP: reply to the nearest post.
• RR: reply to the root.
• DS: document similarity.
• LDA: Latent Dirichlet Allocation – project documents onto the topic space.
• SWB: Special Words Topic Model with Background distribution – project documents onto the topic and junk-topic space.

Evaluation Method

  Method   Slashdot (All Posts)   Slashdot (Good Posts)   Apple (All Posts)   Apple (Good Posts)
  NP       0.021                  0.012                   0.289               0.239
  RR       0.183                  0.319                   0.269               0.474
  DS       0.463                  0.643                   0.409               0.628
  LDA      0.465                  0.644                   0.410               0.648
  SWB      0.463                  0.644                   0.410               0.641
  SMSS     0.524                  0.737                   0.517               0.772

Expert Finding
• Methods: HITS, PageRank, …
• Pipeline: reply reconstruction, then network construction, then expert finding.

Baselines
• LM: Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06. Achieves stable performance in the expert-finding task using a language model.
• PageRank: benchmark nodal ranking method.
• HITS: finds hub nodes and authority nodes.
• EABIF: Personalized Recommendation Driven by Information Flow, SIGIR '06. Finds the most influential nodes.

Evaluation
• Bayesian estimate.

  Method           MRR     MAP     P@10
  LM               0.821   0.698   0.800
  EABIF (ori.)     0.674   0.362   0.243
  EABIF (rec.)     0.742   0.318   0.281
  PageRank (ori.)  0.675   0.377   0.263
  PageRank (rec.)  0.743   0.321   0.266
  HITS (ori.)      0.906   0.832   0.900
  HITS (rec.)      0.938   0.822   0.906

Summary
• Matrices and probability are fundamental mathematics in information retrieval and computer vision.
  – Matrix decomposition is a good practice for learning matrices.
  – Graphical models are a good practice for learning probability.
• Graphical models are a good tool for analyzing problems.
• The essence of decomposition is to discover a set of mid-level features to describe the original documents/images.
• Graphical models are more adaptable to various applications than matrix decomposition.