COLING 2012 Tutorial, December 08, 2012
Revisiting Dimensionality Reduction Techniques for NLP
Jagadeesh Jagarlamudi (University of Maryland), Raghavendra Udupa (Microsoft Research India)

Road Map
• Introduction: NLP and Dimensionality Reduction; Mathematical Background
• Data with Single View: Techniques; Applications; Advanced Topics
• Data with Multiple Views: Techniques; Applications; Advanced Topics
• Summary

NLP and Dimensionality Reduction

Dimensionality Reduction: Motivation
• Many applications involve high-dimensional (and often sparse) data
• High-dimensional data poses several challenges: computational cost, difficulty of interpretation, overfitting
• However, data often lies (approximately) on a low-dimensional manifold embedded in the high-dimensional space

Dimensionality Reduction: Goal
• Given high-dimensional data, discover the underlying low-dimensional structure
• [Figure: 560-dimensional face data and its 2D embedding; He et al., Face Recognition Using Laplacianfaces]

Dimensionality Reduction: Benefits
• Computational efficiency – k-nearest-neighbor search
• Data compression – less storage; millions of data points in RAM
• Data visualization – 2D and 3D scatter plots
• Latent structure and semantics
• Feature extraction – removing distracting variance from data sets

Dimensionality Reduction: Techniques
• Projective methods – find low-dimensional projections that extract useful information from the data by maximizing a suitable objective function – PCA, ICA, LDA
• Manifold-modeling methods – find a low-dimensional subspace that best preserves the manifold structure in the data, by modelling the manifold structure – LLE, Isomap, Laplacian Eigenmaps

Dimensionality Reduction: Relevance to NLP
• High-dimensional data in NLP – text documents, context vectors
• How can dimensionality reduction help? – 'semantic' similarity of documents, correlating semantically related terms, cross-lingual similarity

Mathematical Background

Linear Transformation
• [Illustration of a linear transformation]

Data Centering
• Dataset: X = [x_1, ..., x_n] ∈ R^{d×n}
• Mean: μ = (1/n) Σ_{i=1}^n x_i
• Centering: x_i ← x_i − μ; centered dataset X = [x_1, ..., x_n]
• Mean after centering: (1/n) Σ_i (x_i − μ) = μ − μ = 0
• Mean after a linear transformation A of the centered data: (1/n) Σ_i A(x_i − μ) = Aμ − Aμ = 0

Data Variance
• Dataset X = [x_1, ..., x_n] ∈ R^{d×n}, centered so that (1/n) Σ_i x_i = 0
• Variance: (1/n) Σ_i ||x_i||^2 = (1/n) Tr(X X^T) = Tr(C_X), where C_X = (1/n) X X^T is the sample covariance
• Centering doesn't change the data variance
• Transformed dataset AX; variance after transformation: (1/n) Σ_i ||A x_i||^2 = (1/n) Tr(A X X^T A^T) = Tr(A C_X A^T)

Positive Definite Matrices
• Real (M ∈ R^{p×q}), square (p = q), symmetric (M_ij = M_ji), positive (x^T M x > 0 for all x ≠ 0)
• Examples: the identity matrix; [[1, 1], [1, 5]]; C_X; A C_X A^T
• Cholesky decomposition: M = L L^T

Eigenvalues and Eigenvectors
• M ∈ R^{p×p}; M u = λ u, where u is a vector (eigenvector) and λ is a scalar (eigenvalue); {λ_i} are the eigenvalues of M
• Trace: Tr(M) = Σ_{i=1}^p M_ii = Σ_i λ_i
• Rank: number of non-zero eigenvalues

Eigensystem of Positive-Definite Matrices
• M ∈ R^{p×p}
• Positive eigenvalues: λ_j > 0; real-valued eigenvectors: u_j ∈ R^p
• Orthonormal eigenvectors: λ_i ≠ λ_j ⇒ u_i^T u_j = 0 and u_i^T u_i = 1 (i.e., U^T U = I)
• Full rank: Rank(M) = p
• Eigendecomposition: M = U Λ U^T

Data Variance and Eigenvalues
• Centered dataset X = [x_1, ..., x_n], x_i ∈ R^d
• Data variance: (1/n) Σ_i ||x_i||^2 = Tr(C_X)
• Eigendecomposition C_X = U Λ U^T, so the data variance is Tr(C_X) = Σ_i λ_i
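A minimal NumPy sketch of the centering and variance identities above, on synthetic data; it is not from the tutorial itself, and the variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200
X = rng.normal(size=(d, n))          # dataset, one column per point

mu = X.mean(axis=1, keepdims=True)   # mean
Xc = X - mu                          # centering: x_i <- x_i - mu

C = (Xc @ Xc.T) / n                  # sample covariance C_X = (1/n) X X^T
var = (Xc ** 2).sum() / n            # (1/n) sum_i ||x_i||^2

lam, U = np.linalg.eigh(C)           # eigendecomposition C_X = U Lambda U^T
# variance = Tr(C_X) = sum_i lambda_i
print(np.allclose(var, np.trace(C)), np.allclose(var, lam.sum()))
```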
Data with Single View: Techniques
• Principal Components Analysis
• Singular Value Decomposition
• Oriented Principal Components Analysis

Principal Components Analysis (PCA)
• Centered dataset: X = [x_1, ..., x_n], x_i ∈ R^d
• Goal: find an orthonormal linear transformation T: R^d → R^k that maximizes the data variance
  – T(x) = Ax (linear transformation), A A^T = I (orthonormal basis), Tr(A C_X A^T) (data variance)
• Mathematical formulation: A* = argmax_{A ∈ R^{k×d}, A A^T = I} Tr(A C_X A^T)

PCA: Solution
• Eigendecomposition of C_X: C_X = U Λ U^T, with U = [u_1, u_2, ..., u_d], Λ = diag(λ_1, λ_2, ..., λ_d), λ_1 ≥ λ_2 ≥ ... ≥ λ_d
• A = [u_1, u_2, ..., u_k]^T, and T(x) = Ax = [u_1, u_2, ..., u_k]^T x
• MATLAB function: princomp()
• [Figure: 2D PCA embeddings of product-review data; Red: Books, Green: Kitchen, Blue: DVD, Magenta: Electronics]

PCA: Solution (contd.)
• Data variance after transformation: with A = [u_1, ..., u_k]^T,
  A C_X A^T = [u_1, ..., u_k]^T U Λ U^T [u_1, ..., u_k] = diag(λ_1, ..., λ_k), so Tr(A C_X A^T) = Σ_{i=1}^k λ_i
• Contribution of the j-th component to the data variance: λ_j / Σ_{i=1}^k λ_i

PCA: Properties
• PCA decorrelates the dataset: A C_X A^T = diag(λ_1, ..., λ_k)
• PCA gives the rank-k reconstruction with minimum squared error
• PCA is sensitive to the scaling of the original features
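A small PCA sketch in NumPy following the formulation above (the slides point to MATLAB's princomp(); this is an assumed equivalent on synthetic data, not the authors' code):

```python
import numpy as np

def pca(X, k):
    """Top-k PCA of a d x n dataset whose columns are data points."""
    Xc = X - X.mean(axis=1, keepdims=True)     # center the data
    C = (Xc @ Xc.T) / X.shape[1]               # covariance C_X
    lam, U = np.linalg.eigh(C)                 # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:k]          # indices of the k largest eigenvalues
    A = U[:, order].T                          # A = [u_1, ..., u_k]^T
    explained = lam[order] / lam.sum()         # share of the total variance per component
    return A @ Xc, A, explained

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 300))
Z, A, explained = pca(X, k=2)                  # 2-D embedding, e.g. for a scatter plot
```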
Singular Value Decomposition (SVD)
• Dataset: X = [x_1, ..., x_n], x_i ∈ R^d
• X = U Σ V^T = Σ_{i=1}^r σ_i u_i v_i^T, with r = rank(X)
  – U ∈ R^{d×d} such that U^T U = I (left singular vectors)
  – V ∈ R^{n×d} such that V^T V = I (right singular vectors)
  – Σ = diag(σ_1, ..., σ_d) ∈ R^{d×d} (singular values)
• Low-rank approximation: X_k = U Σ_k V^T = Σ_{i=1}^k σ_i u_i v_i^T, with Σ_k = diag(σ_1, ..., σ_k, 0, ..., 0), k ≤ d

SVD and Data Sphering
• Centered dataset X with X = U Σ V^T, so X X^T = U Σ^2 U^T = Σ_{i=1}^r σ_i^2 u_i u_i^T; note that ||(1/σ_j) u_j|| = 1
• Let U_k = [u_1, ..., u_k], V_k = [v_1, ..., v_k], Σ_k = diag(σ_1, ..., σ_k), k ≤ r
• Σ_k^{-1} U_k^T X X^T U_k Σ_k^{-1} = I, i.e., A X X^T A^T = I with A = Σ_k^{-1} U_k^T
• The linear transformation A = Σ_k^{-1} U_k^T decorrelates (spheres) the dataset

SVD and Eigen Decomposition
• X = U Σ V^T
  – X X^T = U Σ V^T V Σ U^T = U Σ^2 U^T (eigendecomposition)
  – X^T X = V Σ U^T U Σ V^T = V Σ^2 V^T (eigendecomposition)
• SVD and PCA: SVD on centered X is the same as PCA on X

Oriented Principal Components Analysis (OPCA)
• Generalization of PCA: along with the signal covariance C_X, a noise covariance C_N is available
• When C_N = I (white noise), OPCA = PCA
• Seeks projections that maximize the ratio of the projected signal variance to the projected noise variance
• Mathematical formulation: A* = argmax_{A ∈ R^{k×d}, A C_N A^T = I} Tr(A C_X A^T)

OPCA: Solution
• Generalized eigenvalue problem: C_X U = C_N U Λ
• Equivalent eigenvalue problem: C_N^{-1/2} C_X C_N^{-1/2} V = V Λ, where V = C_N^{1/2} U
• U = [u_1, u_2, ..., u_d], Λ = diag(λ_1, λ_2, ..., λ_d), λ_1 ≥ λ_2 ≥ ... ≥ λ_d
• A = [u_1, u_2, ..., u_k]^T, and T(x) = Ax = [u_1, u_2, ..., u_k]^T x
• MATLAB function: eig()

OPCA: Properties
• The projections remain the same when the noise and signal vectors are globally scaled with two different scale factors
• The projected data is not necessarily uncorrelated
• Can be extended to multi-view data [Platt et al., EMNLP 2010]
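An OPCA sketch under the formulation above, using SciPy's symmetric-definite generalized eigensolver as a stand-in for MATLAB's eig(); the toy covariances are synthetic assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def opca(Cx, Cn, k):
    """Oriented PCA: top-k solutions of the generalized problem C_X u = lambda C_N u."""
    lam, U = eigh(Cx, Cn)                     # eigh(a, b) solves a u = lambda b u
    order = np.argsort(lam)[::-1][:k]
    return U[:, order].T                      # A = [u_1, ..., u_k]^T; with C_N = I this is plain PCA

rng = np.random.default_rng(2)
S = rng.normal(size=(10, 100))                # toy "signal" samples
N = 0.1 * rng.normal(size=(10, 100))          # toy "noise" samples
Cx = S @ S.T / 100
Cn = N @ N.T / 100 + 1e-6 * np.eye(10)        # regularize so C_N is positive definite
A = opca(Cx, Cn, k=3)
```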
Data with Single View: Applications
• Word Sense Discrimination
• Part-of-Speech Tagging
• Information Retrieval

Data Representation
• Vector space characteristics: what are the features? what are the feature weights? (similarity is a distance in the feature space)

Popular Feature Space Models [Turney and Pantel 2010]
• Vector Space Model – a document is represented as a bag of words; features: words; feature weight: TF(w_i, d) or some variant
• Word Space Model – a word is represented in terms of its context words; features: words (with or without position); feature weight: Freq(w_j, w_i)

Curse of Dimensionality
• We have observations x_i ∈ R^d, and d is usually huge
  – Vector space models: d = vocabulary size (number of words in a language)
  – Word space models: d = vocabulary size (if position is ignored), or d = V × L, where L is the window length

Word Sense Discrimination [Schütze 1998]
• Identify which occurrences of a word have the same meaning – different from word sense disambiguation – and doesn't need external knowledge
• E.g., "suit" (focus word):
  C1 ... they buried him in his best suit ...
  C2 ... the family brought suit against the landlord ...
  C3 ... judge dismisses suit against yelp ...
  C4 ... the right suit size ...
• Analysis: Group 1 = {C1, C4}, Group 2 = {C2, C3}
• Testing: "... filed suit in small claims court ..." → Group 1 or Group 2?

Word Sense Discrimination: Approach
• Aim: cluster contexts based on their meaning
• Steps: 1. word vectors – represent each word as a point in vector space; 2. context vectors – represent each context as a point; 3. sense vectors – cluster the points using a clustering algorithm
• Vector space: words as features; the feature weight is the co-occurrence strength; dimensionality reduction is applied to the word vectors

Word Sense Discrimination: 1. Word Vectors
• Represent each word in terms of its context words, e.g. (co-occurrence counts with the features "legal" and "clothes"):
  word       legal  clothes
  judge       210     75
  robe         50    250
  law         240     50
  suit        147    157
  dismisses    96    152

Word Sense Discrimination: 2. Context Vectors
• A context vector is the centroid of the word vectors of all the words in the context, e.g. C3 = centroid of the vectors for judge, dismisses, law, suit

Word Sense Discrimination: 3. Sense Vectors
• Cluster all the context vectors; the cluster centroids are the sense vectors
  – Sense vector 1: centroid of {C2, C3}; sense vector 2: centroid of {C1, C4}
• Testing: assign a new context (e.g., "... filed suit in small claims court ...") to the closest sense vector

Word Sense Discrimination: Dimensionality Reduction
• Reduce the dimensionality of the word-vector matrix W: W = U Σ V^T, W_new ← [u_1, ..., u_k]

Word Sense Discrimination: Results & Discussion [Schütze 1998]
• Averaged results on 20 words (accuracy):
  χ² weighting, terms: 76    χ² weighting, SVD: 90
  Frequency, terms:    81    Frequency, SVD:    88
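A toy end-to-end sketch of the three steps above (word vectors via SVD, context centroids, clustering). It uses scikit-learn's KMeans as the clustering algorithm and the tiny co-occurrence table above; none of this is Schütze's actual implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

# toy word-by-feature co-occurrence matrix W (rows: words, columns: context features)
vocab = ["judge", "robe", "law", "suit", "dismisses"]
W = np.array([[210., 75.], [50., 250.], [240., 50.], [147., 157.], [96., 152.]])

# 1. word vectors: SVD of W, keep the top-k left singular vectors
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2
word_vec = {w: U[i, :k] for i, w in enumerate(vocab)}

# 2. context vectors: centroid of the word vectors occurring in the context
def context_vector(words):
    return np.mean([word_vec[w] for w in words if w in word_vec], axis=0)

contexts = [["judge", "dismisses", "law", "suit"], ["suit", "robe"], ["law", "suit"]]
C = np.vstack([context_vector(c) for c in contexts])

# 3. sense vectors: cluster the context vectors; the centroids are the sense vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(C)
senses = km.cluster_centers_

# testing: assign a new context to the closest sense vector
test = context_vector(["dismisses", "suit"])
print(np.argmin(np.linalg.norm(senses - test, axis=1)))
```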
Part-of-Speech (POS) Tagging [Schütze 1995; Lamar et al. 2010]
• Given a sentence, label words with their POS tags: I/NN ate/VB an/DT apple/NN ./.
• Unsupervised approaches attempt to cluster words and align each cluster with a POS tag; they do not assume a dictionary of tags

Part-of-Speech Tagging: Approach
• Steps: 1. represent words in an appropriate vector space (with dimensionality reduction); 2. cluster using your favorite algorithm
• The vector space should capture syntactic properties: use the most frequent d words as features, with the frequency of a word in the context as the feature weight

Part-of-Speech Tagging: Pass 1
• Construct left and right context matrices L and R of size V × d
• Dimensionality reduction – get rank-r1 approximations:
  L = U_L Σ_L V_L^T, L* = U_L* Σ_L*, L** ← normalized L*
  R = U_R Σ_R V_R^T, R* = U_R* Σ_R*, R** ← normalized R*
• D = [L** R**] is a V × 2r1 matrix; run weighted k-means on D with k1 clusters

Part-of-Speech Tagging: Pass 2
• The pass-1 clusters are not optimal because of sparsity
• Construct new context matrices L_new and R_new of size V × k1
• Dimensionality reduction – get rank-r2 approximations as in pass 1
• D = [L** R**] is a V × 2r2 matrix; run weighted k-means on D

Part-of-Speech Tagging: Results [Lamar et al. 2010]
• Penn Treebank (1.1M tokens, 43K types), with 17 and 45 tags:
                      PTB17        PTB45
  SVD2                0.730        0.660
  HMM-EM              0.647        0.621
  HMM-VB              0.637        0.605
  HMM-GS              0.674        0.660
  HMM-Sparse(32)      0.702 (2.2)  0.654 (1.0)
  VEM(10^-1, 10^-1)   0.682 (0.8)  0.546 (1.7)

Part-of-Speech Tagging: Discussion
• Sensitivity to parameters; scaling with the singular values
• k-means: weighted k-means, with clusters initialized to the most frequent word types
• Non-disambiguating tagger; a very simple algorithm
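A sketch of pass 1 above on random stand-in context counts. It uses scikit-learn's weighted k-means with its default initialization (the paper initializes clusters to the most frequent word types); everything here is an illustrative assumption, not the Lamar et al. code:

```python
import numpy as np
from sklearn.cluster import KMeans

def reduce_and_normalize(M, r):
    """Rank-r SVD of a context matrix, scaled by the singular values and row-normalized."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Mr = U[:, :r] * s[:r]                          # M* = U* Sigma*
    norms = np.linalg.norm(Mr, axis=1, keepdims=True)
    return Mr / np.maximum(norms, 1e-12)           # M** = row-normalized M*

rng = np.random.default_rng(3)
V, d, r1, k1 = 1000, 500, 50, 17                   # vocab size, features, rank, clusters
L = rng.poisson(0.05, size=(V, d)).astype(float)   # left-context counts (stand-in)
R = rng.poisson(0.05, size=(V, d)).astype(float)   # right-context counts (stand-in)

D = np.hstack([reduce_and_normalize(L, r1), reduce_and_normalize(R, r1)])  # V x 2*r1
freq = L.sum(axis=1) + R.sum(axis=1) + 1.0         # word frequencies as k-means weights
labels = KMeans(n_clusters=k1, n_init=5, random_state=0).fit_predict(D, sample_weight=freq)
```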
Information Retrieval
• Rank documents d in response to a query q
• Vector space model: query and document are represented as bags of words; features: words; feature weight: TF-IDF
• Lexical gap: polysemy and synonymy

Information Retrieval: Lexical Gap
• Term × document matrix C (TF-IDF weighting is better!):
           d1  d2  d3  d4  d5  d6
  ship      1   0   1   0   0   0
  boat      0   1   0   0   0   0
  ocean     1   1   0   0   0   0
  voyage    1   0   0   1   1   0
  trip      0   0   0   1   0   1

Information Retrieval: Latent Semantic Analysis [Deerwester 1988; Dumais 2005]
• Term × document matrix C_{V×D}
• Steps:
  1. Dimensionality reduction of the term × document matrix: C = U Σ V^T, C_k = U_k Σ_k V_k^T
     – U_k (V × k) is a representation of the terms; V_k (D × k) is a representation of the documents
     – A document column satisfies d_orig = U_k Σ_k d_red, so d_red = Σ_k^{-1} U_k^T d_orig
  2. Folding in queries: q_red = Σ_k^{-1} U_k^T q
  3. Semantic similarity: Score(q_orig, d_orig) ← cos(q_red, d_red) = ⟨q_red, d_red⟩ / (|q_red| |d_red|), where ⟨., .⟩ denotes the dot product

Information Retrieval: Lexical Gap Revisited
• New 2-dimensional document representations for the term × document matrix above:
           d1     d2     d3     d4     d5     d6
  Dim 1  -1.62  -0.60  -0.44  -0.97  -0.70  -0.26
  Dim 2  -0.46  -0.84  -0.30   1.00   0.35   0.65

Information Retrieval: Results & Discussion [Hofmann 1999]
• Results on four collections:
             MED   CRAN  CACM  CISI
  Cos+tfidf  49    35.2  21.9  20.2
  LSA        64.6  38.7  23.8  21.9
  PLSI-U     69.5  38.9  25.3  23.3
  PLSI-Q     63.2  38.6  26.6  23.1
• New documents can be folded in as well, but the representation deviates from the optimal as more documents are added
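A small LSA sketch implementing the three steps above on the toy term-document matrix; a minimal NumPy illustration, not a reference LSA implementation:

```python
import numpy as np

def lsa(C, k):
    """Rank-k LSA of a term x document matrix C; returns a fold-in transform and reduced docs."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Uk, inv_Sk = U[:, :k], np.diag(1.0 / s[:k])
    fold_in = lambda v: inv_Sk @ Uk.T @ v          # v_red = Sigma_k^{-1} U_k^T v_orig
    return fold_in, fold_in(C)                     # reduced documents, one column each

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# the small term x document example above (terms: ship, boat, ocean, voyage, trip)
C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)
fold_in, D = lsa(C, k=2)
q = np.array([1, 0, 1, 0, 0.0])                    # query containing "ship" and "ocean"
scores = [cos(fold_in(q), D[:, j]) for j in range(D.shape[1])]
```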
Data with Single View: Advanced Topics
• Non-linear Dimensionality Reduction
• Neural Embeddings

Non-linear Dimensionality Reduction
• Locally linear but globally non-linear; e.g., Locally Linear Embedding (LLE), Laplacian Eigenmaps
• Locally Linear Embedding: 1. find the neighbors of each point x_i; 2. reconstruct x_i from its neighbors x_j, x_k with weights w_ij, w_ik; 3. find low-dimensional points y_i that preserve the same reconstruction weights
• Laplacian Eigenmaps:
  – Weight matrix W with similarities over a local neighbourhood
  – D_ii = Σ_j W_ij and L = D − W
  – argmin_u u^T L u subject to u^T D u = I, which leads to the generalized eigenvalue problem L u = λ D u
  – Note that u^T L u = (1/2) Σ_ij W_ij (u_i − u_j)^2
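A Laplacian eigenmaps sketch following the construction above (Gaussian similarities on a k-nearest-neighbour graph, then the generalized eigenproblem L u = λ D u). The neighbourhood size and kernel width are arbitrary assumptions:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def laplacian_eigenmaps(X, k_dim=2, n_neighbors=5, sigma=1.0):
    """Embed the rows of X by solving L u = lambda D u on a kNN similarity graph."""
    dist = cdist(X, X)
    W = np.exp(-dist ** 2 / (2 * sigma ** 2))          # similarities
    mask = np.zeros_like(W, dtype=bool)                # keep only nearest neighbours
    idx = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]
    for i, js in enumerate(idx):
        mask[i, js] = True
    W = np.where(mask | mask.T, W, 0.0)                # symmetrize the neighbourhood graph
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))                         # D_ii = sum_j W_ij
    L = D - W                                          # graph Laplacian
    lam, U = eigh(L, D)                                # generalized problem L u = lambda D u
    return U[:, 1:k_dim + 1]                           # skip the trivial constant eigenvector

Y = laplacian_eigenmaps(np.random.default_rng(4).normal(size=(100, 10)))
```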
Neural Embeddings [Bengio et al. 2003]
• Dimensionality reduction with neural networks
• Task: statistical language modeling – model the next word given the context
  – "The cat is walking in the bedroom": (the cat is → walking), (cat is walking → in), (is walking in → the), (walking in the → bedroom), ...
• Each word is represented as a vector of size m
  – Input: the concatenation of the context word vectors (a vector of length 3m for a three-word context)
  – Hidden layer of length h, with non-linearity introduced by tanh
  – Output layer of length V giving p(w_t = i | context), e.g. high probability for "walking" after "the cat is", and for "in" after "cat is walking"
• Learning [Bengio et al. 2003; Collobert and Weston 2008]
  – Optimize so that the log-likelihood is maximized (gradient ascent)
  – Learns the parameters and the word vectors simultaneously; the learned word vectors capture semantics
  – Can also learn to perform multiple tasks simultaneously

Data with Multiple Views: Techniques

Canonical Correlation Analysis (CCA)
• Centered datasets: X = [x_1, ..., x_n] ∈ R^{d1×n}, Y = [y_1, ..., y_n] ∈ R^{d2×n}
• Project X and Y along a ∈ R^{d1} and b ∈ R^{d2}: s = (a^T x_1, ..., a^T x_n)^T, t = (b^T y_1, ..., b^T y_n)^T
• Data correlation after transformation:
  cos(s, t) = s^T t / sqrt(s^T s · t^T t) = Σ_i a^T x_i b^T y_i / sqrt(Σ_i (a^T x_i)^2 · Σ_i (b^T y_i)^2) = a^T X Y^T b / sqrt(a^T X X^T a · b^T Y Y^T b)

CCA: Training
• a*, b* = argmax_{a,b} cos(X^T a, Y^T b) = argmin_{a,b} ||X^T a − Y^T b||^2 (for unit-length projections)
• [Figure: projecting the two views onto directions (a_1, b_1) and (a_2, b_2) so that the paired points line up]

CCA (contd.)
• Covariance matrices: C_XY = X Y^T, C_X = X X^T, C_Y = Y Y^T
• Correlation in terms of covariance matrices: cos(s, t) = a^T C_XY b / sqrt(a^T C_X a · b^T C_Y b)
• Directions that maximize the data correlation: a*, b* = argmax_{a,b} a^T C_XY b / sqrt(a^T C_X a · b^T C_Y b)

CCA: Formulation
• Goal: find linear transformations A*, B* that maximize the data correlation
• Optimization problem: A*, B* = argmax_{A,B} Tr(A^T X Y^T B), subject to Tr(A^T X X^T A) = 1 and Tr(B^T Y Y^T B) = 1

CCA: Solution
• Generalized eigenvalue problem: C_XY B = C_XX A Λ_X and C_XY^T A = C_YY B Λ_Y; it can be shown that Λ_X = Λ_Y = Λ
• B = C_YY^{-1} C_XY^T A Λ^{-1}, and C_XY C_YY^{-1} C_XY^T A = C_XX A Λ^2
• MATLAB function: canoncorr()
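A CCA sketch in NumPy/SciPy that solves the last equation above, C_XY C_YY^{-1} C_XY^T A = C_XX A Λ², as a symmetric-definite generalized eigenproblem. The regularization term and the synthetic two-view data are assumptions of the sketch (MATLAB's canoncorr() or sklearn.cross_decomposition.CCA play the same role):

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y, k, reg=1e-6):
    """Top-k CCA directions for centered d1 x n and d2 x n views X and Y."""
    n = X.shape[1]
    Cxx = X @ X.T / n + reg * np.eye(X.shape[0])
    Cyy = Y @ Y.T / n + reg * np.eye(Y.shape[0])
    Cxy = X @ Y.T / n
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)            # C_XY C_YY^{-1} C_XY^T (symmetric)
    lam2, A = eigh(M, Cxx)                           # generalized eigenproblem, lam2 = rho^2
    order = np.argsort(lam2)[::-1][:k]
    A = A[:, order]
    rho = np.sqrt(np.maximum(lam2[order], 1e-12))
    B = np.linalg.solve(Cyy, Cxy.T @ A) / rho        # B = C_YY^{-1} C_XY^T A Lambda^{-1}
    return A, B

rng = np.random.default_rng(5)
Z = rng.normal(size=(3, 500))                        # shared latent signal
X = np.vstack([Z, rng.normal(size=(4, 500))])
Y = np.vstack([Z[::-1], rng.normal(size=(5, 500))])
X -= X.mean(axis=1, keepdims=True); Y -= Y.mean(axis=1, keepdims=True)
A, B = cca(X, Y, k=3)
```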
Data with Multiple Views: Applications
• Bilingual Document Projections
• Mining Word-level Translations

Bilingual Document Projections
• Training data: n aligned document pairs; task: identify aligned document pairs among new documents
• Applications: comparable and parallel document retrieval; cross-language text categorization
• Steps:
  1. Represent each document as a vector – two different vector spaces, one per language; features: the most frequent 20K content words; feature weight: TF-IDF
     – Training data: x_i ∈ R^{d1} a bag of English words, y_i ∈ R^{d2} a bag of Hindi words, (x_i, y_i), i = 1 ... n; X = [x_1 ... x_n], Y = [y_1 ... y_n]
  2. Use CCA to find linear transformations A and B
  3. Find new aligned documents using A and B – Score(x, y) ← cos(Ax, By) = ⟨Ax, By⟩ / (|Ax| |By|)

Bilingual Document Projections: Results & Discussion [Platt et al. 2010]
                        Accuracy  MRR
  OPCA                  72.55     77.34
  Word-by-word          70.33     74.67
  CCA                   68.94     73.78
  Word-by-word (5000)   67.86     72.36
  CL-LSI                53.02     61.30
  Untranslated          46.92     53.83
  CPLSA                 45.79     51.30
  JPLSA                 33.22     36.19

Mining Word-Level Translations
• Training data: word-level seed translations, e.g.
  English    Spanish    P(s|e)
  state      estado     0.5
  state      declarar   0.3
  society    sociedad   0.4
  society    compañía   0.35
  company    sociedad   0.8
• Task: mine translations for new words (e.g., translations of "stability"?); resources: monolingual comparable corpora
• Applications: lexicon induction for resource-poor languages; mining translations for unknown words in MT
• Steps:
  1. Prepare training data of word pairs – reduce many-to-many alignments to one-to-one (state–estado, society–sociedad, company–compañía)
  2. Represent each word as a vector – two feature spaces, one per language; features: context words (word space model) and orthography; feature weights: TF-IDF; can be computed using ONLY comparable corpora; (x_i, y_i), i = 1 ... n; X = [x_1 ... x_n], Y = [y_1 ... y_n]
  3. Use CCA to find transformations A and B
  4. Use A and B to mine new word translations – Score(e, s) = cos(A x_e, B y_s) = ⟨A x_e, B y_s⟩ / (|A x_e| |B y_s|)

Mining Word-Level Translations: Results & Discussion [Haghighi et al. 2008]
• Seed lexicon of size 100, with bootstrapping:
             p0.1   p0.25  p0.33  p0.50  Best-F1
  EditDist   58.6   62.6   61.1          47.4
  Ortho      76.0   81.3   80.1   52.3   55.0
  Context    91.1   81.3   80.2   65.3   58.0
  Both       87.2   89.7   89.0   89.7   72.0
• Results are lower for other language pairs

Mining Word-Level Translations: Results & Discussion [Daumé and Jagarlamudi 2011]
• Mining translations for unknown (OOV) words for MT domain adaptation; MT accuracies (BLEU) and positive changes:
                        News   Emea   Subs   PHP
  French baseline       23.00  26.62  10.26  38.67
  French +ve change      0.80   1.44   0.13   0.28
  German baseline       27.30  40.46  16.91  28.12
  German +ve change      0.36   1.51   0.61   0.68
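A sketch of step 3 of the bilingual document task: project both views with CCA transformations A and B and align by cosine similarity. The matrices below are random stand-ins for real TF-IDF vectors and trained projections:

```python
import numpy as np

def align(X, Y, A, B):
    """Score(x, y) = cos(Ax, By); for each column of X, return the best-matching column of Y."""
    P, Q = A @ X, B @ Y                                    # project both views into the shared space
    P = P / (np.linalg.norm(P, axis=0, keepdims=True) + 1e-12)
    Q = Q / (np.linalg.norm(Q, axis=0, keepdims=True) + 1e-12)
    S = P.T @ Q                                            # cosine similarity matrix
    return S.argmax(axis=1), S

rng = np.random.default_rng(6)
A, B = rng.normal(size=(10, 100)), rng.normal(size=(10, 80))   # stand-ins for k x d CCA projections
X, Y = rng.normal(size=(100, 5)), rng.normal(size=(80, 5))     # English / Hindi document vectors
best, S = align(X, Y, A, B)
```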
Data with Multiple Views: Advanced Topics
• Supervised Semantic Indexing
• Discriminative Reranking
• Multilingual Hashing

Supervised Semantic Indexing [Bai et al. 2009]
• Task: learn to rank ads a for a given document d
• Training data: pairs of webpages and clicked ads (d, a+), and randomly chosen pairs (d, a−)
• Steps:
  1. Represent an ad a and a document d as vectors – bag-of-words representation; features: words; feature weights: TF-IDF; a and d are vectors of size V
  2. Learn a scoring function f(a, d) = d^T W a, with V × V parameters W
     – W = I: cosine similarity; W = D (diagonal): reweighting of words
     – W = U^T V + I: dimensionality reduction, with different treatment for ads and documents
     – W = U^T U + I: dimensionality reduction, with the SAME treatment for ads and documents
     – Learning: max-margin, require f(d, a+) − f(d, a−) > 1; objective min_W Σ_{(d,a+,a−)} max(0, 1 − f(d, a+) + f(d, a−)); optimized by sub-gradient descent
  3. Rank ads for a given document – compute the score f(a, d) and rank

Supervised Semantic Indexing: Results & Discussion [Bai et al. 2009]
• 1.9M pairs for training, 100K pairs for testing:
                           Parameters  Rank Loss
  TFIDF                                45.60
  SSI: W = U^T V_10k + I   50×10k      25.83
  SSI: W = U^T V_20k + I   50×20k      26.68
  SSI: W = U^T V_30k + I   50×30k      26.98
• Ranking Wikipedia pages for queries (rank loss):
                              K=5   K=10  K=20
  TFIDF                       21.6  14.0  9.14
  αLSI + (1−α)TFIDF           14.2  9.73  6.36
  SSI: W = U^T U_30k + I      4.80  3.10  1.87
  SSI: W = U^T V_30k + I      4.37  2.91  1.80
• Performs better when the training data is big

Discriminative Reranking [Jagarlamudi and Daumé 2012]
• Example [Thede & Harper, 99]: input sentence "Buyers stepped in to the futures pit .", reference output NNS VBD IN TO DT NNS NN .
• Candidate outputs y_ij, j = 1 ... m_i, with model scores and losses:
  -0.1947  NNS VBD RP TO DT NNS NN .   (loss 0.12)
  -6.8068  NNS VBD RB TO DT NNS NN .   (loss 0.12)
  -7.0514  NNS VBD IN TO DT NNS NN .   (loss 0)
  -7.1408  NNS VBD RP TO DT NNS VB .   (loss 0.25)
  -13.752  NNS VBD RB TO DT NNS VB .   (loss 0.25)
• Approach: find a subspace that respects the preferences; the features φ(x_i, y_ij) are independent
• The reranker operates in the outer product space [Szedmak et al. 2006; Wang et al. 2007]: the joint feature vector has length d1 × d2, and the weight vector is constrained to w_ij = a_i b_j [Bai et al. 10]

Low-Dimensional Reranking
• Find A and B such that argmax_j cos(A^T x, B^T y_j) picks the reference output y
• Idea: 1. score each candidate as a^T x_i y_j^T b; 2. add constraints to penalize incorrect candidates: score(x_i, y_i) ≥ score(x_i, y_ij) + 1 − ξ_i / L_ij, i.e. margins m_ij ≥ 1 − ξ_i / L_ij [Tsochantaridis et al. 04]
• Discriminative: argmax_{a,b,ξ≥0} (1−λ) a^T X Y^T b − λ Σ_i ξ_i, subject to a^T X X^T a = 1, b^T Y Y^T b = 1, and m_ij ≥ 1 − ξ_i / L_ij
• Softened-Disc: argmax_{a,b} (1−λ) a^T X Y^T b + λ Σ_ij L_ij m_ij, subject to the length constraints

Discriminative Model (algorithm)
  α_ij = L_ij                                   // initialization
  repeat
    A, B ← Softened-Disc(X, Y, α_ij)            // get the current solution
    m_ij = a^T x_i y_i^T b − a^T x_i y_ij^T b   // compute margins
    ψ_ij = (1 − m_ij) L_ij                      // potential slack
    ξ_i = min{ψ_ij : ψ_ij > 0}                  // compute slack
    if ξ_i > 0:
      d_ij = m_ij − (1 − ξ_i / L_ij)
      α_ij ← α_ij − γ d_ij                      // update the Lagrangian variables
                                                // otherwise the slack doesn't change
  until convergence

POS Tagging with Reranking
• Combine the reranker with the Viterbi score (the interpolation parameter is tuned)
• Training: input sentence and reference tag sequences; candidates with score and loss values; testing: rerank the candidate list
• Results (tagging accuracy):
                   English  Chinese  French  Swedish
  Baseline         96.15    92.31    97.41   93.23
  Collins          96.06    92.81    97.35   93.44
  Regularized      96.00    92.88    97.38   93.35
  Softened-Disc    96.32    92.87    97.53   93.24
  Discriminative   96.3     92.91    97.53   93.36
  Oracle           98.39    98.19    99.00   96.48
• Improvements over the baseline:
                   English  Chinese  French  Swedish
  Softened-Disc    +0.17    +0.56    +0.12   +0.01
  Discriminative   +0.15    +0.6     +0.12   +0.13
  Softened-Disc*   +0.92    +4.31    +1.12   +0.08
  Discriminative*  +0.88    +4.77    +0.9    +0.73
• Interpolation with the Viterbi score is crucial
• Softened-Disc is independent of the number of training examples, easy to code, and can be solved exactly [Jagarlamudi and Daumé 2012]
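A minimal sketch of the low-rank SSI variant above (W = U^T V + I) with one hinge-loss sub-gradient step per (d, a+, a−) triple; the class layout, learning rate, and random data are illustrative assumptions, not the Bai et al. system:

```python
import numpy as np

class SSI:
    """Low-rank supervised semantic indexing: f(a, d) = d^T (U^T V + I) a."""
    def __init__(self, dim, rank, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.normal(scale=0.01, size=(rank, dim))
        self.V = rng.normal(scale=0.01, size=(rank, dim))
        self.lr = lr

    def score(self, d, a):
        return float((self.U @ d) @ (self.V @ a) + d @ a)      # d^T U^T V a + d^T a

    def update(self, d, a_pos, a_neg):
        """One sub-gradient step on max(0, 1 - f(d, a+) + f(d, a-))."""
        if 1.0 - self.score(d, a_pos) + self.score(d, a_neg) <= 0.0:
            return                                              # margin satisfied, no update
        diff = a_pos - a_neg
        self.U += self.lr * np.outer(self.V @ diff, d)          # descend on the hinge loss
        self.V += self.lr * np.outer(self.U @ d, diff)

rng = np.random.default_rng(7)
model = SSI(dim=1000, rank=50)
d, a_pos, a_neg = rng.random(1000), rng.random(1000), rng.random(1000)
model.update(d, a_pos, a_neg)
```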
Motivation: Fuzzy Name Search
• A misspelled query such as "Lusie Wanderbendie" against an English employee directory should still retrieve "Lucy Vanderwende" / "Lucia Vanderwend"

Motivation: Multilingual Name Search
• A query in another script, e.g. "ಲೂಸಿ ವ ಾಂಡರ್ವ ಾಂಡಿ" in Kannada, against the same English directory should also retrieve "Lucy Vanderwende" / "Lucia Vanderwend"

Similarity Search: Challenges
• Computing nearest neighbors in high dimensions using geometric search techniques is very difficult
  – All methods are as bad as brute-force linear search, which is expensive
  – Approximate techniques such as ANN perform efficiently in dimensions as high as 20; in higher dimensions, the results are rather spotty
• Need to do search on commodity hardware; need cross-language search

Multilingual Hashing for Similarity Search
• Map names (in any language or script) to language-independent binary hash codes; similarity is computed on the hash codes
• Search overview: compute the query's hash code and rank the directory names by Hamming distance to it
• What is the advantage? Scales easily to very large databases; compact language-independent representation (32 bits per object); search is effective and efficient (Hamming nearest-neighbor search; a few milliseconds per query for searching a million objects, single thread on a single processor)
• What is the challenge? Language/script-independent hash codes; learning hash functions from training data

Hash Functions for Multilingual People Search [Udupa & Kumar, 2010]
• Training data: parallel names; parallel names should get similar hash codes, e.g. g(Rick) = 10101100 and h(ರಿಕ್) = 10101100; g(Rashid) = 10111111 and h(ರಶ್ಮೇದ್) = 11111111
• Feature vectors: character bigram features (e.g., ^R, Ra, as, sh, hi, id, d$ for "Rashid")
• 1-bit hash function: a linear projection followed by thresholding at 0
• K-bit hash functions: composed of K 1-bit hash functions
• Learning hash functions: minimize the Hamming distance between parallel names, subject to the bits being uncorrelated and each bit being +1 for 50% of the names and −1 for the other 50%; a linear relaxation of this problem is solved by Canonical Correlation Analysis (Hotelling, 1936)
• Summary: given a set of parallel names as training data, find the top K projection vectors for each language using CCA; each projection vector gives a 1-bit hash function; the hash code of a name is computed by projecting its feature vector onto the projection vectors, followed by binarization

Fuzzy Name Search: Experimental Setup
• Test sets: DUMBTIONARY (1231 misspelled names), INTRANET (200 misspelled names)
• Name directories: DUMBTIONARY – 550K names from Wikipedia; INTRANET – 150K employee names
• Training data: 15K pairs of single-token names in English and Hindi
• Baselines: two popular search engines, Double Metaphone (DM), BM25

Fuzzy Name Search: Results (Precision@1)
• DUMBTIONARY: S1 86.12, S2 79.33, DM 78.95, BM25 84.70, M-Hash 87.93, B-Hash 92.53
• INTRANET: DM 54.00, BM25 56.92, M-Hash 70.65, B-Hash 77.79

Multilingual Name Search: Experimental Setup and Results
• Test sets: 1000 multi-word names each in Russian, Hebrew, Kannada, Tamil, and Hindi
• Name directory: English Wikipedia titles (6 million titles, 2 million unique words)
• Baseline: state-of-the-art machine transliteration (NEWS 2009)
• Precision@1:
                   Russian  Hebrew  Kannada  Tamil  Hindi
  Transliteration  0.48     –       0.52     0.29   0.49
  B-Hash           0.67     0.69    0.68     0.68   0.69
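A sketch of hashing-based name search as summarized above: project character-bigram feature vectors onto per-language CCA projection vectors, binarize at 0, and rank by Hamming distance. The feature dimensions, code length, and random stand-ins for the learned projections and name vectors are assumptions:

```python
import numpy as np

def hash_codes(F, P):
    """K-bit codes: project feature vectors F (n x d) onto the K projection vectors in P (d x K)
    and threshold at 0."""
    return (F @ P > 0).astype(np.uint8)

def hamming(q, H):
    """Hamming distance between one query code and every database code."""
    return (q != H).sum(axis=1)

rng = np.random.default_rng(8)
d_en, d_kn, K = 300, 400, 32
P_en, P_kn = rng.normal(size=(d_en, K)), rng.normal(size=(d_kn, K))  # stand-ins for CCA projections
directory = hash_codes(rng.random((10000, d_en)), P_en)              # English name directory
query = hash_codes(rng.random((1, d_kn)), P_kn)[0]                   # Kannada query name
nearest = np.argsort(hamming(query, directory))[:10]                 # 10 nearest names by Hamming distance
```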
Summary
• [Figure: number of dimensionality-reduction papers per year in vision and NLP, 1990-2010, and their popularity compared to Bayesian approaches]
• Dimensionality reduction has merits for NLP – computational benefits and exploiting feature correlations
• It has mostly been explored in an unsupervised fashion, but there are recent novel developments, especially for multi-view data
• If you can formulate your problem as a mapping, try dimensionality reduction – you can often solve for the global optimum
• Spectral learning provides a way to learn the global optimum for generative models
• Enriching existing models – using word embeddings instead of words
• Scalability of the techniques – doesn't depend on the number of examples; large-scale SVD

References
• Hotelling, H. 1933. Analysis of a Complex of Statistical Variables into Principal Components. Journal of Educational Psychology, 24(6 & 7), 417-441 & 498-520.
• Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. 2007. Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press.
• Platt, J. C., Toutanova, K., and Yih, W. 2010. Translingual document representations from discriminative projections. In EMNLP 2010, pages 251-261.
• Hyvärinen, A. and Oja, E. 2000. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5), 411-430.
• Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(4-5), 993-1022.
• Roweis, S. T. and Saul, L. K. 2000. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290 (22 December 2000), 2323-2326.
• Schütze, H. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1), 97-124.
• Schütze, H. 1995. Distributional part-of-speech tagging. In EACL 7, pages 141-148.
• Lamar, M., Maron, Y., Johnson, M., and Bienenstock, E. 2010. SVD and clustering for unsupervised POS tagging. In ACL 2010, pages 215-219.
• Deerwester, S., et al. 1988. Improving Information Retrieval with Latent Semantic Indexing. In Proceedings of the 51st Annual Meeting of the American Society for Information Science, 25, pages 36-40.
• Dumais, S. T. 2005. Latent Semantic Analysis. Annual Review of Information Science and Technology, 38, 188.
• Hofmann, T. 1999. Probabilistic Latent Semantic Indexing. In SIGIR 1999.
• Tenenbaum, J. B., de Silva, V., and Langford, J. C. 2000. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290 (22 December 2000).
• Belkin, M. and Niyogi, P. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373-1396.
• Turney, P. D. and Pantel, P. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research.
• Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3.
• Turian, J., Ratinov, L., and Bengio, Y. 2010. Word representations: a simple and general method for semi-supervised learning. In ACL 2010, pages 384-394.
• Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research.
• Hotelling, H. 1936. Relation between two sets of variables. Biometrika, 28, 322-377.
• Haghighi, A., Liang, P., Berg-Kirkpatrick, T., and Klein, D. 2008. Learning bilingual lexicons from monolingual corpora. In ACL 2008, pages 771-779.
• Daumé III, H. and Jagarlamudi, J. 2011. Domain adaptation for machine translation by mining unseen words. In ACL 2011.
• Bai, B., Weston, J., Grangier, D., Collobert, R., Sadamasa, K., Qi, Y., Chapelle, O., and Weinberger, K. 2009. Supervised semantic indexing. In CIKM 2009, Hong Kong, China.
• Jagarlamudi, J. and Daumé III, H. 2012. Low-Dimensional Discriminative Reranking. In HLT-NAACL 2012.
• Kumar, S. and Udupa, R. 2011. Learning Hash Functions for Cross-View Similarity Search. In IJCAI 2011.
• Udupa, R. and Kumar, S. 2010. Hashing-based Approaches to Spelling Correction of Personal Names. In EMNLP 2010.
• Udupa, R. and Khapra, M. 2010. Transliteration Equivalence using Canonical Correlation Analysis. In ECIR 2010.
• Jagarlamudi, J. and Daumé III, H. 2012. Regularized Interlingual Projections: Evaluation on Multilingual Transliteration. In EMNLP-CoNLL 2012.