COLING 2012 Tutorial
December 08, 2012
Revisiting Dimensionality
Reduction Techniques for NLP
Jagadeesh Jagarlamudi
Raghavendra Udupa
University of Maryland
Microsoft Research India
Road Map
• Introduction
• NLP and Dimensionality Reduction
• Mathematical Background
• Data with Single View
• Techniques
• Applications
• Advanced Topics
• Data with Multiple Views
• Techniques
• Applications
• Advanced Topics
• Summary
NLP and Dimensionality Reduction
Dimensionality Reduction: Motivation
• Many applications involve high dimensional (and often sparse) data
• High dimensional data poses several challenges
  – Computational cost
  – Difficulty of interpretation
  – Overfitting
• However, data often lies (approximately) in a low dimensional manifold embedded in the high dimensional space
Dimensionality Reduction: Goal
• Given high dimensional data, discover the underlying low dimensional structure
[Figure: 560-dimensional face images and their 2D embedding; He et al., Face Recognition Using LaplacianFaces]
Dimensionality Reduction:
Benefits
• Computational Efficiency
– K-Nearest Neighbor Search
• Data Compression
– Less storage; millions of data points in RAM
• Data Visualization
– 2D and 3D Scatter Plots
• Latent Structure and Semantics
• Feature Extraction
– Removing distracting variance from data sets
Dimensionality Reduction:
Techniques
• Projective Methods
– find low dimensional projections that extract
useful information from the data, by maximizing
a suitable objective function
– PCA, ICA, LDA
• Manifold Modeling Methods
– find low dimensional subspace that best
preserves the manifold structure in the data, by
modelling the manifold structure
– LLE, Isomap, Laplacian Eigenmaps
Dimensionality Reduction:
Relevance to NLP
• High dimensional data in NLP
– Text Documents
– Context Vectors
• How can Dimensionality Reduction help?
– ‘Semantic’ similarity of documents
– Correlate semantically related terms
– Crosslingual similarity
Mathematical Background
Linear Transformation
Linear Transformation: Illustration
Data Centering
• Dataset: $X = [x_1, \ldots, x_n] \in R^{d \times n}$
• Mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Centering: $\tilde{x}_i = x_i - \mu$
• Centered dataset: $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n]$
• Mean after centering: $\tilde{\mu} = \frac{1}{n}\sum_{i=1}^{n} \tilde{x}_i = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu) = \mu - \mu = 0$
• Mean after linear transformation: $\frac{1}{n}\sum_{i=1}^{n} A\tilde{x}_i = \frac{1}{n}\sum_{i=1}^{n} A(x_i - \mu) = A\mu - A\mu = 0$
Data Variance
• Dataset: $X = [x_1, \ldots, x_n] \in R^{d \times n}$
• Centered: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = 0$ (centering doesn't change the data variance)
• Variance: $\frac{1}{n}\sum_{i=1}^{n} \|x_i\|^2 = \frac{1}{n}\mathrm{Tr}(XX^T) = \mathrm{Tr}(C_X)$, where $C_X = \frac{1}{n}XX^T$ (sample covariance)
• Transformed dataset: $AX$
• Variance after transformation: $\frac{1}{n}\sum_{i=1}^{n} \|Ax_i\|^2 = \frac{1}{n}\mathrm{Tr}(AXX^TA^T) = \mathrm{Tr}(AC_XA^T)$
Positive Definite Matrices
• Real: 𝑀 ∈ 𝑅𝑝×𝑞
• Square: 𝑝 = 𝑞
• Symmetric: 𝑀𝑖𝑗 = 𝑀𝑗𝑖
• Positive: 𝑥 𝑇 𝑀𝑥 > 0 for all 𝑥 ≠ 0
• Examples:
  – Identity matrix
  – $\begin{pmatrix} 1 & 1 \\ 1 & 5 \end{pmatrix}$
  – $C_X$, $AC_XA^T$
• Cholesky decomposition: 𝑀 = 𝐿𝐿𝑇
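A quick NumPy check of these properties (an illustrative sketch, not part of the tutorial; the example matrix and the random data are the only inputs assumed):

```python
import numpy as np

# Check positive definiteness via Cholesky, using the 2x2 example matrix above.
M = np.array([[1.0, 1.0],
              [1.0, 5.0]])
try:
    L = np.linalg.cholesky(M)    # M = L L^T exists only if M is positive definite
    print("positive definite, L =\n", L)
except np.linalg.LinAlgError:
    print("not positive definite")

# A sample covariance C_X = (1/n) X X^T is (at least) positive semi-definite:
X = np.random.randn(3, 100)
X -= X.mean(axis=1, keepdims=True)               # center the data
C = X @ X.T / X.shape[1]
print(np.all(np.linalg.eigvalsh(C) >= -1e-12))   # True
```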
Eigenvalues and Eigenvectors
• 𝑀 ∈ 𝑅𝑝×𝑝
• 𝑀𝑢 = 𝜆𝑢 where 𝑢 is a vector and 𝜆 is a scalar
– eigenvector 𝑢, eigenvalue 𝜆
– {𝜆𝑖 } eigenvalues of M
• Trace: $\mathrm{Tr}(M) = \sum_{i=1}^{p} M_{ii} = \sum_i \lambda_i$
• Rank: Number of non-zero eigenvalues
Eigensystem of Positive-Definite Matrices
• $M \in R^{p \times p}$
• Positive eigenvalues: $\lambda_j > 0$
• Real valued eigenvectors: $u_j \in R^p$
• Orthonormal eigenvectors: $\lambda_i \neq \lambda_j \Rightarrow u_i^T u_j = 0$ and $u_i^T u_i = 1$ (i.e. $U^TU = I$)
• Full rank: $\mathrm{Rank}(M) = p$
• Eigen decomposition: $M = U\Lambda U^T$
Data Variance and Eigenvalues
• Centered dataset: $X = [x_1, \ldots, x_n]$, $x_i \in R^d$
• Data variance: $\frac{1}{n}\sum_{i=1}^{n} \|x_i\|^2 = \mathrm{Tr}(C_X)$
• Eigen decomposition: $C_X = U\Lambda U^T$
• Data variance: $\mathrm{Tr}(C_X) = \sum_i \lambda_i$
Road Map
• Introduction
• NLP and Dimensionality Reduction
• Mathematical Background
• Data with Single View
• Techniques
• Applications
• Advanced Topics
• Data with Multiple Views
• Techniques
• Applications
• Advanced Topics
• Summary
Data with Single View:
Techniques
• Principal Components Analysis
• Singular Value Decomposition
• Oriented Principal Components Analysis
Principal Components Analysis (PCA)
• Centered dataset: $X = [x_1, \ldots, x_n]$, $x_i \in R^d$
• Goal: find an orthonormal linear transformation $T: R^d \to R^k$ that maximizes data variance
   $T(x) = Ax$  (linear transformation)
   $AA^T = I$  (orthonormal basis)
   $\mathrm{Tr}(AC_XA^T)$  (data variance)
• Mathematical formulation:
   $A^* = \underset{A \in R^{k \times d},\ AA^T = I}{\operatorname{argmax}}\ \mathrm{Tr}(AC_XA^T)$
PCA: Solution
• Eigen decomposition of $C_X$:
   $C_X = U\Lambda U^T$
   $U = [u_1, u_2, \ldots, u_d]$,  $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$,  $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$
• $A = [u_1, u_2, \ldots, u_k]^T$
• $T(x) = Ax = [u_1, u_2, \ldots, u_k]^T x$
• MATLAB function: princomp()
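princomp() is MATLAB's built-in; as a rough, non-authoritative sketch of the same recipe in Python/NumPy (toy random data assumed):

```python
import numpy as np

def pca(X, k):
    """X: d x n data matrix (columns are points). Returns k x d projection A and eigenvalues."""
    Xc = X - X.mean(axis=1, keepdims=True)       # center the data
    C = Xc @ Xc.T / X.shape[1]                   # sample covariance C_X
    eigvals, U = np.linalg.eigh(C)               # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # sort descending
    A = U[:, order[:k]].T                        # top-k eigenvectors as rows
    return A, eigvals[order]

X = np.random.randn(5, 200)
A, lam = pca(X, k=2)
Z = A @ (X - X.mean(axis=1, keepdims=True))      # T(x) = A x, the 2-D embedding
print(Z.shape)                                   # (2, 200)
```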
[Scatter plots of the PCA projections of product review data. Red: Books, Green: Kitchen, Blue: DVD, Magenta: Electronics]
PCA: Solution (contd.)
• Data variance after transformation:
   $AX = [u_1, \ldots, u_k]^T X$
   $AXX^TA^T = [u_1, \ldots, u_k]^T XX^T [u_1, \ldots, u_k] = [u_1, \ldots, u_k]^T U\Lambda U^T [u_1, \ldots, u_k] = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$
   $\mathrm{Tr}(AXX^TA^T) = \sum_{i=1}^{k} \lambda_i$
• Contribution of the $j$-th component to the data variance: $\lambda_j / \sum_{i=1}^{k} \lambda_i$
PCA: Properties
• PCA decorrelates the dataset: $AC_XA^T = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$
• PCA gives rank k reconstruction with
minimum squared error
• PCA is sensitive to the scaling of the original
features
Data with Single View:
Techniques
• Principal Components Analysis
• Singular Value Decomposition
• Oriented Principal Components Analysis
Singular Value Decomposition (SVD)
• Dataset: $X = [x_1, \ldots, x_n]$, $x_i \in R^d$
• $X = U\Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T$
   $r = \mathrm{rank}(X)$
   $U \in R^{d \times d}$ such that $U^TU = I$ (left singular vectors)
   $V \in R^{n \times d}$ such that $V^TV = I$ (right singular vectors)
   $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_d) \in R^{d \times d}$ (singular values)
• Low rank approximation:
   $\hat{X} = U\hat{\Sigma}V^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$
   $\hat{\Sigma} = \mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)$,  $k \leq d$
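A minimal NumPy sketch of this low-rank approximation (illustrative only; the toy matrix and the choice k = 10 are assumptions):

```python
import numpy as np

X = np.random.randn(50, 200)
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T

k = 10
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # sum_{i<=k} sigma_i u_i v_i^T

# By Eckart-Young, X_k is the best rank-k approximation in Frobenius norm:
print(np.linalg.norm(X - X_k), np.sqrt(np.sum(s[k:] ** 2)))  # equal up to rounding
```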
SVD and Data Sphering
• Centered dataset: $X = [x_1, \ldots, x_n]$, $x_i \in R^d$
• $X = U\Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T$
   – $XX^T = U\Sigma^2 U^T = \sum_{i=1}^{r} \sigma_i^2 u_i u_i^T$
   – Note that $\left(\frac{1}{\sigma_j}u_j\right)^T XX^T \left(\frac{1}{\sigma_j}u_j\right) = 1$
• Let $U_k = [u_1, \ldots, u_k]$, $V_k = [v_1, \ldots, v_k]$, $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$, $k \leq r$
• $\Sigma_k^{-1} U_k^T XX^T U_k \Sigma_k^{-1} = I$
• $AXX^TA^T = I$ where $A = \Sigma_k^{-1} U_k^T$
• The linear transformation $A = \Sigma_k^{-1} U_k^T$ decorrelates the data set
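A small sketch of this sphering transformation in NumPy (toy centered data assumed; not code from the tutorial):

```python
import numpy as np

X = np.random.randn(5, 500)
X -= X.mean(axis=1, keepdims=True)                 # centered data

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
A = np.diag(1.0 / s[:k]) @ U[:, :k].T              # A = Sigma_k^{-1} U_k^T

Z = A @ X
print(np.round(Z @ Z.T, 6))                        # identity: the data is decorrelated
```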
SVD and Eigen Decomposition
• Dataset: $X = [x_1, \ldots, x_n]$, $x_i \in R^d$
• $X = U\Sigma V^T$
   $XX^T = U\Sigma V^T V\Sigma U^T = U\Sigma^2 U^T$ (eigen decomposition)
   $X^TX = V\Sigma U^T U\Sigma V^T = V\Sigma^2 V^T$ (eigen decomposition)
• SVD and PCA:
   SVD on centered $X$ is the same as PCA on $X$
Data with Single View:
Techniques
• Principal Components Analysis
• Singular Value Decomposition
• Oriented Principal Components Analysis
Oriented Principal Components
Analysis (OPCA)
• Generalization of PCA
  – Along with the signal covariance $C_X$, a noise covariance $C_N$ is available
• When $C_N = I$ (white noise), OPCA = PCA
  – Seeks projections that maximize the ratio of the variance of the projected signal to the variance of the projected noise
  – Mathematical formulation:
    $A^* = \underset{A \in R^{k \times d},\ AC_NA^T = I}{\operatorname{argmax}}\ \mathrm{Tr}(AC_XA^T)$
OPCA: Solution
• Generalized eigenvalue problem:
   $C_X U = C_N U\Lambda$
   Equivalent eigenvalue problem: $C_N^{-1/2} C_X C_N^{-1/2} V = V\Lambda$ where $V = C_N^{1/2} U$
   $U = [u_1, u_2, \ldots, u_d]$,  $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$,  $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$
• $A = [u_1, u_2, \ldots, u_k]^T$
• $T(x) = Ax = [u_1, u_2, \ldots, u_k]^T x$
• MATLAB function: eig()
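A hedged sketch of the OPCA solution via SciPy's generalized symmetric eigensolver (the toy signal/noise covariances and the symbol C_N for the noise covariance are assumptions of this illustration):

```python
import numpy as np
from scipy.linalg import eigh

d, k = 10, 3
S = np.random.randn(d, 200); N = np.random.randn(d, 200)
C_X = S @ S.T / 200                        # signal covariance
C_N = N @ N.T / 200 + 1e-3 * np.eye(d)     # noise covariance (made non-singular)

# Generalized eigenvalue problem  C_X u = lambda C_N u
lam, U = eigh(C_X, C_N)                    # ascending generalized eigenvalues
A = U[:, np.argsort(lam)[::-1][:k]].T      # top-k generalized eigenvectors as rows

print(np.round(A @ C_N @ A.T, 6))          # ~ I: the projected noise is whitened
```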
OPCA: Properties
• Projections remain the same when the noise
and signal vectors are globally scaled with
two different scale factors
• Projected data is not necessarily uncorrelated
• Can be extended to multiview data [Platt et
al, EMNLP 2010]
Data with Single View:
Applications
• Word Sense Discrimination
• Part-of-Speech Tagging
• Information Retrieval
Data Representation
• Vector space characteristics
  – What are the features?
  – What are the feature weights?
[Figure: two points in a 2D feature space (Feature 1 vs. Feature 2), each placed by its feature weights, with the distance between them marked]
Popular Feature Space Models
• Vector Space Model
– Document is represented as bag-of-words
– Features: words
– Feature weight: TF(𝑤𝑖 , 𝑑) or some variant
• Word Space Model
– Word is represented in terms of its context words
– Features: words (with or without position)
– Feature weight: Freq(𝑤𝑗 , 𝑤𝑖 )
Turney and Pantel 2010
Curse of dimensionality
• We have observations 𝑥𝑖 ∈ 𝑅𝑑
• 𝑑 is usually very large
– Vector Space Models
• 𝑑 = vocab size (number of words in a language)
– Word Space Models
• 𝑑 = vocab size (if position is ignored)
• 𝑑 = V × L where L is window length
• Curse of dimensionality
Data with Single View:
Applications
• Word Sense Discrimination
• Part-of-Speech Tagging
• Information Retrieval
Word Sense Discrimination
• Identify which occurrences of a word have the same meaning
– Different from word sense disambiguation
– E.g.: suit (focus word)
C1: … they buried him in his best suit …
C2: … the family brought suit against the landlord …
C3: … judge dismisses suit against yelp …
C4: … the right suit size …
– Doesn’t need external knowledge
Schütze 1998
Word Sense Discrimination
• Analysis:
Group 1
… they buried him in his best suit …
… the right suit size …
Group 2
… the family brought suit against the landlord …
… judge dismisses suit against yelp …
• Testing:
  … filed suit in small claims court …  →  Group 1 or Group 2?
Word Sense Discrimination
• Aim: Cluster contexts based on their meaning
• Steps:
  1. Represent each word as a point in vector space (word vectors), with dimensionality reduction
  2. Represent each context as a point (context vectors)
  3. Cluster the points using a clustering algorithm (sense vectors)
• Vector space:
  – Use words as the features
  – Feature weight is the co-occurrence strength
Word Sense Discrimination :
1. Word Vectors
• Represent each word in terms of context words

              legal   clothes   …
  judge        210       75     …
  robe          50      250     …
  law          240       50     …
  suit         147      157     …
  dismisses     96      152     …

[Figure: the word vectors of judge, law, dismisses, suit and robe plotted in the legal-clothes plane]
Word Sense Discrimination :
2. Context Vectors
• A context vector is the centroid of all the word vectors in the context
  C3: … judge dismisses law suit …
[Figure: the word vectors and the centroid of C3 plotted in the legal-clothes plane]
Word Sense Discrimination :
3. Sense Vectors
• Cluster all the context vectors
  C1: … they buried him in his best suit …
  C2: … the family brought suit against the landlord …
  C3: … judge dismisses suit against yelp …
  C4: … the right suit size …
[Figure: the context vectors in the legal-clothes plane; Sense Vector 1 is the centroid of {C2, C3}, Sense Vector 2 is the centroid of {C1, C4}]
Word Sense Discrimination :
Testing
C: … filed suit in small claims court …
• Assign C to the closest sense vector
[Figure: the test context C plotted against the two sense vectors in the legal-clothes plane]
Word Sense Discrimination :
Dimensionality Reduction
• Reduce the dimensionality of the word vectors
  $W$ = the word × context-word co-occurrence matrix shown earlier (judge, robe, law, suit, dismisses × legal, clothes, …)
• $W = U\Sigma V^T$
• $W^{new} \leftarrow [u_1, \cdots, u_k]$
Word Sense Discrimination :
Results & Discussion
• Averaged results on 20 words

  Condition          Accuracy
  χ², terms          76
  χ², SVD            90
  Frequency, terms   81
  Frequency, SVD     88
Schütze 1998
Data with Single View:
Applications
• Word Sense Discrimination
• Part-of-Speech Tagging
• Information Retrieval
Part-of-Speech (POS) Tagging
• Given a sentence, label each word with its POS tag
    I    ate   an   apple   .
    NN   VB    DT   NN      .
• Unsupervised Approaches
– Attempt to cluster words
– Align each cluster with a POS tag
– Do not assume a dictionary of tags
Schütze 1995, Lamar et al 2010
Part-of-Speech Tagging
• Steps
  1. Represent words in an appropriate vector space (with dimensionality reduction)
  2. Cluster using your favorite algorithm
• The vector space should capture syntactic properties
  – Use the most frequent 𝑑 words as features
  – Frequency of a word in the context as the feature weight
Part-of-Speech Tagging :
Pass 1
• Construct left and right context matrices
  – $L$ and $R$, matrices of size $V \times d$
• Dimensionality reduction: get rank $r_1$ approximations
   $L = U_L \Sigma_L V_L^T$,  $L^* = U_L^* \Sigma_L^*$,  $L^{**} \leftarrow$ normalized $L^*$
   $R = U_R \Sigma_R V_R^T$,  $R^* = U_R^* \Sigma_R^*$,  $R^{**} \leftarrow$ normalized $R^*$
  – $D = [L^{**}\ R^{**}]$ is a $V \times 2r_1$ matrix
• Run weighted $k$-means on $D$ with $k_1$ clusters
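A rough sketch of Pass 1 in Python (assumptions: random toy count matrices and scikit-learn's plain k-means instead of the weighted k-means used by Lamar et al.):

```python
import numpy as np
from sklearn.cluster import KMeans

V, d, r1, k1 = 1000, 500, 50, 100
L = np.random.rand(V, d)                       # left-context counts (V x d)
R = np.random.rand(V, d)                       # right-context counts (V x d)

def reduce_and_normalize(M, r):
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    M_star = U[:, :r] * s[:r]                  # rank-r representation U* Sigma*
    norms = np.linalg.norm(M_star, axis=1, keepdims=True)
    return M_star / np.maximum(norms, 1e-12)   # row-normalize

D = np.hstack([reduce_and_normalize(L, r1),    # D = [L** R**], a V x 2*r1 matrix
               reduce_and_normalize(R, r1)])
clusters = KMeans(n_clusters=k1, n_init=10).fit_predict(D)
print(clusters.shape)                          # one cluster id per word type
```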
Part-of-Speech Tagging :
Pass 2
• The clusters are not optimal because of sparsity
  – Construct $L^{new}$ and $R^{new}$ of size $V \times k_1$
• Dimensionality reduction: get rank $r_2$ approximations
   $L^{new} = U_L \Sigma_L V_L^T$,  $L^* = U_L^* \Sigma_L^*$,  $L^{**} \leftarrow$ normalized $L^*$
   $R^{new} = U_R \Sigma_R V_R^T$,  $R^* = U_R^* \Sigma_R^*$,  $R^{**} \leftarrow$ normalized $R^*$
  – $D = [L^{**}\ R^{**}]$ is a $V \times 2r_2$ matrix
• Run weighted $k$-means on $D$
Part-of-Speech Tagging :
Results
• Penn Treebank (1.1M tokens, 43K types)
  – 17 and 45 tags

  Model              PTB17          PTB45
  SVD2               0.730          0.660
  HMM-EM             0.647          0.621
  HMM-VB             0.637          0.605
  HMM-GS             0.674          0.660
  HMM-Sparse(32)     0.702 (2.2)    0.654 (1.0)
  VEM(10⁻¹, 10⁻¹)    0.682 (0.8)    0.546 (1.7)
Lamar et al 2010
Part-of-Speech Tagging :
Discussion
• Sensitivity to parameters
• Scaling with singular values
• 𝑘-means algorithm
– Weighted 𝑘-means
– Clusters are initialized to most frequent word types
• Non-disambiguating tagger
• Very simple algorithm
Data with Single View:
Applications
• Word Sense Discrimination
• Part-of-Speech Tagging
• Information Retrieval
Information Retrieval
• Rank documents 𝑑 in response to a query 𝑞
• Vector Space Model
– Query and doc. are represented as bag-of-words
– Features: words
– Feature weight: TFIDF
• Lexical Gap
– Polysemy and Synonymy
Information Retrieval :
Lexical Gap
• Term × Document matrix $C$ (binary counts; TFIDF weighting is better!)

            d1   d2   d3   d4   d5   d6
  ship       1    0    1    0    0    0
  boat       0    1    0    0    0    0
  ocean      1    1    0    0    0    0
  voyage     1    0    0    1    1    0
  trip       0    0    0    1    0    1
Information Retrieval :
Latent Semantic Analysis
• Term × Document matrix $C_{V \times D}$
• Steps:
  1. Dimensionality reduction of the term × document matrix
  2. Folding-in queries: $q_{red} \leftarrow f(q)$
  3. Compute the semantic similarity $\mathrm{score}(q, d)$
Information Retrieval :
Latent Semantic Analysis
• Term × Document matrix $C_{V \times D}$
• Steps:
  1. Dimensionality reduction of the term × document matrix
     $C = U\Sigma V^T$,   $C_k = U_k \Sigma_k V_k^T$
     [$C_k$ ($V \times D$) = $U_k$ ($V \times k$) · $\Sigma_k$ ($k \times k$) · $V_k^T$ ($k \times D$); the rows of $U_k$ represent terms, the columns of $V_k^T$ represent documents]
Information Retrieval :
Latent Semantic Analysis
• Term × Document matrix $C_{V \times D}$
• Steps:
  1. Dimensionality reduction of the term × document matrix
     $C = U\Sigma V^T$,   $C_k = U_k \Sigma_k V_k^T$
     A column $d_{orig}$ of $C$ corresponds to a column $d_{red}$ of $V_k^T$, with $d_{orig} = U_k \Sigma_k d_{red}$
Information Retrieval :
Latent Semantic Analysis
• Term × Document matrix $C_{V \times D}$
• Steps:
  1. Dimensionality reduction: $C_k = U_k \Sigma_k V_k^T$
  2. Folding-in queries:
     $d_{orig} = U_k \Sigma_k d_{red}$  ⟹  $d_{red} = \Sigma_k^{-1} U_k^T d_{orig}$  ⟹  $q_{red} = \Sigma_k^{-1} U_k^T q$
Information Retrieval :
Latent Semantic Analysis
• Term × Document matrix $C_{V \times D}$
• Steps:
  1. Dimensionality reduction: $C_k = U_k \Sigma_k V_k^T$
  2. Folding-in queries: $q_{red} = \Sigma_k^{-1} U_k^T q$
  3. Semantic similarity:
     $\mathrm{Score}(q_{orig}, d_{orig}) \leftarrow \cos(q_{red}, d_{red}) = \dfrac{\langle q_{red}, d_{red} \rangle}{|q_{red}||d_{red}|}$   ($\langle \cdot, \cdot \rangle$ denotes the dot product)
Deerwester 1988; Dumais 2005
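A minimal end-to-end LSA sketch in NumPy (toy random term-document matrix and query assumed; illustrative, not the original experimental code):

```python
import numpy as np

C = np.random.rand(5000, 200)                      # term x document matrix (V x D)
U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 50
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
docs_red = Vt_k                                    # k x D: reduced documents

q = np.random.rand(5000)                           # query as a bag-of-words vector
q_red = np.linalg.inv(S_k) @ U_k.T @ q             # fold-in: Sigma_k^{-1} U_k^T q

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

scores = [cosine(q_red, docs_red[:, j]) for j in range(docs_red.shape[1])]
print(int(np.argmax(scores)))                      # best-matching document
```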
Information Retrieval :
Lexical Gap Revisited
• Term × Document matrix $C$

            d1   d2   d3   d4   d5   d6
  ship       1    0    1    0    0    0
  boat       0    1    0    0    0    0
  ocean      1    1    0    0    0    0
  voyage     1    0    0    1    1    0
  trip       0    0    0    1    0    1

• New document representations

           d1      d2      d3      d4      d5      d6
  Dim 1   -1.62   -0.60   -0.44   -0.97   -0.70   -0.26
  Dim 2   -0.46   -0.84   -0.30    1.00    0.35    0.65
Information Retrieval :
Results & Discussion
• Term × Document matrix $C$
• Retrieval results on four test collections:

             MED    CRAN   CACM   CISI
  Cos+tfidf  49     35.2   21.9   20.2
  LSA        64.6   38.7   23.8   21.9
  PLSI-U     69.5   38.9   25.3   23.3
  PLSI-Q     63.2   38.6   26.6   23.1
• Fold-in new documents as well
– Deviates from the optimal as we add more docs.
Hofmann 1999
Data with Single View:
Advanced Topics
• Non-linear Dimensionality Reduction
• Neural Embeddings
Non-linear Dimensionality Reduction
• Non-linear dimensionality reduction
– Locally linear but globally non-linear
– E.g.: Locally Linear Embedding, Laplacian Eigenmaps
• Locally Linear Embedding
[Figure: the three LLE steps: (1) select the neighbours of $x_i$, (2) compute reconstruction weights $w_{ij}$, $w_{ik}$ from the neighbours $x_j$, $x_k$, (3) find low-dimensional points $y_i$, $y_j$, $y_k$ that preserve the same weights]
Non-linear Dimensionality Reduction
• Laplacian Eigenmaps
  – Weight matrix $W$ with similarities over a local neighbourhood
  – $D_{ii} = \sum_j W_{ij}$ and $L = D - W$
  – $\arg\min_u u^T L u$ s.t. $u^T D u = 1$  ⟹  generalized eigenvalue problem $Lu = \lambda Du$
  – Note: $u^T L u = \frac{1}{2}\sum_{ij} W_{ij}(u_i - u_j)^2$
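A small Laplacian Eigenmaps sketch (assumptions: toy data, a 10-nearest-neighbour heat-kernel graph, and SciPy's generalized eigensolver):

```python
import numpy as np
from scipy.linalg import eigh

X = np.random.randn(100, 20)
dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

W = np.exp(-dist2)                         # heat-kernel similarities
knn = np.argsort(dist2, axis=1)[:, 1:11]   # keep 10 nearest neighbours (excluding self)
mask = np.zeros_like(W, dtype=bool)
np.put_along_axis(mask, knn, True, axis=1)
W = W * (mask | mask.T)                    # symmetric local neighbourhood graph

D = np.diag(W.sum(axis=1))
L = D - W
lam, U = eigh(L, D)                        # generalized eigenvectors, ascending eigenvalues
Y = U[:, 1:3]                              # skip the trivial constant eigenvector
print(Y.shape)                             # 2-D embedding of the 100 points
```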
Data with Single View:
Advanced Topics
• Non-linear Dimensionality Reduction
• Neural Embeddings
Neural Embeddings
• Dimensionality reduction with Neural Nets
• Task: Statistical Language Modeling
– Model the next word given the context
– “The cat is walking in the bedroom”
  input          output
  the cat        is walking
  cat is         walking in
  is walking     in the
  walking in     the bedroom
  ……             …….
Bengio et al 2003
Neural Embeddings
• A word is represented as a vector of size $m$
[Figure: the feed-forward neural LM. Input: the concatenated vectors of the context words (length $3m$). Hidden layer of length $h$, with non-linearity introduced by tanh. Output layer of length $V$; the $i$-th output gives $p(w_t = i \mid \mathrm{context})$.]
Neural Embeddings
• A word is represented as a vector of size $m$
[Figure: the context words "the", "cat", "is" are fed through the network; the output assigns high probability to "walking"]
Neural Embeddings
• A word is represented as a vector of size $m$
[Figure: the context words "cat", "is", "walking" are fed through the network; the output assigns high probability to "in"]
Neural Embeddings
• Word is represented as a vector of size 𝑚
• Learning
– Optimize such that log-likelihood is maximized
– Gradient ascent
– Learns parameters and word vectors simultaneously
– Learned word-vectors capture semantics
• Learn to perform multiple tasks simultaneously
Bengio et al 2003; Collobert and Weston 2008
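A tiny forward-pass sketch of a Bengio-style neural LM in NumPy (all sizes, the word ids, and the omission of the optional direct input-to-output connections are assumptions):

```python
import numpy as np

V, m, h, context = 10000, 50, 100, 3

rng = np.random.default_rng(0)
E = rng.normal(0, 0.01, (V, m))             # word vectors (the embeddings)
H = rng.normal(0, 0.01, (h, context * m))   # input-to-hidden weights
b1 = np.zeros(h)
W = rng.normal(0, 0.01, (V, h))             # hidden-to-output weights
b2 = np.zeros(V)

def next_word_probs(context_ids):
    x = np.concatenate([E[i] for i in context_ids])   # input of length 3m
    a = np.tanh(H @ x + b1)                            # hidden layer (tanh)
    z = W @ a + b2                                     # output layer of length V
    z -= z.max()                                       # stabilize the softmax
    p = np.exp(z)
    return p / p.sum()                                 # p(w_t = i | context)

p = next_word_probs([12, 7, 42])            # hypothetical ids of "the", "cat", "is"
print(p.shape, p.sum())                     # (10000,) 1.0
```

Training would maximize the log-likelihood by gradient ascent, updating E, H and W (parameters and word vectors) simultaneously, as described above.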
Road Map
• Introduction
• NLP and Dimensionality Reduction
• Mathematical Background
• Data with Single View
• Techniques
• Applications
• Advanced Topics
• Data with Multiple Views
• Techniques
• Applications
• Advanced Topics
• Summary
Data with Multiple Views:
Techniques
Canonical Correlation Analysis (CCA)
• Centered datasets:
   $X = [x_1, \ldots, x_n] \in R^{d_1 \times n}$,  $Y = [y_1, \ldots, y_n] \in R^{d_2 \times n}$
• Project $X$ and $Y$ along $a \in R^{d_1}$ and $b \in R^{d_2}$:
   $s = (a^Tx_1, \ldots, a^Tx_n)^T$,  $t = (b^Ty_1, \ldots, b^Ty_n)^T$
• Data correlation after transformation:
   $\cos(s,t) = \dfrac{s^Tt}{\sqrt{s^Ts}\sqrt{t^Tt}} = \dfrac{\sum_{i=1}^{n} a^Tx_i\, b^Ty_i}{\sqrt{\sum_{i=1}^{n}(a^Tx_i)^2}\sqrt{\sum_{i=1}^{n}(b^Ty_i)^2}} = \dfrac{a^TXY^Tb}{\sqrt{a^TXX^Ta}\sqrt{b^TYY^Tb}}$
Canonical Correlation Analysis: Training
  $(a^*, b^*) = \arg\max_{a,b} \cos(X^Ta, Y^Tb) = \arg\min_{a,b} \|X^Ta - Y^Tb\|^2$
[Figure: points $x_1, x_2, x_3$ and $y_1, y_2, y_3$ in their own spaces; the projection direction pairs $(a_1, b_1)$ and $(a_2, b_2)$ map the two views onto a common line so that paired points land close together]
CCA (contd.)
• Covariance matrices:
   $C_{XY} = XY^T$,  $C_X = XX^T$,  $C_Y = YY^T$
• Correlation in terms of covariance matrices:
   $\cos(s,t) = \dfrac{a^TC_{XY}b}{\sqrt{a^TC_Xa}\sqrt{b^TC_Yb}}$
• Directions that maximize the data correlation:
   $(a^*, b^*) = \operatorname{argmax}_{a,b} \dfrac{a^TC_{XY}b}{\sqrt{a^TC_Xa}\sqrt{b^TC_Yb}}$
CCA: Formulation
• Goal: Find linear transformations 𝐴∗ , 𝐵∗ that
maximize data correlation
• Optimization problem:
  $(A^*, B^*) = \operatorname{argmax}_{A,B}\ \mathrm{Tr}(A^TXY^TB)$
  s.t.  $\mathrm{Tr}(A^TXX^TA) = 1$,  $\mathrm{Tr}(B^TYY^TB) = 1$
CCA: Solution
• Generalized eigenvalue problem:
   $C_{XY}B = C_{XX}A\Lambda_X$
   $C_{XY}^TA = C_{YY}B\Lambda_Y$
   It can be shown that $\Lambda_X = \Lambda_Y = \Lambda$
   $B = C_{YY}^{-1}C_{XY}^TA\Lambda^{-1}$
   $C_{XY}C_{YY}^{-1}C_{XY}^TA = C_{XX}A\Lambda^2$
   MATLAB function: canoncorr()
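canoncorr() is MATLAB's; a hedged NumPy/SciPy sketch of the same generalized eigenproblem on toy data (the small ridge terms added to the covariances are an assumption for numerical stability):

```python
import numpy as np
from scipy.linalg import eigh

n, d1, d2, k = 500, 40, 30, 5
X = np.random.randn(d1, n); X -= X.mean(axis=1, keepdims=True)
Y = np.random.randn(d2, n); Y -= Y.mean(axis=1, keepdims=True)

Cxx = X @ X.T / n + 1e-6 * np.eye(d1)
Cyy = Y @ Y.T / n + 1e-6 * np.eye(d2)
Cxy = X @ Y.T / n

# C_XY C_YY^{-1} C_XY^T a = lambda^2 C_XX a
M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
lam2, U = eigh(M, Cxx)                                # ascending lambda^2
order = np.argsort(lam2)[::-1][:k]
A = U[:, order].T                                     # k x d1 projection for X
B = (np.linalg.solve(Cyy, Cxy.T) @ U[:, order]).T     # C_YY^{-1} C_XY^T A ...
B /= np.sqrt(np.maximum(lam2[order], 1e-12))[:, None] # ... times Lambda^{-1}

print(A.shape, B.shape)
```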
Data with Multiple Views:
Applications
• Bilingual Document Projections
• Mining Word-level Translations
Data with Multiple Views:
Applications
• Bilingual Document Projections
• Mining Word-level Translations
Bilingual Document Projections
• Training data: $n$ aligned document pairs
• Task: identify aligned document pairs
[Figure: two document collections, one per language, with unknown alignments (??) to be identified]
Bilingual Document Projections
• Applications:
– Comparable and Parallel Document Retrieval
– Cross-language text categorization
• Steps:
1. Represent each document as a vector
• Two different vector spaces, one per language
2. Use CCA to find linear transformations (𝐴, 𝐵)
3. Find new aligned documents using 𝐴 and 𝐵
Bilingual Document Projections
• Steps:
1. Represent each document as a vector
• Vector Space:
– Features: Most frequent 20K content words
– Feature weight: TFIDF weighting
• Training Data:
– $x_i \in R^{d_1}$: bag of English words
– $y_i \in R^{d_2}$: bag of Hindi words
– $(x_i, y_i)$, $i = 1 \cdots n$;  $X = [x_1\ x_2 \cdots x_n]$,  $Y = [y_1\ y_2 \cdots y_n]$
Bilingual Document Projections
• Steps:
1. Represent each document as a vector
2. Use CCA to find linear transformations 𝐴 and 𝐵
[Figure: CCA finds $A$ and $B$ that map the English documents $x_1, x_2, x_3$ and the Hindi documents $y_1, y_2, y_3$ into a shared space where aligned pairs lie close together]
Bilingual Document Projections
• Steps:
1. Represent each document as a vector
2. Use CCA to find linear transformations 𝐴 and 𝐵
3. Find new aligned documents using 𝐴 and 𝐵
• Scoring:
   $\mathrm{Score}(x, y) \leftarrow \cos(Ax, By) = \dfrac{\langle Ax, By \rangle}{|Ax||By|}$
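An illustrative scoring/alignment step (the projections and document vectors below are random stand-ins; in practice A and B come from the CCA step above):

```python
import numpy as np

def score(x, y, A, B):
    ax, by = A @ x, B @ y
    return ax @ by / (np.linalg.norm(ax) * np.linalg.norm(by) + 1e-12)

A = np.random.randn(5, 40); B = np.random.randn(5, 30)   # stand-in projections
new_en = [np.random.rand(40) for _ in range(8)]          # new English documents
new_hi = [np.random.rand(30) for _ in range(8)]          # new Hindi documents

for i, x in enumerate(new_en):
    j = int(np.argmax([score(x, y, A, B) for y in new_hi]))
    print(f"English doc {i} -> Hindi doc {j}")
```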
Bilingual Document Projections :
Results & Discussion
                        Accuracy   MRR
  OPCA                  72.55      77.34
  Word-by-word          70.33      74.67
  CCA                   68.94      73.78
  Word-by-word (5000)   67.86      72.36
  CL-LSI                53.02      61.30
  Untranslated          46.92      53.83
  CPLSA                 45.79      51.30
  JPLSA                 33.22      36.19
Platt et al 2010
Data with Multiple Views:
Applications
• Bilingual Document Projections
• Mining Word-level Translations
Mining Word-Level Translations
• Training Data: Word level seed translations
  English    Spanish     P(s|e)
  state      estado      0.5
  state      declarar    0.3
  society    sociedad    0.4
  society    compañía    0.35
  company    sociedad    0.8
• Task: Mine translations for new words
– Translations of “stability” ?
• Resources: monolingual comparable corpora
Mining Word-Level Translations
• Applications:
– Lexicon induction for resource poor languages
– Mining translations for unknown words in MT
• Steps:
1. Prepare training data of “word pairs”
2. Represent each word as vector
• Two different feature spaces, one per language
3. Use CCA to find transformations 𝐴 and 𝐵
4. Use 𝐴 and 𝐵 to mine new word translations
Mining Word-Level Translations
• Steps:
1. Prepare training data of “word pairs”
• Reduce many-to-many alignments to one-to-one
[Figure: the many-to-many seed alignments among state, society, company and estado, declarar, sociedad, compañía are reduced to the one-to-one pairs state-estado, society-sociedad, company-compañía]
Mining Word-Level Translations
• Steps:
1. Prepare training data of “word pairs”
2. Represent each word as a vector
• Vector Space
– Features: context words (WSM); Orthography
– Feature Weights: TFIDF weights
– Can be computed using ONLY comparable corpora
• $(x_i, y_i)$, $i = 1 \cdots n$;  $X = [x_1\ x_2 \cdots x_n]$;  $Y = [y_1\ y_2 \cdots y_n]$
Mining Word-Level Translations
• Steps:
1. Prepare training data of “word pairs”
2. Represent each word as a vector
3. Use CCA to find transformations 𝐴 and 𝐵
[Figure: CCA finds $A$ and $B$ that map the source-language word vectors $x_1, x_2, x_3$ and the target-language word vectors $y_1, y_2, y_3$ into a shared space where translation pairs lie close together]
Mining Word-Level Translations
• Steps:
  1. Prepare training data of “word pairs”
  2. Represent each word as a vector
  3. Use CCA to find transformations $A$ and $B$
  4. Use $A$ and $B$ to mine new word translations
• Scoring:
   $\mathrm{Score}(e, s) = \cos(Ax_e, By_s) = \dfrac{\langle Ax_e, By_s \rangle}{|Ax_e||By_s|}$
Mining Word-Level Translations :
Results & Discussion
• Seed lexicon size 100
  – Bootstrapping

              p0.1   p0.25   p0.33   p0.50   Best-F₁
  EditDist    58.6   62.6    61.1    –       47.4
  Ortho       76.0   81.3    80.1    52.3    55.0
  Context     91.1   81.3    80.2    65.3    58.0
  Both        87.2   89.7    89.0    89.7    72.0
• Results are lower for other language pairs
Haghighi et al 2008
Mining Word-Level Translations :
Results & Discussion
• Mining translations for unknown words
– OOV words for MT domain adaptation
  MT accuracies (BLEU)
            French                    German
            Baseline   +ve change     Baseline   +ve change
  News      23.00      0.80           27.30      0.36
  Emea      26.62      1.44           40.46      1.51
  Subs      10.26      0.13           16.91      0.61
  PHP       38.67      0.28           28.12      0.68
Daumé and Jagarlamudi 2011
Data with Multiple Views :
Advanced Topics
• Supervised Semantic Indexing
• Discriminative Reranking
• Multilingual Hashing
Supervised Semantic Indexing
• Task: Learn to rank ads 𝑎 for a given doc. 𝑑
• Training Data:
– Pairs of webpages and clicked ads (𝑑, 𝑎+ )
– Randomly chosen pairs 𝑑, 𝑎−
• Steps :
1. Represent an ad 𝑎 and a doc. 𝑑 as vectors
2. Learn scoring function 𝑓(𝑎, 𝑑)
3. Rank ads for a given document
Bai et al 2009
Supervised Semantic Indexing
• Steps :
1. Represent ads and docs. as vectors
• Vector Space
– Bag-of-word representation
• Features: words
• Feature weights: TFIDF weight
– 𝑎 and 𝑑 are vectors of size 𝑉
Supervised Semantic Indexing
• Steps :
1. Represent ads and docs. as vectors
2. Learn scoring function 𝑓(𝑎, 𝑑)
• Scoring function (parameters: a $V \times V$ matrix $W$):
   $f(a, d) = d^TWa$
  – $W = I$: cosine similarity
  – $W = D$ (diagonal): reweighting of words
  – $W = U^TV + I$: dimensionality reduction, different treatment for ads and documents
  – $W = U^TU + I$: dimensionality reduction, SAME treatment for ads and documents
Supervised Semantic Indexing :
Learn Scoring Function
• Max-margin constraint: $f(d, a^+) - f(d, a^-) > 1$
• Objective:
   $\min_W \sum_{(d, a^+, a^-)} \max\big(0,\ 1 - f(d, a^+) + f(d, a^-)\big)$
• Optimized with subgradient descent
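A rough sketch of one subgradient step on this hinge loss (the low-rank parametrization W = UᵀV + I, the toy sizes, and the learning rate are assumptions):

```python
import numpy as np

V_size, k, lr = 1000, 50, 0.01
U = 0.01 * np.random.randn(k, V_size)
Vmat = 0.01 * np.random.randn(k, V_size)

def f(d, a):
    return d @ (U.T @ (Vmat @ a)) + d @ a      # d^T (U^T V + I) a

d, a_pos, a_neg = (np.random.rand(V_size) for _ in range(3))
loss = max(0.0, 1.0 - f(d, a_pos) + f(d, a_neg))
if loss > 0:
    grad_U = np.outer(Vmat @ (a_neg - a_pos), d)   # d(loss)/dU
    grad_V = np.outer(U @ d, a_neg - a_pos)        # d(loss)/dV
    U -= lr * grad_U
    Vmat -= lr * grad_V
print(loss)
```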
Supervised Semantic Indexing
• Steps:
1. Represent ads and docs. as vectors
2. Learn scoring function 𝑓(𝑎, 𝑑)
3. Rank ads for a given document
• Ranking Ads
– Compute score using 𝑓 𝑎, 𝑑 and rank
Supervised Semantic Indexing :
Results & Discussion
• 1.9 M pairs for training
• 100K pairs for testing
                                Parameters   Rank Loss
  TFIDF                         –            45.60
  SSI: $W = U^TV_{10k} + I$     50×10k       25.83
  SSI: $W = U^TV_{20k} + I$     50×20k       26.68
  SSI: $W = U^TV_{30k} + I$     50×30k       26.98
Bai et al 2009
Supervised Semantic Indexing :
Results & Discussion
• Ranking wikipedia pages for queries
– Rank Loss

                                    K=5    K=10   K=20
  TFIDF                             21.6   14.0   9.14
  $\alpha$LSI + $(1-\alpha)$TFIDF   14.2   9.73   6.36
  SSI: $W = U^TU_{30k} + I$         4.80   3.10   1.87
  SSI: $W = U^TV_{30k} + I$         4.37   2.91   1.80
• Performs better when training data is big
Bai et al 2009
Data with Multiple Views :
Advanced Topics
• Supervised Semantic Indexing
• Discriminative Reranking
• Multilingual Hashing
Discriminative Reranking
• Input sentence $x_i$:  Buyers stepped in to the futures pit .
• Reference output $y_i$:  NNS VBD IN TO DT NNS NN .
• Candidate outputs $y_{ij}$, $j = 1 \ldots m_i$, with scores from the tagger of [Thede & Harper, 99] and their losses:

  Score     Candidate                      Loss
  -0.1947   NNS VBD RP TO DT NNS NN .      0.12
  -6.8068   NNS VBD RB TO DT NNS NN .      0.12
  -7.0514   NNS VBD IN TO DT NNS NN .      0
  -7.1408   NNS VBD RP TO DT NNS VB .      0.25
  -13.752   NNS VBD RB TO DT NNS VB .      0.25

• Approach
  – Find a subspace that respects the preferences
  – Features $\phi(x_i, y_{ij})$ are independent
Discriminative Reranking
[Figure: input features $x \in R^{d_1}$ and output features $y \in R^{d_2}$; their outer product gives a vector of length $d_1 \times d_2$, $\langle x_{i1}y_{i1}, x_{i1}y_{i2}, \cdots, x_{ij}y_{i1}, x_{ij}y_{i2}, \cdots \rangle$, scored by weights $\langle w_{11}, w_{12}, \cdots, w_{j1}, w_{j2}, \cdots \rangle$ with $w_{ij} = a_ib_j$]
• The reranker operates in the outer product space
  – [Szedmak et al., 2006; Wang et al., 2007]
• The weight vector is constrained [Bai et al. 10]
Low-Dimensional Reranking
• Find $A$ and $B$ such that $\arg\max_{y_j} \cos(A^Tx, B^Ty_j) \equiv y$
[Figure: $A$ and $B$ project $x$ and the candidates $y_1, y_2, y_3$ into a common space where the correct output $y$ lies closest to $x$]
Low-Dimensional Reranking
• Find $A$ and $B$ such that $\arg\max_{y_j} \cos(A^Tx, B^Ty_j) \equiv y$
• Idea:
  1. Score: $a^Tx_iy_j^Tb$
  2. Add constraints to penalize incorrect candidates
     – $\mathrm{score}(x_i, y_i) \geq \mathrm{score}(x_i, y_{ij}) + 1 - \frac{\xi_i}{L_{ij}}$
     – $m_{ij} \geq 1 - \frac{\xi_i}{L_{ij}}$   [Tsochantaridis et al. 04]
Low-Dimensional Reranking
Discriminative:
   $\arg\max_{a,b,\xi \geq 0}\ \frac{1-\lambda}{\lambda}\, a^TXY^Tb - \sum_i \xi_i$
   s.t. $a^TXX^Ta = 1$ and $b^TYY^Tb = 1$,  $m_{ij} \geq 1 - \frac{\xi_i}{L_{ij}}$
Softened-Disc:
   $\arg\max_{a,b}\ (1-\lambda)\, a^TXY^Tb + \lambda \sum_{ij} L_{ij}m_{ij}$
   s.t. length constraints
Discriminative Model
   $\alpha_{ij} = L_{ij}$                                            // Initialization
   Repeat
     $(A^{(i)}, B^{(i)}) \leftarrow$ Softened-Disc$(X, Y, \alpha_{ij})$   // Get the current solution
     $m_{ij} = a^Tx_iy_i^Tb - a^Tx_iy_{ij}^Tb$                       // Compute margins
     $\psi_{ij} = (1 - m_{ij})L_{ij}$                                // Potential slack
     $\xi_i = \min\{0, \psi_{ij}\ \text{s.t.}\ \psi_{ij} > 0\}$      // Compute slack
     If $\xi_i > 0$
       $d_{ij} = m_{ij} - \left(1 - \frac{\xi_i}{L_{ij}}\right)$
       $\alpha_{ij} \leftarrow \alpha_{ij} - \gamma d_{ij}$          // Update the Lagrangian variables
     End                                                             // (otherwise the slack doesn't change)
   Until convergence
POS Tagging
• Combine with the Viterbi score
  – The interpolation parameter is tuned
• Training
  – Input sentence and reference tag sequences
  – Candidates, score and loss values

  Buyers stepped in to the futures pit .

  Score      Candidate
  -0.1947    NNS VBD RP TO DT NNS NN .
  -6.8068    NNS VBD RB TO DT NNS NN .
  -7.0514    NNS VBD IN TO DT NNS NN .
  -7.1408    NNS VBD RP TO DT NNS VB .
  -13.752    NNS VBD RB TO DT NNS VB .

• Testing
POS Tagging
• Combine with Viterbi score
– Interpolation parameter is tuned
• Data Statistics
• Results
                English   Chinese   French   Swedish
  Baseline      96.15     92.31     97.41    93.23
  Collins       96.06     92.81     97.35    93.44
  Regularized   96.00     92.88     97.38    93.35
  Oracle        98.39     98.19     99.00    96.48
POS Tagging
• Combine with Viterbi score
– Interpolation parameter is tuned
• Data Statistics
• Results
                   English   Chinese   French   Swedish
  Baseline         96.15     92.31     97.41    93.23
  Collins          96.06     92.81     97.35    93.44
  Regularized      96.00     92.88     97.38    93.35
  Softened-Disc    96.32     92.87     97.53    93.24
  Discriminative   96.3      92.91     97.53    93.36
  Oracle           98.39     98.19     99.00    96.48
POS Tagging
• Results continued …
                    English   Chinese   French   Swedish
  Softened-Disc     +0.17     +0.56     +0.12    +0.01
  Discriminative    +0.15     +0.6      +0.12    +0.13
  Softened-Disc*    +0.92     +4.31     +1.12    +0.08
  Discriminative*   +0.88     +4.77     +0.9     +0.73
• Interpolation with Viterbi score is crucial
• Softened-Disc
– Independent of no. training examples
– Easy to code and can be solved exactly
Jagarlamudi and Daumé 2012
Data with Multiple Views :
Advanced Topics
• Supervised Semantic Indexing
• Discriminative Reranking
• Multilingual Hashing
Motivation: Fuzzy Name Search
[Figure: a misspelled English query "Lusie Wanderbendie" searched against an English employee directory (Aakash Anand, …, Lucy Vanderwende, Lucia Vanderwend, …, Zi Zhou); the intended entry "Lucy Vanderwende" should be retrieved]
Motivation: Multilingual Name Search
[Figure: a Kannada query ಲೂಸಿ ವ ಾಂಡರ್‍ವ ಾಂಡಿ searched against the same English employee directory; the entry "Lucy Vanderwende" should be retrieved]
Similarity Search: Challenges
• Computing nearest neighbors in high
dimensions using geometric search
techniques is very difficult
– All methods are as bad as brute-force linear search, which is expensive
– Approximate techniques such as ANN perform
efficiently in dimensions as high as 20; in higher
dimensions, the results are rather spotty
• Need to do search on commodity hardware
• Cross-language search
Multilingual Hashing for Similarity Search
[Figure: names mapped to language-independent hash codes, e.g. Lusie → 1 0 1 1 1 1 1 1, Lucy → 1 0 1 0 1 1 1 1, Cynthia → 0 1 1 1 0 1 0 0; similar names receive similar codes]
Multilingual Hashing for Similarity Search
[Figure: the same idea across scripts, e.g. ಲೂಸಿ → 1 0 1 1 1 1 1 1, Lucy → 1 0 1 0 1 1 1 1, ಸಿಾಂಥಿಯ → 0 1 1 1 0 1 0 0; similar names receive similar language-independent codes]
Search Overview
Query: ಲೂಸಿ  →  hash code 1 0 1 1 1 1 1 1

  Name      Hash code          Hamming distance
  Aaron     0 0 0 0 0 0 0 1    6
  Bharat    0 0 0 1 1 0 0 0    5
  Cecile    0 0 0 1 0 0 1 1    4
  David     0 0 1 0 1 0 1 1    3
  Michael   0 0 1 1 0 1 0 1    3
  Sanjay    0 1 0 0 0 1 1 0    6
  Stuart    0 1 1 1 1 1 1 1    2
  Daniel    0 1 1 0 0 0 1 0    6
  Rashmi    1 0 0 0 0 0 0 1    5
  Albert    1 0 0 0 0 1 0 1    4
  Lucy      1 0 1 1 0 1 1 1    1
  Kumar     1 1 0 0 1 0 1 0    5
What is the advantage?
• Scales easily to very large databases
• Compact language-independent representation
• 32 bits per object
• Search is effective and efficient
• Hamming nearest-neighbor search
• Few milliseconds per query for searching a
million objects (single thread on a single
processor)
What is the challenge?
• Language/script independent hash codes
• Learning hash functions from training data
Hash Functions for Multilingual People Search
[Figure: training data of parallel names in English and Kannada (Aaron/ಆರನ್, Bharat/ಭರತ್, Rick/ರಿಕ್, David/ಡ ೇವಿಡ್, Michael/ಮೈಕ ಲ್, Sanjay/ಸಂಜಯ, Stuart/ಸಟೂವರ್ಟ್, Daniel/ಡ ೇನಿಯಲ್, Rashmi/ರಶ್ಮಿ, Albert/ಆಲ್ಬರ್ಟ್, Rashid/ರಶ್ಮೇದ್, Kumar/ಕುಮಾರ್). The learned hash functions g (English) and h (Kannada) should give similar codes to parallel names, e.g. g(Rick) = 10101100, h(ರಿಕ್) = 10101100; g(Rashid) = 10111111, h(ರಶ್ಮೇದ್) = 11111111.]
Parallel Names ⇒ Similar Hash Codes
Feature Vectors
• Character bigram features
[Figure: a name is represented as a binary vector over character bigrams, e.g. for "Rashid" the bigrams ^R, Ra, as, sh, hi, id, d$ are 1 and other bigrams (ic, …) are 0; its Kannada form ರಶ್ಮೇದ್ is represented the same way over Kannada character bigrams (^ರ, ರಶ, …)]
1-Bit Hash Function
• Linear projection followed by thresholding at 0
[Figure: names projected onto a line; those falling below 0 get -1 (e.g. Aaron, Rick), those above 0 get +1 (e.g. Bharat, Rashid, Kumar)]
K-Bit Hash Functions
• Composed of K 1-bit hash functions.
Learning Hash Functions
[Objective (equations not recovered from the slides): hash codes of parallel names should agree, which is equivalent to minimizing their Hamming distance, subject to the bits being uncorrelated and each bit being +1 for 50% of the names and -1 for the other 50%]
Learning Hash Functions (contd.)
[After a linear relaxation of the binary constraints, the problem reduces to Canonical Correlation Analysis (Hotelling, 1936)]
Learning Hash Functions: Summary
• Given a set of parallel names as training data, find
the top K projection vectors for each language
using Canonical Correlation Analysis.
• Each projection vector gives a 1-bit hash function.
• The hash code for a name is computed by projecting its feature vector onto the projection vectors, followed by binarization.
Udupa & Kumar, 2010
Fuzzy Name Search: Experimental Setup
•Test Sets:
•DUMBTIONARY
•1231 misspelled names
•INTRANET
•200 misspelled names
•Name Directories:
•DUMBTIONARY
•550K names from Wikipedia
•INTRANET
•150K employee names
•Training Data:
•15K pairs of single token names in English and Hindi
•Baselines:
•Two popular search engines, Double Metaphone, BM25
Fuzzy Name Search Results on DUMBTIONARY
• Precision@1

  System    Precision@1
  S1        86.12
  S2        79.33
  DM        78.95
  BM25      84.70
  M-Hash    87.93
  B-Hash    92.53
Fuzzy Name Search Results on INTRANET
• Precision@1

  System    Precision@1
  DM        54.00
  BM25      56.92
  M-Hash    70.65
  B-Hash    77.79
Multilingual: Experimental Setup
•Test Sets
• 1000 multi-word names each in Russian, Hebrew,
Kannada, Tamil, Hindi
•Name Directory:
•English Wikipedia Titles
•6 Million Titles, 2 Million Unique Words
•Baseline:
•State-of-the-art Machine Transliteration (NEWS
2009)
Multilingual: Experimental Results
• Precision@1

  Algorithm        Russian   Hebrew   Kannada   Tamil   Hindi
  Transliteration  0.48      –        0.52      0.29    0.49
  B-Hash           0.67      0.69     0.68      0.68    0.69
Road Map
• Introduction
• NLP and Dimensionality Reduction
• Mathematical Background
• Data with Single View
• Techniques
• Applications
• Advanced Topics
• Data with Multiple Views
• Techniques
• Applications
• Advanced Topics
• Summary
Summary
Dimensionality Reduction
[Chart: number of dimensionality reduction papers per year, 1990-2010, in Vision vs. NLP]
Dimensionality Reduction
[Chart: popularity of dimensionality reduction relative to Bayesian approaches, 1990-2010, in Vision vs. NLP]
Summary
• Dimensionality reduction has merits for NLP
– Computational and Feature correlations
• Has been explored in an unsupervised fashion
  – But there are recent novel developments
  • For multi-view data
• If you can formulate your problem as a mapping
  – Try dimensionality reduction
  – Can solve for the global optimum
Summary
• Spectral Learning
  – Provides a way to learn the global optimum for generative models
• Enriching the existing models
– Using word embeddings instead of words
• Scalability of the techniques
– Doesn’t depend on the number of examples
– Large scale SVD
References
• Hotelling, Harold, 1933. Analysis of a Complex of Statistical Variables into Principal Components. Journal of Educational Psychology, 24(6 & 7), 417–441 & 498–520.
• Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. (2007). Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press.
• John C. Platt, Kristina Toutanova, and Wen-tau Yih. 2010. Translingual document representations from discriminative projections. In EMNLP '10, pages 251–261.
• Hyvärinen, A.; Oja, E. Independent component analysis: algorithms and applications. Journal of Neural Networks, Volume 13, Issues 4-5, May-June 2000, pages 411–430.
• Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. "Latent Dirichlet allocation". Journal of Machine Learning Research 3 (4–5), 2003: pp. 993–1022.
• S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, Vol. 290, 22 December 2000, 2323–2326.
• Schütze, H. 1998. Automatic word sense discrimination. Computational Linguistics 24(1), 97–124.
• Hinrich Schütze. 1995. Distributional part-of-speech tagging. In EACL 7, pages 141–148.
• Michael Lamar, Yariv Maron, Mark Johnson, Elie Bienenstock. SVD and clustering for unsupervised POS tagging. In ACL 2010, pages 215–219, July 11-16.
• Deerwester, S., et al. Improving Information Retrieval with Latent Semantic Indexing. Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36–40.
• Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188.
• Thomas Hofmann. Probabilistic Latent Semantic Indexing. In SIGIR 1999.
References
• J. B. Tenenbaum, Vin de Silva, and John C. Langford. "A Global Geometric Framework for Nonlinear Dimensionality Reduction." Science, Vol. 290, 22 December 2000.
• Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373–1396.
• Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research.
• Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3, March 2003.
• Joseph Turian, Lev Ratinov, Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In ACL 2010, pages 384–394, July 11-16, 2010.
• R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 2011.
• Harold Hotelling. 1936. Relation between two sets of variables. Biometrika, 28:322–377.
• Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In ACL, pages 771–779.
• Hal Daumé III, Jagadeesh Jagarlamudi. Domain adaptation for machine translation by mining unseen words. In ACL 2011, June 19-24.
• Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Olivier Chapelle, Kilian Weinberger. Supervised semantic indexing. In CIKM 2009, November 2-6, Hong Kong, China.
References
• Jagadeesh Jagarlamudi, Hal Daumé III. Low-Dimensional Discriminative Reranking. In HLT-NAACL 2012.
• Shaishav Kumar and Raghavendra Udupa. Learning Hash Functions for Cross-View Similarity Search. In IJCAI 2011, 20 July 2011.
• Raghavendra Udupa and Shaishav Kumar. Hashing-based Approaches to Spelling Correction of Personal Names. In Proceedings of EMNLP 2010, October 2010.
• Raghavendra Udupa and Mitesh Khapra. Transliteration Equivalence using Canonical Correlation Analysis. In ECIR 2010.
• Jagadeesh Jagarlamudi, Hal Daumé III. Regularized Interlingual Projections: Evaluation on Multilingual Transliteration. In Proceedings of EMNLP-CoNLL 2012.