TANA07: Data Mining using Matrix Methods
Text mining – Information retrieval
Lars Eldén and Berkant Savas
Department of Mathematics
Linköping University, Sweden
2012
Copyright Lars Eldén, Berkant Savas 2012
Contents
1. Information retrieval
2. Latent semantic indexing

Reference text: M. Berry and M. Browne, Understanding Search
Engines: Mathematical Modeling and Text Retrieval, second
edition, SIAM, 2005.
Programming environment: the “Text-mining toolbox” has
routines for LSI and cluster-based information retrieval.
Example of Software I
SAS Text Miner provides full text preprocessing within the
powerful, easy-to-use process flow environment of Enterprise
Miner. This enables users to enrich the overall data mining process
by integrating unstructured textual data with existing structured
data such as age, income and purchasing patterns. Features include:
- Access to numerous textual formats including PDF, extended ASCII, HTML, ...
- Numerous text preprocessing methods such as stemming, noun group extraction,
  user-defined synonym lists, multiword tokens and part-of-speech tagging.
- Extensive feature extraction capabilities with broad customizable data
  dictionaries.
- Singular value decomposition for dimension reduction.
- Unique clustering algorithms.
http://www.sas.com/technologies/analytics/datamining/textminer
TMG http://scgroup20.ceid.upatras.gr:8000/tmg/
Example
Search in the LiU library journal catalogue:

Search phrase                    Result
computer science engineering     Nothing found
computing science engineering    IEEE: Computing in Science and Engineering

Straightforward word matching is not good enough!
Information Retrieval
Database of documents: for a query with a set of keywords, find
all documents that are relevant.
Applications: databases of scientific abstracts, web search
engines.
SMART: System for the Mechanical Analysis and Retrieval of Text,
Gerard Salton, 1983.
Vector space model for information retrieval
Vector space IR model
Term-by-document matrix:

           Doc 1   Doc 2   Doc 3   Doc 4   Query
  Term 1     1       0       1       0       1
  Term 2     0       0       1       1       1
  Term 3     0       1       1       0       0
The documents and the query are represented by vectors in R^m
(here m = 3).
Is the query close to some document vectors? Use a distance
measure in R^m.
In applications m ≈ 10^6 is common.
Use linear algebra methods (e.g. the SVD) for data compression and
retrieval enhancement.
Latent Semantic Indexing (LSI)
More than literal matching: concept-based modelling
1. Document file preparation:
   1. Indexing: collect terms
   2. Use a stop list: eliminate “meaningless” words
   3. Stemming
2. Constructing the term-by-document matrix; sparse matrix
   storage
3. Query matching: distance measures
4. Data compression by low-rank approximation: SVD
5. Ranking and relevance feedback
Stop List I
Eliminate words that occur in “all documents”:
We consider the computation of an eigenvalue and
corresponding eigenvector of a Hermitian positive definite
matrix A ∈ Cn×n , assuming that good approximations of the
wanted eigenpair are already available, as may be the case in
applications such as structural mechanics. We analyze
efficient implementations of inexact Rayleigh quotient–type
methods, which involve the approximate solution of a linear
system at each iteration by means of the Conjugate Residuals
method.
ftp://ftp.cs.cornell.edu/pub/smart/english.stop
Stop list II
Beginning of the Cornell list:
a, a’s, able, about, above, according, accordingly, across, actually,
after, afterwards, again, against, ain’t, all, allow, allows, almost,
alone, along, already, also, although, always, am, among, amongst,
an, and, another, any, anybody, anyhow, anyone, anything,
anyway, anyways, anywhere, apart, appear, appreciate, appropriate,
are, aren’t, around, as, aside, ask, asking, associated, at, available,
away, awfully, b, be, became, because, become, becomes,
becoming, been, before, beforehand, behind, being, believe, below,
beside, besides, best, better, between, beyond, both, brief, but, by,
c, c’mon, c’s, came, can,...
Stemming

computable, computation, computing, computational =⇒ comput
http://www.comp.lancs.ac.uk/computing/research/stemming/index.htm
http://www.tartarus.org/~martin/PorterStemmer/
http://snowball.tartarus.org/algorithms/swedish/stemmer.html
Query
Queries are also preprocessed =⇒ natural language queries
I want to know how to compute singular values of data
matrices, especially such that are large and sparse.
becomes
comput singular value data matri large sparse
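A minimal Python sketch of this preprocessing step. The tiny stop list and the suffix rules below are illustrative stand-ins for the Cornell stop list and the Porter stemmer, so the output only approximates the stemmed query shown above:

```python
# Sketch of query preprocessing: lowercasing, stop-word removal, and crude
# suffix stripping. The stop list and suffix rules are illustrative only.

STOP_WORDS = {"i", "want", "to", "know", "how", "of", "such", "that",
              "are", "and", "especially", "the", "a"}
SUFFIXES = ("ational", "ation", "ing", "es", "e", "s")  # try longest first

def crude_stem(word):
    """Strip the first matching suffix (a toy stand-in for Porter stemming)."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(query):
    """Lowercase, tokenize, drop stop words, and stem the remaining terms."""
    words = query.lower().replace(",", " ").replace(".", " ").split()
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

query = ("I want to know how to compute singular values of data "
         "matrices, especially such that are large and sparse.")
tokens = preprocess(query)
print(" ".join(tokens))
```

A real system would use the Porter (or Snowball) stemmer, whose rules differ from the crude suffix stripping used here.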
Example
Example (Porter stemmer)
2. changes of the nucleic acid and phospholipid levels of the livers in the
course of fetal and postnatal development . we have followed the
evolution of dna, rna and pl in the livers of rat foeti removed between the
fifteenth and the twenty-first day of gestation and of young rats
newly-born or at weaning . we can observe the following facts.. 1. dna
concentration is 1100 ug p on the 15th day, it decreases from the 19th
day until it reaches a value of 280 ug 5 days after weaning .
becomes
2. chang of the nucleic acid and phospholipid level of the liver in the
cours of fetal and postnat develop . we have follow the evolut of dna, rna
and pl in the liver of rat foeti remov between the fifteenth and the
twenti-first dai of gestat and of young rat newli-born or at wean . we can
observ the follow fact.. 1. dna concentr is 1100 ug p on the 15th dai, it
decreas from the 19th dai until it reach a valu of 280 ug 5 dai after wean
.
Inverted File Structures
Document file: Each document has a number and all terms
are identified
Dictionary: Sorted list of all unique terms
Inversion List: Pointers from a term to the documents that
contain that term (column index for non-zeros in a row of the
matrix).
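The three structures can be sketched in a few lines of Python; the toy documents below are hypothetical and assumed to be already indexed and stemmed:

```python
# Sketch of an inverted file structure: a sorted dictionary of unique terms,
# and for each term an inversion list of the documents containing it
# (equivalently, the column indices of the non-zeros in that term's row).

from collections import defaultdict

documents = {
    1: ["google", "matrix", "internet"],
    2: ["link", "web", "page"],
    3: ["google", "matrix", "rank", "web", "page"],
}

inverted = defaultdict(list)
for doc_id, terms in sorted(documents.items()):
    for term in set(terms):          # each term counted once per document
        inverted[term].append(doc_id)

dictionary = sorted(inverted)        # sorted list of all unique terms
print(dictionary)
print(inverted["matrix"])            # → [1, 3]
```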
Example
Document 1: The Google matrix P is a model of the Internet.
Document 2: Pij is nonzero if there is a link from web page j to i.
Document 3: The Google matrix is used to rank all web pages.
Document 4: The ranking is done by solving a matrix eigenvalue problem.
Document 5: England dropped out of the top 10 in the FIFA ranking.
Term–document matrix I
Count the frequency of terms in each document:
Term         Doc 1   Doc 2   Doc 3   Doc 4   Doc 5
eigenvalue     0       0       0       1       0
England        0       0       0       0       1
FIFA           0       0       0       0       1
Google         1       0       1       0       0
Internet       1       0       0       0       0
link           0       1       0       0       0
matrix         1       0       1       1       0
page           0       1       1       0       0
rank           0       0       1       1       1
web            0       1       1       0       0
Term–Document Matrix II
        [0  0  0  1  0]
        [0  0  0  0  1]
        [0  0  0  0  1]
        [1  0  1  0  0]
        [1  0  0  0  0]
    A = [0  1  0  0  0]  ∈ R^(m×n)   (here m = 10, n = 5)
        [1  0  1  1  0]
        [0  1  1  0  0]
        [0  0  1  1  1]
        [0  1  1  0  0]

a_ij is the weighted frequency of term i in document j.
Simple Query Matching
Given a query vector q, find the columns a_j of A which have
dist(q, a_j) ≤ tol.
Common distance measures:
  angle:               dist(x, y) = arccos(x^T y)   (note: ||x||_2 = ||y||_2 = 1)
  Euclidean distance:  dist(x, y) = ||x − y||_2
Query
Query: “ranking of web pages”.
Query vector:

  q = (0, 0, 0, 0, 0, 0, 0, 1, 1, 1)^T ∈ R^10.
Cosine measure
Cosine measure: cos θ(x, y) = x^T y / (||x||_2 ||y||_2) ≥ tol.
Cosines:
  0   0.6667   0.7746   0.3333   0.3333
With a cosine threshold of tol = 0.4, documents 2 and 3 would be
returned.
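These cosines can be reproduced with a short NumPy sketch (Python here stands in for the course's Matlab); A and q are the Google example from the previous slides:

```python
# Query matching in the vector space model: cosine between the query vector
# and each column of the term-document matrix.
import numpy as np

# Rows: eigenvalue, England, FIFA, Google, Internet, link, matrix, page, rank, web
A = np.array([
    [0, 0, 0, 1, 0],   # eigenvalue
    [0, 0, 0, 0, 1],   # England
    [0, 0, 0, 0, 1],   # FIFA
    [1, 0, 1, 0, 0],   # Google
    [1, 0, 0, 0, 0],   # Internet
    [0, 1, 0, 0, 0],   # link
    [1, 0, 1, 1, 0],   # matrix
    [0, 1, 1, 0, 0],   # page
    [0, 0, 1, 1, 1],   # rank
    [0, 1, 1, 0, 0],   # web
], dtype=float)

q = np.zeros(10)
q[[7, 8, 9]] = 1       # "ranking of web pages": page, rank, web

cosines = (q @ A) / (np.linalg.norm(q) * np.linalg.norm(A, axis=0))
print(np.round(cosines, 4))   # → [0. 0.6667 0.7746 0.3333 0.3333]

tol = 0.4
print(np.where(cosines >= tol)[0] + 1)   # documents 2 and 3
```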
Sparse Matrix
Figure: The first 500 rows and columns of the Medline matrix. Each dot
represents a non-zero element (nz = 2676).
Sparse Matrix Storage
Typically only 1% of the matrix elements are non-zero: the matrix is
sparse.
Example:

        [0.6667   0        0        0.2887]
    A = [0        0.7071   0.4082   0.2887]
        [0.3333   0        0.4082   0.2887]
        [0.6667   0        0        0     ]

Compressed row storage:

  val      0.666  0.288  0.707  0.408  0.288  0.333  0.408  0.288  0.666
  col-ind  1      4      2      3      4      1      3      4      1
  row-ptr  1      3      6      9

Compressed column storage: analogous.
Sparse matrices are automatic in Matlab.
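The CRS arrays above can be built as sketched below, using 1-based indices as on the slide; in practice one would use a library such as scipy.sparse.csr_matrix, which stores the same three arrays with 0-based indices:

```python
# Sketch of compressed row storage (CRS): the non-zero values row by row,
# their (1-based) column indices, and a pointer to where each row starts.
import numpy as np

A = np.array([
    [0.6667, 0.0,    0.0,    0.2887],
    [0.0,    0.7071, 0.4082, 0.2887],
    [0.3333, 0.0,    0.4082, 0.2887],
    [0.6667, 0.0,    0.0,    0.0   ],
])

val, col_ind, row_ptr = [], [], []
for i in range(A.shape[0]):
    row_ptr.append(len(val) + 1)        # 1-based start of row i in val
    for j in range(A.shape[1]):
        if A[i, j] != 0.0:
            val.append(A[i, j])
            col_ind.append(j + 1)       # 1-based column index

print(col_ind)   # → [1, 4, 2, 3, 4, 1, 3, 4, 1]
print(row_ptr)   # → [1, 3, 6, 9]
```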
Performance Evaluation
Precision:
  P = D_r / D_t,
D_r: number of relevant documents retrieved
D_t: total number of documents retrieved
Recall:
  R = D_r / N_r,
N_r: total number of relevant documents in the database
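A sketch of the two measures with hypothetical retrieved and relevant document sets:

```python
# Precision P = Dr/Dt and recall R = Dr/Nr for made-up document sets.

retrieved = {2, 3, 5, 8}          # documents returned by the query
relevant = {3, 5, 8, 11, 17}      # all relevant documents in the database

Dr = len(retrieved & relevant)    # relevant documents retrieved
Dt = len(retrieved)               # total documents retrieved
Nr = len(relevant)                # total relevant documents in the database

P = Dr / Dt
R = Dr / Nr
print(P, R)   # → 0.75 0.6
```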
Performance evaluation II
[Venn diagram: returned documents and relevant documents as overlapping
subsets of all documents]
Figure: Returned and relevant documents for two values of the tolerance:
The dashed circle represents the retrieved documents for a high value of
the cosine tolerance.
Performance evaluation III
Cosine measure: cos θ(x, y) = x^T y / (||x||_2 ||y||_2) ≥ tol.
Large value of tol (≤ 1): high precision, but low recall.
Small value of tol: high recall, but low precision.
Our test query
Query Q9 in the Medline collection:
9. the use of induced hypothermia in heart surgery,
neurosurgery, head injuries and infectious diseases.
Precision-recall graph
Query matching for Q9 in the Medline collection (stemmed) using
the cosine measure
[Figure: precision (%) plotted against recall (%)]
Singular Value Decomposition (SVD)
A ∈ R^(m×n), m ≥ n:

  A = U [Σ; 0] V^T,   U ∈ R^(m×m), Σ ∈ R^(n×n), V ∈ R^(n×n),

where U and V are orthogonal and Σ is diagonal:

  Σ = diag(σ1, σ2, ..., σn),
  σ1 ≥ σ2 ≥ ··· ≥ σr > σ(r+1) = ··· = σn = 0,

rank(A) = r.
Sum of rank-1 matrices: A = U [Σ; 0] V^T = Σ_(i=1..n) σi ui vi^T.
Singular values of A
Google example: A has rank 5 (full column rank).
[Figure: the five singular values of A, plotted in decreasing order]
Singular values of Medline matrix
[Figure: the first 100 singular values of the Medline matrix]
Matrix approximation by SVD
Define the Frobenius matrix norm ||A||_F = (Σ_(i,j) a_ij^2)^(1/2), and
assume k ≤ r.

Theorem (Eckart–Young 1936)
The approximation problem

  min_(rank(Z)=k) ||A − Z||_F

has the solution

  Z = Σ_(i=1..k) σi ui vi^T = U_k Σ_k V_k^T
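The theorem can be checked numerically with a NumPy sketch on a random matrix: the best rank-k approximation is built from the k largest singular triplets, and the Frobenius error equals the norm of the discarded singular values, (σ(k+1)^2 + ··· + σr^2)^(1/2):

```python
# Best rank-k approximation via the SVD, and its Frobenius-norm error.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 5))   # a random test matrix
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Z = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # Z = U_k Sigma_k V_k^T

err = np.linalg.norm(A - Z, "fro")
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))   # → True
```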
Solution
Z = Σ_(i=1..k) σi ui vi^T = U_k Σ_k V_k^T,  where

U_k = (u1, ..., uk),  Σ_k = diag(σ1, ..., σk),  V_k = (v1, ..., vk)
First Singular Vectors: Medline
find(abs(U(:,k))>0.13)
Look-up in the dictionary of terms
U(:,1): cell, growth, hormon, patient
U(:,2): case, cell, children, defect, dna, growth, patient, ventricular
Basis Vectors
Use the columns of Uk as new basis vectors in the document space.
Express all the documents in terms of the new basis:
  min_D ||A − U_k D||_F

Least squares problem, with solution

  D = (U_k^T U_k)^(−1) U_k^T A = U_k^T A
    = U_k^T Σ_(i=1..n) σi ui vi^T
    = Σ_(i=1..k) σi ei vi^T = Σ_k V_k^T =: D_k

D_k = U_k^T A is the projection of A onto the subspace spanned by U_k.
Column j of D_k holds the coordinates of document j in the new
basis.
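A NumPy sketch of this identity on a random matrix: the least squares solution D_k = U_k^T A coincides with Σ_k V_k^T, so column j of D_k holds the coordinates of document j in the basis u1, ..., uk:

```python
# Check that the projected coordinates U_k^T A equal Sigma_k V_k^T.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 5))   # a random test matrix
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

Dk = Uk.T @ A                      # projection of A onto span(U_k)
print(np.allclose(Dk, Sk @ Vtk))   # → True
```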
SVD – PCA
Data compression by the SVD ≡ principal component analysis:

  A ≈ A_k = U_k (Σ_k V_k^T) =: U_k D_k

D_k holds the coordinates of all documents in terms of the first k
singular vectors u1, ..., uk (principal components).
Coordinates
D_k holds the coordinates of all documents in terms of the first k
singular vectors u1, ..., uk (principal components). Here k = 2:

[Figure: the five documents and the query q plotted in the (u1, u2)
coordinate plane]
Query: “ranking of web pages”.
Query Matching after Data Compression I
Data compression: represent the term-document matrix by
A_k = U_k D_k.
Compute q^T A_k = q^T U_k D_k = (U_k^T q)^T D_k.
Project the query: q_k := U_k^T q.
Cosines:

  cos θ_j = q_k^T (D_k e_j) / (||q_k||_2 ||D_k e_j||_2)

Cosines for query and original data:
  0   0.6667   0.7746   0.3333   0.3333
After projection onto the two-dimensional subspace:
  0.7857   0.8332   0.9670   0.4873   0.1819
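This computation can be sketched in NumPy on the Google example, with k = 2. The signs of the computed singular vectors may differ from those shown on the slides, but the cosines are unaffected, since flipping u_i flips the corresponding row of D_k as well:

```python
# Query matching after SVD compression: project the query, q_k = U_k^T q,
# and compute cosines against the columns of D_k = Sigma_k V_k^T.
import numpy as np

# Rows: eigenvalue, England, FIFA, Google, Internet, link, matrix, page, rank, web
A = np.array([
    [0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1], [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [1, 0, 1, 1, 0], [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1], [0, 1, 1, 0, 0],
], dtype=float)
q = np.zeros(10)
q[[7, 8, 9]] = 1              # "ranking of web pages": page, rank, web

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]
Dk = np.diag(s[:k]) @ Vt[:k, :]

qk = Uk.T @ q                 # projected query
cosines = (qk @ Dk) / (np.linalg.norm(qk) * np.linalg.norm(Dk, axis=0))
print(np.round(cosines, 4))
```

With the data above this should reproduce the projected cosines quoted on the slide, with document 3 scoring highest.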
LSI (SVD compression) enhances retrieval quality
Document 1: The Google matrix P is a model of the Internet.
Document 2: Pij is nonzero if there is a link from web page j to i.
Document 3: The Google matrix is used to rank all web pages.
Document 4: The ranking is done by solving a matrix eigenvalue problem.
Document 5: England dropped out of the top 10 in the FIFA ranking.
First Two Singular Vectors

u1 = (0.1425, 0.0787, 0.0787, 0.3924, 0.1297, 0.1020, 0.5348,
      0.3647, 0.4838, 0.3647)^T,

u2 = (0.2430, 0.2607, 0.2607, −0.0274, 0.0740, −0.3735, 0.2156,
      −0.4749, 0.4023, −0.4749)^T
Recall vs. precision for Medline, Q9
[Figure: precision (%) plotted against recall (%)]

Full vector space model (solid line), the rank 100 approximation
(dashed).
Error in matrix approximation: ||A − A_k||_F / ||A||_F ≈ 0.8
Latent Semantic Indexing
Represent the data matrix in terms of singular vectors.
Jessup & Martin: “Rank reduction removes the noise that obscures
the semantic content of the data.”
Park et al.: “LSI is based on the assumption that there is some
underlying latent semantic structure in the data ... that is
corrupted by the wide variety of words used ...
Berry p. 57: “automatic association of related terms”
The vector space model and LSI can deal with synonymy and
polysemy (cf. plain word matching)
Synonymy: different words with the same meaning (football, soccer)
Polysemy: the same word has different meanings
Latent Semantic Indexing
“A key feature of LSI is its ability to extract the conceptual
content of a body of text by establishing associations between
those terms that occur in similar contexts”
“... uncovers the underlying latent semantic structure in the usage
of words in a body of text”
“LSI will return results that are conceptually similar in meaning to
the search criteria even if the results don’t share a specific word or
words with the search criteria.”
Source:
http://en.wikipedia.org/wiki/Latent_semantic_indexing
Text Parser TMG I
TMG - Text to Matrix Generator
TMG parses a text collection and generates the term - document matrix.
A = TMG(FILENAME) returns the term - document matrix, that corresponds
to the text collection contained in files of directory (or file) FILENAME.
Each document must be separated by a blank line (or another delimiter
that is defined by the OPTIONS argument) in each file.
[A, DICTIONARY] = TMG(FILENAME) returns also the dictionary for the
collection, while [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZED_FACTORS]
= TMG(FILENAME) returns the vectors of global weights for the dictionary
and the normalization factor for each document in case such a factor is used.
If normalization is not used TMG returns a vector of all ones.
[A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC] =
TMG(FILENAME) returns statistics for each document, i.e. the number of
terms for each document.
....
Copyright 2004 Dimitrios Zeimpekis, Efstratios Gallopoulos
M. W. Berry and M. Browne. Understanding Search Engines: Mathematical
Modeling and Text Retrieval. SIAM, Philadelphia, PA, second edition, 2005.

E. R. Jessup and J. H. Martin. Taking a new look at the latent semantic
analysis approach to information retrieval. In M. W. Berry, editor,
Computational Information Retrieval, pages 121–144, Philadelphia, PA,
2001. SIAM.

H. Park, M. Jeon, and J. B. Rosen. Lower dimensional representation of
text data in vector space based information retrieval. In M. W. Berry,
editor, Computational Information Retrieval, pages 3–23, Philadelphia,
PA, 2001. SIAM.