COMP527: Data Mining
M. Sulaiman Khan
([email protected])
Dept. of Computer Science
University of Liverpool
2009
Text Mining: Text-as-Data
March 25, 2009
COMP527: Data Mining
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam
Today's Topics
Word Frequencies
Relevance Scoring
Latent Semantic Indexing
Markov Models
Word Frequencies
Unlike other 'regular' attributes, each word can appear multiple
times in a single document. A word occurring more frequently within a single
document is a good indication that it is important to that text.
It is also interesting to look at the overall distribution of words within the
full data set, and within the vocabulary/attribute space/vector space.
The distribution of parts of speech is interesting as well.
Even individual letter frequencies are potentially interesting to compare
between different texts or sets of texts.
Letter Frequencies
The distribution of letters in a text can, with very low dimensionality,
capture some minimal features of the text.
Eg:
Alice W'land:   E T A O I H N S R D L U W G C Y M F P B
Holy Bible:     E T A H O N I S R D L F M U W C Y G B P
Reuters:        E T A I N S O R L D C H U M P F G B Y W
Tale 2 Cities:  E T A O N I H S R D L U M W C F G Y P B
Eg 'C' and 'S' are a lot more common in the Reuters news articles, while 'H' is
very uncommon; 'F' is more common in the Bible.
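As a rough sketch (not from the original slides), a few lines of Python are enough to produce this kind of ranking; the sample string here is just a placeholder for a real text.

from collections import Counter

def letter_ranking(text):
    """Letters of the text, ordered by descending frequency."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    return [letter.upper() for letter, _ in counts.most_common()]

# A placeholder string; in practice the input would be a whole book or news collection.
print(" ".join(letter_ranking("the quick brown fox jumps over the lazy dog")))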
Letter Frequencies
The distribution of letters in a text can also help with
language identification.
'E' is a lot more common in French (0.17) than in Spanish (0.133) or
English (0.125). 'V' and 'U' are also more common in French. The top
three letters in Spanish are all vowels: 'E', then 'A', then 'O'.
It is quite possible that the distribution of letters in the texts to be
classified is also informative, if they come from different styles, languages,
or subjects. Don't rule out the very easy :)
Word Frequencies
The distribution of words is also interesting. For vector construction, it
would be nice to know approximately how many unique words there are likely to be.
Heaps' Law: v = K * n^b
Where:
n = number of words (tokens)
K = constant, typically between 10 and 100
b = constant between 0 and 1, normally between 0.4 and 0.6
v = the size of the vocabulary (number of distinct words)
While this seems very fuzzy, it often works in practice: it predicts a
particular sub-linear growth curve for the vocabulary, which seems to hold up
in experiments.
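A minimal sketch of how the law can be applied, with K and b set to assumed illustrative values inside the ranges above rather than constants fitted to a real corpus:

def heaps_estimate(n_words, K=50, b=0.5):
    """Heaps' Law estimate of vocabulary size: v = K * n^b.

    K and b are illustrative defaults within the ranges quoted above;
    in practice they would be fitted to a sample of the corpus.
    """
    return K * n_words ** b

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} tokens -> ~{int(heaps_estimate(n))} distinct words")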
Word Frequencies
A second 'law': Zipf's Law
Idea: We use a few words very often, and most words very rarely,
because it's more effort to use a rare word.
Zipf's Law: the product of a word's frequency and its rank is
[reasonably] constant.
Also fuzzy, but also empirically demonstrable, and it holds up across
different languages.
This is a 'Power Law Distribution': a few events occur very often, and many
events occur infrequently.
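A quick, rough way to check this on any tokenised text (a sketch; the file name and the crude whitespace tokenisation are placeholder choices):

from collections import Counter

def zipf_check(tokens):
    """Print rank, frequency and rank * frequency for a sample of ranks."""
    ranked = Counter(tokens).most_common()      # [(word, freq), ...] sorted by frequency
    for rank in (1, 2, 4, 8, 16, 32, 64, 100):
        if rank <= len(ranked):
            word, freq = ranked[rank - 1]
            print(f"{rank:>5} {word:<15} {freq:>8} {rank * freq:>10}")

# 'corpus.txt' is a placeholder for whatever plain-text corpus is to hand.
tokens = open("corpus.txt").read().lower().split()
zipf_check(tokens)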
Word Frequencies
Zipf's Law Example:
Word      Rank   Freq.    Rank*F   | Word          Rank   Freq.   Rank*F
the       1      120021   120021   | investors     400    828     331200
of        2      72225    144450   | head          800    421     336800
and       4      53462    213848   | warrant       1600   184     294400
for       8      25578    204624   | Tehran        3200   73      233600
is        16     16739    267824   | guarantee     6400   25      160000
company   32     9340     298880   | Pittston      10000  11      110000
Co.       64     4005     256320   | thinly        20000  3       60000
quarter   100    2677     267700   | Morgenthaler  40000  1       40000
unit      200    1489     297800   | tabulating    47075  1       47075
Word Frequencies
The frequencies of words can also be used in relation to each class, in
comparison to the entire document set.
Eg words that are more frequent in a class than in the document set as a
whole are discriminating.
This idea can be used to generate weights for terms against each class, and
the weights can then be merged into an overall prediction of the class.
The same weights are also commonly used by search engines to predict
relevance to a user's query.
There are several different ways to create these weights...
TF-IDF
Term Frequency, Inverse Document Frequency
w(i, j) = tf(i,j) * log(N / df(i))
The weight of term i in document j is the frequency of term i in
document j, times the log of the total number of documents
divided by the number of documents that contain term i.
I.e. the more often the term occurs in the document, and the rarer the
term is across the collection, the more likely that document is to be relevant.
TF-IDF Example
w(i, j) = tf(i,j) * log(N / df(i))
In 1000 documents, 20 contain the word 'lego'. It appears between
1 and 6 times in those 20.
For the document where it appears 6 times:
w('lego', doc) = 6 * log(1000 / 20)
              = 6 * log(50)
              = 23.47    (using the natural log)
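A small sketch of the calculation, using the natural log as above; the counts are the ones from the example, not real data:

import math

def tf_idf(tf, df, n_docs):
    """w(i, j) = tf(i, j) * log(N / df(i)), using the natural log."""
    return tf * math.log(n_docs / df)

# The 'lego' example: 1000 documents, 20 contain the term, tf = 6 in this document.
print(round(tf_idf(tf=6, df=20, n_docs=1000), 2))   # -> 23.47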
TF-IDF
Then for multiple terms, merge the individual tf-idf weights according to
some function (sum, mean, etc.). Such a combined score can be generated
for each class.
Pros:
- Easy to implement
- Easy to understand
Cons:
- Document size not taken into account
- Low document frequency overpowers term frequency
CORI
Jamie Callan of CMU proposed this algorithm:
I = log((N + 0.5) / tf(i)) / log(N + 1.0)
T = df(i) / (df(i) + 50 + (150 * size(j) / avgSize(N)))
w(i,j) = 0.4 + (0.6 * T * I)
Takes into account the document's size, and the average size of all
documents. Otherwise a document with 6 matches in 100 words would be treated
the same as a document with 6 matches in 100,000 words.
A vast improvement over simple TF-IDF, while still remaining easy to
implement and to understand.
CORI Example
I = log((N + 0.5) / tf(i)) / log(N + 1.0)
T = df(i) / (df(i) + 50 + (150 * size(j) / avgSize(N)))
w(i,j) = 0.4 + (0.6 * T * I)
I = log(1000.5 / 6) / log(1001) = 0.74
T = 20 / (20 + 50 + (150 * 350 / 500)) = 0.11
w('lego', doc) = 0.4 + (0.6 * T * I) = 0.449
This uses the same 1000 documents and 20 matching documents as before, a term
frequency of 6, 350 words in the document, and an average of 500 words per
document across the 1000.
For more explanations see his papers:
http://www.cs.cmu.edu/~callan/Papers/
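As a rough sketch, the weight is easy to compute directly. The code below follows the formulas exactly as written on the slides; the small difference from 0.449 arises only because the slide rounds T and I before combining them.

import math

def cori_weight(tf, df, n_docs, doc_size, avg_size):
    """CORI-style weight, following the slide's formulas as written."""
    I = math.log((n_docs + 0.5) / tf) / math.log(n_docs + 1.0)
    T = df / (df + 50 + (150 * doc_size / avg_size))
    return 0.4 + 0.6 * T * I

# The 'lego' example: tf=6, df=20, 1000 docs, 350 words in this doc, 500 on average.
print(round(cori_weight(tf=6, df=20, n_docs=1000, doc_size=350, avg_size=500), 3))  # -> 0.451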
Latent Semantic Indexing
Finds the relationships between words in the document/term matrix: the
clusters of words that frequently co-occur in documents, and hence the
'latent semantic structure' of the document set.
Doesn't depend on individual words, but instead on the clusters of words.
Eg might use 'car' + 'automobile' + 'truck' + 'vehicle' instead of just 'car'
Twin problems:
Synonymy: Different words with the same meaning (car, auto)
Polysemy: Same spelling with different meaning (to ram, a ram)
(We'll come back to word sense disambiguation too)
Latent Semantic Indexing
Based on Singular Value Decomposition of the matrix (which is
something best left to math toolkits).
Basically: Transforms the dimensions of the vectors such that documents
with similar sets of terms are closer together.
Then use these groupings as clusters of documents.
You end up with fractions of words being present in documents (eg
'automobile' is somehow present in a document containing 'car').
These vectors are then used for analysis, rather than the straight frequency
vectors. As each dimension represents multiple words, you end up with smaller
vectors too.
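A minimal sketch of the idea using numpy's SVD on a made-up term-document count matrix; the terms, documents and choice of k are purely illustrative.

import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (counts are made up).
terms = ["car", "automobile", "truck", "flower", "petal"]
A = np.array([[2, 0, 1, 0],
              [0, 3, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 2],
              [0, 0, 1, 3]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                     # keep the 2 strongest 'latent' dimensions
doc_vectors = np.diag(s[:k]) @ Vt[:k]     # each column is a document in the reduced space

def cos(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity of document 0 and document 1 in the latent space.
print(cos(doc_vectors[:, 0], doc_vectors[:, 1]))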
Markov Models
Patterns of letters in language don't happen at random: compare this sentence
with "kqonw ldpwuf jghfmb edkfiu lpqwxz", which is obviously not language.
Markov models try to learn the probabilities of one item following
another, in this case letters.
Eg: take all of the words we have and build a graph of which letters
follow which other letters, including the 'start' and 'end' of words.
Then each arc between nodes has a weight for the probability.
Using a letter-based Markov model we might end up with words like:
annofed, mamigo, quarn, etc.
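A sketch of the idea: learn letter-to-letter transition counts from a handful of words (a made-up training set) and then sample new word-like strings from them.

import random
from collections import defaultdict

def train(words):
    """Count which letter follows which, including word start ('^') and end ('$')."""
    transitions = defaultdict(list)
    for word in words:
        chars = ["^"] + list(word) + ["$"]
        for a, b in zip(chars, chars[1:]):
            transitions[a].append(b)      # duplicates encode the transition frequencies
    return transitions

def generate(transitions):
    """Random walk from start to end, following the learned transition frequencies."""
    out, current = [], "^"
    while True:
        current = random.choice(transitions[current])
        if current == "$":
            return "".join(out)
        out.append(current)

model = train(["mining", "markov", "model", "matrix", "quarter", "annotated"])
print([generate(model) for _ in range(5)])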
Markov Models
Sequences of words are also not random (in English). We can use a much,
much larger Markov model to capture the probabilities of a word
following another word, or the start/end of a sentence.
Equally, words clump together in short phrases, and we could use multi-word
tokens as our graph nodes.
Here we could see, for example, how likely 'states' is to follow 'united'.
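The same idea at the word level, sketched with a few made-up sentences; it estimates, for example, how likely 'states' is to follow 'united'.

from collections import Counter, defaultdict

sentences = [
    "the united states cut interest rates",
    "the united kingdom raised rates",
    "investors in the united states sold shares",
]

# Count word bigrams across the (toy) corpus.
follows = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1

def prob(next_word, prev_word):
    """Estimated probability that next_word follows prev_word."""
    counts = follows[prev_word]
    return counts[next_word] / sum(counts.values()) if counts else 0.0

print(prob("states", "united"))   # 2 of the 3 occurrences of 'united' -> about 0.67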
Hidden Markov Models
Sequences of parts of speech for words are also not random (in English).
But here we don't care about the probabilities themselves; we want to use the
observations as a way to determine the actual part of speech of a word.
This is a Hidden Markov Model (HMM) as it uses the observable patterns
to predict some variable which is hidden.
Uses a trellis of states against the observation sequence, e.g. observations
O1, O2, O3, O4 along the top, with a row for state 1 and a row for state 2, and
arcs between the states at successive observations.
Hidden Markov Models
The probability calculations are stored on the arcs of the trellis.
There are various clever algorithms to make this computationally feasible:
Compute the probability of a particular output sequence:
Forward-Backward algorithm
Find the most likely sequence of hidden states to generate an output sequence:
Viterbi algorithm
Given an output sequence, find the most likely set of transition and output
probabilities (train the parameters of the model given a training set):
Baum-Welch algorithm
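As a concrete illustration of the decoding step, here is a small Viterbi sketch for a two-state tagging HMM; all of the probabilities are invented for the example.

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # trellis[t][s] = (best probability of reaching state s at time t, best previous state)
    trellis = [{s: (start_p[s] * emit_p[s].get(observations[0], 1e-9), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            best_prev = max(states, key=lambda p: trellis[-1][p][0] * trans_p[p][s])
            prob = trellis[-1][best_prev][0] * trans_p[best_prev][s] * emit_p[s].get(obs, 1e-9)
            column[s] = (prob, best_prev)
        trellis.append(column)
    # Trace back the best path from the final column.
    state = max(states, key=lambda s: trellis[-1][s][0])
    path = [state]
    for column in reversed(trellis[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Invented two-state POS example: NOUN vs VERB.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.4, "bark": 0.1}, "VERB": {"dogs": 0.05, "bark": 0.5}}

print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))  # -> ['NOUN', 'VERB']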
Further Reading
Konchady
Chou, Juang, Pattern Recognition in Speech and Language Processing, Chapters 8, 9
Weiss
Berry, Survey, Chapter 4
Han, 8.4.2
Dunham 9.2