Mika Klemettinen and Pirjo Moen
University of Helsinki/Dept of CS
Autumn 2001
Course on Data Mining (581550-4)
[Figure: course schedule. Topics: Intro/Ass. Rules, Episodes, Text Mining, Clustering, KDD Process, Appl./Summary, Home Exam. Dates: 24./26.10., 30.10., 7.11., 14.11., 21.11., 28.11.]
Today 07.11.2001
• Today's subject:
– Text Mining, focus on maximal
frequent phrases or maximal
frequent sequences (MaxFreq)
• Next week's program:
– Lecture: Clustering,
Classification, Similarity
– Exercise: Text Mining
– Seminar: Text Mining
Text Mining
Background
What is Text Mining?
MaxFreq Sequences
MaxFreq Algorithms
MaxFreq Experiments
Text Databases and
Information Retrieval
• Text databases (document databases)
– Large collections of documents from various sources:
news articles, research papers, books, digital libraries,
e-mail messages, Web pages, etc.
• Information retrieval (IR)
– Information is organized into (a large number of)
documents
– Information retrieval problem: locating relevant
documents based on user input, such as keywords or
example documents
Basic Measures for Text Retrieval
[Figure: Venn diagram of all documents, showing the sets Relevant and Retrieved and their intersection Relevant & Retrieved]
• Precision: the percentage of retrieved documents that are in
fact relevant to the query (i.e., “correct” responses)
$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$
• Recall: the percentage of documents that are relevant to the
query and were, in fact, retrieved
$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$
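The two measures can be computed directly from sets of document identifiers. The following is a minimal sketch; the function name and the example document IDs are illustrative assumptions, not part of the lecture material.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # {Relevant} intersected with {Retrieved}
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall(relevant={1, 2, 3, 4, 5, 6}, retrieved={1, 2, 3, 9})
print(p, r)  # 0.75 0.5
```

Retrieving more documents can only keep recall the same or raise it, but it usually lowers precision, which is why the two are reported together.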
Keyword/Similarity-Based Retrieval
• A document is represented by a string, which can be
identified by a set of keywords
• Find similar documents based on a set of common
keywords
• Answer should be based on the degree of relevance
based on the nearness of the keywords, relative
frequency of the keywords, etc.
• In the following, some basic techniques related to the
preprocessing and retrieval are briefly mentioned
Keyword/Similarity-Based Retrieval
• Basic techniques (1): Remove irrelevant words with a
stop list
– Set of words that are deemed “irrelevant”, even though
they may appear frequently
– E.g., a, the, of, for, with, etc.
– Stop lists may vary when document set varies
• Basic techniques (2): Take basic forms of words with
word stemming
– Several words are small syntactic variants of each other
since they share a common word stem (basic form)
– E.g., drug, drugs, drugged
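A rough illustration of both techniques follows. The stop list holds only the words named on the slide, and `toy_stem` is a deliberately crude suffix stripper written for this example; it is an assumption and not a real stemming algorithm such as Porter's.

```python
STOP_LIST = {"a", "the", "of", "for", "with"}  # examples from the slide


def toy_stem(word):
    """Crude suffix stripping for illustration only (hypothetical helper)."""
    if word.endswith("ed") and len(word) > 4:
        word = word[:-2]
        # Undo consonant doubling left by "-ed": drugg -> drug.
        if word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def preprocess(text):
    """Lower-case, drop stop words, and reduce the rest to toy stems."""
    words = text.lower().split()
    return [toy_stem(w) for w in words if w not in STOP_LIST]


print(preprocess("the drugs of the drugged with a drug"))
# ['drug', 'drug', 'drug']
```

A production system would use a language-specific stemmer or morphological analyzer, and a stop list tuned to the document collection.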
Keyword/Similarity-Based Retrieval
• Basic techniques (3): Calculate occurrences of terms to
a term frequency table
– Each entry frequent_table(i, j) = # of occurrences of
the word t_i in document d_j (or just "0" or "1")
• Basic techniques (4): Similarity metrics: measure the
closeness of a document to a query (a set of keywords)
– Cosine measure (cosine similarity):
$$\text{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$
– Relative term occurrences
• This is all nice to know, but where is the text mining
and how does it relate to this?
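As a small sketch of techniques (3) and (4), the code below builds term frequency vectors over a fixed vocabulary and compares a document with a query using the cosine measure. The vocabulary, the document and the query are made-up examples.

```python
import math
from collections import Counter


def term_frequencies(words, vocabulary):
    """One row of the term frequency table: counts in vocabulary order."""
    counts = Counter(words)
    return [counts[t] for t in vocabulary]


def cosine_similarity(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|); 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


vocabulary = ["data", "mining", "text", "retrieval"]
doc = "data mining text mining".split()
query = "text mining".split()

d = term_frequencies(doc, vocabulary)    # [1, 2, 1, 0]
q = term_frequencies(query, vocabulary)  # [0, 1, 1, 0]
print(round(cosine_similarity(d, q), 3))  # 0.866
```

The measure depends only on the angle between the vectors, so a long document is not favored merely for repeating its terms more often.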
Text Mining
Background
What is Text Mining?
MaxFreq Sequences
MaxFreq Algorithms
MaxFreq Experiments
What is Text Mining?
• Data mining in text: find something useful and
surprising from a text collection
• Text mining vs. information retrieval is like data
mining vs. database queries
Different Views on Text
• For example, we might have the following text:
Documents are an interesting application field
for data mining techniques.
• Remember the market basket data?
– The text can then be considered as a shopping transaction, i.e.,
row in the database
– The words occurring in the text can be considered as items bought
Transaction ID    Items Bought
100               A,B,C
200               A,C

Document ID       Words occurring
100               an,application,...
200               ...
Different Views on Text
• Recall the event sequence from episode rules:
[Figure: event sequence D C A B D A B C on a time axis from 0 to 90]
• Now we can consider the text as a sequence of words!
[Figure: the words Documents, are, an, interesting, application, field, for, data, mining, techniques on a position axis from 0 to 11]
Text Preprocessing
• So, suppose that we have the following example text:
Documents are an interesting application field
for data mining techniques.
• To this text, we might do the following preprocessing
operations:
1. Find the basic forms of the words (stemming)
2. Use stop list to remove uninteresting words
3. Select, e.g., the wanted word classes (e.g., nouns)
Text Preprocessing
(Documents, 1)      => (document_N_PL, 1)
(are, 2)            => (be_V_PRES_PL, 2)
(an, 3)             => (an_DET, 3)
(interesting, 4)    => (interesting_A_POS, 4)
(application, 5)    => (application_N_SG, 5)
(field, 6)          => (field_N_SG, 6)
(for, 7)            => (for_PP, 7)
(data, 8)           => (data_N_SG, 8)
(mining, 9)         => (mining_N_SG, 9)
(techniques, 10)    => (technique_N_PL, 10)
(., 11)             => (STOP, 11)
Morphological information: N = noun, PL = plural, V = verb, PRES =
present form, DET = determiner, A = adjective, POS = positive, SG =
singular, PP = preposition
Text Preprocessing
(document_N_PL, 1)       => (document_N_PL, 1)
(be_V_PRES_PL, 2)
(an_DET, 3)
(interesting_A_POS, 4)   => (interesting_A_POS, 4)
(application_N_SG, 5)    => (application_N_SG, 5)
(field_N_SG, 6)          => (field_N_SG, 6)
(for_PP, 7)
(data_N_SG, 8)           => (data_N_SG, 8)
(mining_N_SG, 9)         => (mining_N_SG, 9)
(technique_N_PL, 10)     => (technique_N_PL, 10)
(STOP, 11)
Text Preprocessing
(document_N_PL, 1)       => (document_N_PL, 1)
(interesting_A_POS, 4)
(application_N_SG, 5)    => (application_N_SG, 5)
(field_N_SG, 6)          => (field_N_SG, 6)
(data_N_SG, 8)           => (data_N_SG, 8)
(mining_N_SG, 9)         => (mining_N_SG, 9)
(technique_N_PL, 10)     => (technique_N_PL, 10)
Text Preprocessing
• Now we have a preprocessed sequence of words
[Figure: the words document, application, field, data, mining, technique on a position axis from 0 to 11]
• We might also just throw away the stop words etc., and
put words in consecutive "time slots" (1, 2, 3, …)
• Preprocessing can be applied to transaction-based text
data in a similar fashion
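The preprocessing steps of the running example can be sketched as a small filter over (base form, word class, position) triples. The triples below are typed in by hand; in practice they would come from a morphological analyzer, which is not shown. The function name and its parameters are assumptions made for this illustration.

```python
# Hand-tagged input for the example sentence: (base form, word class, position).
TAGGED = [
    ("document", "N", 1), ("be", "V", 2), ("an", "DET", 3),
    ("interesting", "A", 4), ("application", "N", 5), ("field", "N", 6),
    ("for", "PP", 7), ("data", "N", 8), ("mining", "N", 9),
    ("technique", "N", 10), ("STOP", "STOP", 11),
]
STOP_LIST = {"be", "an", "for", "STOP"}


def preprocess(tagged, stop_list, keep_classes=None, renumber=False):
    """Drop stop words, optionally keep only some word classes, and
    optionally move the remaining words to consecutive slots 1, 2, 3, ..."""
    kept = [
        (word, pos)
        for word, word_class, pos in tagged
        if word not in stop_list
        and (keep_classes is None or word_class in keep_classes)
    ]
    if renumber:
        kept = [(word, i) for i, (word, _) in enumerate(kept, start=1)]
    return kept


nouns = preprocess(TAGGED, STOP_LIST, keep_classes={"N"})
print(nouns)
# [('document', 1), ('application', 5), ('field', 6),
#  ('data', 8), ('mining', 9), ('technique', 10)]
print(preprocess(TAGGED, STOP_LIST, keep_classes={"N"}, renumber=True))
```

Keeping the original positions preserves the true distance between words, which matters later when a maximal gap is applied; renumbering treats the removed words as if they had never been there.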
Types of Text Mining
• Keyword (or term) based association analysis
• Automatic document classification
• Similarity detection
– Cluster documents by a common author
– Cluster documents containing information from a
common source
• Sequence analysis: predicting a recurring event,
discovering trends
• Anomaly detection: find information that violates usual
patterns
Term-Based Assoc. Analysis
• Collect sets of keywords or terms that occur frequently
together and then find the association relationships
among them
• First preprocess the text data by parsing, stemming,
removing stopwords, etc.
• Then invoke association mining algorithms
– Consider each document as a transaction
– View a set of keywords/terms in the document as a set
of items in the transaction
Term-Based Assoc. Analysis
• For example, we might find frequent sets such as:
2%: application, field
5%: data, mining
• …and association rules like:
application => field (2%, 52%)
data => mining (5%, 75%)
• These kinds of frequent sets etc. might help in
expanding user queries or in describing the
documents better than simple key words
• Sometimes it would be nice to discover new descriptive
phrases directly from the actual text - what then?
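To make the two numbers attached to each rule concrete, here is a minimal sketch that treats each document as a transaction, computes the support of word pairs, and the confidence of a rule. The four tiny documents are invented for the example, and only pairs are considered; a full association rule miner would handle larger sets.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(docs, min_support):
    """Word pairs that co-occur in at least min_support (a fraction) of docs."""
    n = len(docs)
    counts = Counter()
    for words in docs:
        # Each document is a transaction: word order and repeats are ignored.
        counts.update(combinations(sorted(set(words)), 2))
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


def confidence(docs, lhs, rhs):
    """Fraction of the documents containing lhs that also contain rhs."""
    with_lhs = [set(d) for d in docs if lhs in d]
    if not with_lhs:
        return 0.0
    return sum(rhs in d for d in with_lhs) / len(with_lhs)


docs = [
    ["data", "mining", "text"],
    ["data", "mining"],
    ["data", "retrieval"],
    ["text", "retrieval"],
]
print(frequent_pairs(docs, 0.5))           # {('data', 'mining'): 0.5}
print(confidence(docs, "data", "mining"))  # 0.666...
```

In the notation of the slide, this toy collection would give the rule data => mining (50%, 67%).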
Term-Based Episode Analysis
• Now, we want to find words/terms that occur frequently
close to each other in the actual text
• Take the preprocessed sequential text data and then
find relationships among the words/terms by invoking
episode mining algorithms (WINEPI or MINEPI)
• For example, we might find frequent episodes such as:
data, mining, knowledge, discovery
• …and MINEPI style episode rules like:
data, mining [4] => knowledge, discovery [8] (2%, 81%)
Problems
• Quite often, it could be interesting to try to find very
long descriptive phrases to describe the documents…
• …but discovery of long descriptive phrases might be
tedious, especially if and when you'll have to create all
shorter phrases in order to get the longest ones
• One answer can be maximal frequent sequences or
maximal frequent phrases (note: by concepts
"sequence" and "phrase" we mean basically the same)
Text Mining
Background
What is Text Mining?
MaxFreq Sequences
MaxFreq Algorithms
MaxFreq Experiments
Frequent Word Sequences
• Assume: S is a set of documents; each document consists
of a sequence of words
• A phrase is a sequence of words
• A sequence p occurs in a document d if all the words of p
occur in d, in the same order as in p
• A sequence p is frequent in S if p occurs in at least σ
documents of S, where σ is a given frequency threshold
• A maximal gap n can be given: the original locations of
any two consecutive words of a sequence can have at most
n words between them
Frequent Word Sequences
1: (The,70) (Congress,71) (subcommittee,72) (backed,73)
(away,74) (from,75) (mandating,76) (specific,77)
(retaliation,78) (against,79) (foreign,80)
(countries,81) (for,82) (unfair,83) (foreign,84)
(trade,85) (practices,86)
2: (He,105) (urged,106) (Congress,107) (to,108)
(reject,109) (provisions,110) (that,111) (would,112)
(mandate,113) (U.S.,114) (retaliation,115)
(against,116) (foreign,117) (unfair,118) (trade,119)
(practices,120)
3: (Washington,407) (charged,408) (France,409)
(West,410) (Germany,411) (the,412) (U.K.,413) (Spain,
414) (and,415) (the,416) (EC,417) (Commission,418)
(with,419) (unfair,420) (practices,421) (on,422)
(behalf,423) (of,424) (Airbus,425)
Frequent Word Sequences
Examples from the previous slides:
• The phrase
(retaliation, against, foreign, unfair,
trade, practices)
occurs in the first two documents, in the locations (78, 79,
80, 83, 85, 86) and (115, 116, 117, 118, 119, 120).
• The phrase (unfair, practices) occurs in all the
documents, namely in the locations (83, 86), (118, 120),
and (420, 421).
Note that we only count one occurrence of a sequence/doc!
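The occurrence test and the per-document counting can be sketched as follows. The three documents are the word sequences of the example, with positions dropped. A maximal gap of 2 is an assumption made here, since the slide does not state which gap the example uses; the function names are also illustrative.

```python
def occurs(phrase, doc, max_gap=None):
    """True if the words of phrase appear in doc in the same order, with at
    most max_gap other words between consecutive phrase words."""
    # ends: indexes in doc where a match of the phrase prefix so far can end.
    ends = [i for i, w in enumerate(doc) if w == phrase[0]]
    for word in phrase[1:]:
        ends = [
            j for j, w in enumerate(doc)
            if w == word and any(
                i < j and (max_gap is None or j - i - 1 <= max_gap)
                for i in ends
            )
        ]
    return bool(ends)


def document_frequency(phrase, docs, max_gap=None):
    """Number of documents containing the phrase (counted once per document)."""
    return sum(occurs(phrase, d, max_gap) for d in docs)


docs = [
    "The Congress subcommittee backed away from mandating specific "
    "retaliation against foreign countries for unfair foreign trade "
    "practices".split(),
    "He urged Congress to reject provisions that would mandate U.S. "
    "retaliation against foreign unfair trade practices".split(),
    "Washington charged France West Germany the U.K. Spain and the EC "
    "Commission with unfair practices on behalf of Airbus".split(),
]

long_phrase = "retaliation against foreign unfair trade practices".split()
print(document_frequency(long_phrase, docs, max_gap=2))             # 2
print(document_frequency(["unfair", "practices"], docs, max_gap=2))  # 3
print(document_frequency(["unfair", "practices"], docs, max_gap=1))  # 2
```

All candidate end positions are tracked rather than only the first match, because with a gap limit the earliest occurrence of a word is not always the one that lets the rest of the phrase match.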
Maximal Frequent Sequences
• Maximal frequent sequence:
– A sequence p is a maximal frequent (sub)sequence in S
if there does not exist any other sequence p' in S such
that p is a subsequence of p' and p' is frequent in S
• In short, a maximal frequent sequence is a sequence of
words that
– appears frequently in the document collection
– is not included in another longer frequent sequence
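Given a collection of sequences already known to be frequent, the maximal ones can be picked out with a plain subsequence test. This is only a sketch of the definition; the input list is a made-up example and is assumed to contain frequent sequences only.

```python
def is_subsequence(p, q):
    """True if the words of p appear in q in the same order (gaps allowed)."""
    remaining = iter(q)
    # "word in remaining" consumes the iterator up to the first match,
    # so each later word must be found after the previous one.
    return all(word in remaining for word in p)


def maximal(frequent):
    """Keep the sequences that are not contained in another frequent one."""
    frequent = [tuple(p) for p in frequent]
    return [
        p for p in frequent
        if not any(p != q and is_subsequence(p, q) for q in frequent)
    ]


frequent = [
    ("dow", "jones"),
    ("jones", "average"),
    ("dow", "jones", "industrial", "average"),
    ("prime", "minister"),
]
print(maximal(frequent))
# [('dow', 'jones', 'industrial', 'average'), ('prime', 'minister')]
```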
Maximal Frequent Sequences
• Usually, it makes sense to concentrate on the maximal
frequent sequences or maximal frequent phrases
– Subsequences or subphrases usually do not have a meaning
of their own
– However, sometimes subsequences or subphrases
may also be interesting, if they are much more frequent
A Maximal Seq. with Subseq.s
• Example (maximal sequence + subsequences):
dow jones industrial average
dow jones
dow industrial
dow average
jones industrial
jones average
industrial average
dow jones industrial
dow jones average
dow industrial average
jones industrial average
Examples of Meaningful Subseqs
• Interesting subsequences can be distinguished by the
characteristic that they are more frequent than the
maximal sequences
– Subsequence has its OWN occurrences in the text
– Subsequence might be shared by MANY maximal
sequences
– TOO FREQUENT subsequence might NOT be
interesting
Examples of Meaningful Subseqs
• Maximal sequences:
prime minister Lionel Jospin
prime minister Paavo Lipponen
• Subsequences:
prime minister
Lionel Jospin
Paavo Lipponen
Text Mining
Background
What is Text Mining?
MaxFreq Sequences
MaxFreq Algorithms
MaxFreq Experiments
Discovery of Frequent Sequences
• Frequency of a sequence cannot be decided locally: all
the instances in the collection have to be counted
• However: already a document of length 20 (words)
contains over one million sequences
• Only a small fraction of sequences are frequent
– There are many sequences that have only very few
occurrences
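The "over one million" figure can be checked with a short count. The sketch assumes that the 20 words are all distinct and that no gap limit applies, so every non-empty choice of positions gives a different sequence.

```python
from math import comb

n = 20  # document length in words

# Number of subsequences of each length k: choose k of the n positions.
by_length = {k: comb(n, k) for k in range(1, n + 1)}
total = sum(by_length.values())

print(total)         # 1048575, i.e. 2**20 - 1
print(by_length[2])  # 190 ordered word pairs
```

A maximal gap cuts this number down considerably, but the count still grows quickly with document length.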
Naïve Discovery Approach
• Basic idea: the "standard" bottom-up approach
– Collect all the pairs from the documents, count them,
and select the frequent ones
– Build sequences of length p+1 from frequent sequences
of length p
– Select sequences that are frequent
– Iterate
• Finally: select maximal sequences (by checking for each
phrase, whether it is contained in some other phrase)
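The bottom-up procedure above can be sketched in a few dozen lines. This is an illustrative, unoptimized version rather than the lecturers' implementation: candidates of length p+1 are formed by joining two frequent sequences of length p that overlap in p-1 words, and the function names and the small letter-based test collection are assumptions for the example.

```python
def occurs(seq, doc, max_gap):
    """True if seq appears in doc in order, at most max_gap words skipped
    between consecutive words of seq."""
    ends = [i for i, w in enumerate(doc) if w == seq[0]]
    for word in seq[1:]:
        ends = [j for j, w in enumerate(doc) if w == word
                and any(0 < j - i <= max_gap + 1 for i in ends)]
    return bool(ends)


def support(seq, docs, max_gap):
    return sum(occurs(seq, d, max_gap) for d in docs)


def is_subsequence(p, q):
    remaining = iter(q)
    return all(w in remaining for w in p)


def naive_maximal_sequences(docs, threshold, max_gap):
    words = sorted({w for d in docs for w in d})
    # Level 2: all frequent ordered pairs.
    level = [(a, b) for a in words for b in words
             if support((a, b), docs, max_gap) >= threshold]
    frequent = []
    while level:
        frequent.extend(level)
        # Join p and q when p without its first word equals q without its last.
        candidates = {p + q[-1:] for p in level for q in level
                      if p[1:] == q[:-1]}
        level = [c for c in sorted(candidates)
                 if support(c, docs, max_gap) >= threshold]
    # Finally keep only sequences not contained in another frequent sequence.
    return [p for p in frequent
            if not any(p != q and is_subsequence(p, q) for q in frequent)]


# Six toy documents; each letter stands for one word.
docs = ["ABCDE", "PBCDK", "ABCHDK", "PBCDEN", "PBCKELM", "RHKLM"]
result = naive_maximal_sequences(docs, threshold=2, max_gap=2)
print(sorted("".join(p) for p in result))
# ['ABCD', 'BCDE', 'BCDK', 'HK', 'KLM', 'PBCD', 'PBCE', 'PBCK']
```

Every frequent sequence of every length is generated and counted before the maximal ones are selected, which is exactly the cost that becomes prohibitive when frequent phrases are long.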
Problems in the Naïve Approach
• Problem: frequent sequences in text can be long
– In our experiments: longest phrase 22 words (Reuters-21578
newswire data, 19000 documents, frequency
threshold 15, max gap 2)
– Processing all the subphrases of all lengths is not
possible
– Straightforward bottom-up approach does not work
– Restriction of the length would produce a large number
of slightly differing subphrases of a phrase that is
longer than the threshold
Combining Bottom-Up and
Greedy Approaches: MaxFreq
• First, frequent pairs are collected
=> Initial phase
• Longer sequences are constructed from shorter sequences
(k-grams) as in the bottom-up approach
=> Discovery phase
• Maximal sequences are discovered directly, starting from a
k-gram that is not a subsequence of any known maximal
sequence
=> Expansion step
Combining Bottom-Up and
Greedy Approaches: MaxFreq
• Each maximal sequence has at least one unique
subsequence that distinguishes it from the other maximal
sequences. A maximal sequence is discovered, at the latest,
on the level k, where k is the length of the shortest unique
subsequence.
• Grams that cannot be used to construct any new maximal
sequences are pruned away after each level, before the
length of grams is increased
=> Pruning step
• Let's take a closer look at these phases and steps!
Algorithm: Initial Phase
Input: a set of documents S, a frequency threshold,
and a maximal gap
Output: a gram set Grams2 containing the frequent pairs
For all the documents d ∈ S
    collect all the ordered pairs of words (A,B) within d
    such that A and B occur in this order (wrt maximal gap)
Grams2 = all the ordered pairs that are frequent in the set S
(wrt frequency threshold)
Return Grams2
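A direct transcription of the initial phase might look like this. It is a sketch under the assumption that a document is a plain sequence of words and that a pair is counted once per document; the function name and the toy documents, in which each letter stands for one word, are illustrative.

```python
from collections import Counter


def initial_phase(docs, threshold, max_gap):
    """Return (Grams2, counts): the frequent ordered pairs and, for every
    pair seen, the number of documents in which it occurs."""
    counts = Counter()
    for doc in docs:
        pairs = set()
        for i, a in enumerate(doc):
            # b may follow a with at most max_gap words in between.
            for b in doc[i + 1 : i + max_gap + 2]:
                pairs.add((a, b))
        counts.update(pairs)  # each pair counted once per document
    grams2 = {pair for pair, n in counts.items() if n >= threshold}
    return grams2, counts


docs = ["ABCDE", "PBCDK", "ABCHDK", "PBCDEN", "PBCKELM", "RHKLM"]
grams2, counts = initial_phase(docs, threshold=2, max_gap=2)
print(counts[("B", "C")], counts[("A", "D")])  # 5 1
print(len(counts), len(grams2))                # 36 18
```

Only a window of max_gap + 1 following words is scanned for each word, so the cost per document is linear in its length for a fixed gap.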
Algorithm: Initial Phase
Document 1: (A,11) (B,12) (C,13) (D,14) (E,15)
Document 2: (P,21) (B,22) (C,23) (D,24) (K,25)
Document 3: (A,31) (B,32) (C,33) (H,34) (D,35) (K,36)
Document 4: (P,41) (B,42) (C,43) (D,44) (E,45) (N,46)
Document 5: (P,51) (B,52) (C,53) (K,54) (E,55) (L,56) (M,57)
Document 6: (R,61) (H,62) (K,63) (L,64) (M,65)
Algorithm: Initial Phase
• The following pairs of words could be found (with max
gap=2). E.g., AB occurs in doc 1 ([11-12]) and in doc 3
([31-32]), while AE is infrequent ([11-15] > max gap).
AB 2    BE 3    CK 3    EL 1    HM 1    PC 3
AC 2    BH 1    CL 1    EM 1    KE 1    PD 2
AD 1    BK 2    CN 1    EN 1    KL 2    PK 1
AH 1    CD 4    DE 2    HD 1    KM 2    RH 1
BC 5    CE 3    DK 2    HK 2    LM 2    RK 1
BD 4    CH 1    DN 1    HL 1    PB 3    RL 1
Algorithm: Discovery Phase
Input: a gram set Grams2 containing the frequent pairs (A, B)
Output: the set Max of maximal frequent phrases
k := 2; Max := ∅
While Grams_k is not empty
    For all grams g ∈ Grams_k
        If a gram g is not a subphrase of some m ∈ Max
            If a gram g is frequent
                max := Expand(g)
                Max := Max ∪ {max}
                If max = g Remove {g} from Grams_k
            Else Remove {g} from Grams_k
    Prune(Grams_k)
    Join the grams of Grams_k to form Grams_k+1
    k := k + 1
Return Max
Algorithm: Expansion Step
Input: a phrase p
Output: a maximal frequent phrase p' such that p is a subphrase of p'
Repeat
    Let l be the length of the sequence p.
    Find a sequence p' such that the length of p' is l+1,
    and p is a subsequence of p'.
    If p' is frequent
        p := p'
Until there exists no frequent p'
Return p
Note! All the possibilities to expand have to be checked: tail, front,
middle!
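The expansion loop can be sketched as a greedy search that tries to insert one word at every position (front, middle, tail) and keeps the first extension that is still frequent. The helper names, the brute-force frequency test and the toy documents are assumptions for the illustration. Note that the result depends on the order in which candidates are tried, and that what the loop guarantees is a sequence with no frequent one-word extension.

```python
def occurs(seq, doc, max_gap):
    ends = [i for i, w in enumerate(doc) if w == seq[0]]
    for word in seq[1:]:
        ends = [j for j, w in enumerate(doc) if w == word
                and any(0 < j - i <= max_gap + 1 for i in ends)]
    return bool(ends)


def support(seq, docs, max_gap):
    return sum(occurs(seq, d, max_gap) for d in docs)


def expand(p, docs, threshold, max_gap):
    """Grow the frequent sequence p one word at a time until no one-word
    extension at the front, in the middle or at the tail is frequent."""
    vocabulary = sorted({w for d in docs for w in d})
    p = tuple(p)
    while True:
        extension = next(
            (p[:pos] + (w,) + p[pos:]
             for pos in range(len(p) + 1)   # every insertion point
             for w in vocabulary
             if support(p[:pos] + (w,) + p[pos:], docs, max_gap) >= threshold),
            None,
        )
        if extension is None:
            return p
        p = extension


docs = ["ABCDE", "PBCDK", "ABCHDK", "PBCDEN", "PBCKELM", "RHKLM"]
print("".join(expand("AB", docs, threshold=2, max_gap=2)))  # ABCD
print("".join(expand("KL", docs, threshold=2, max_gap=2)))  # KLM
print("".join(expand("HK", docs, threshold=2, max_gap=2)))  # HK
```

This version recounts the support of every candidate from scratch; an efficient implementation would reuse stored occurrences instead.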
Algorithm: Expansion Step
1: (A,11) (B,12) (C,13) (D,14) (E,15)
2: (P,21) (B,22) (C,23) (D,24) (K,25)
3: (A,31) (B,32) (C,33) (H,34) (D,35) (K,36)
4: (P,41) (B,42) (C,43) (D,44) (E,45) (N,46)
5: (P,51) (B,52) (C,53) (K,54) (E,55) (L,56) (M,57)
6: (R,61) (H,62) (K,63) (L,64) (M,65)
Freq:
AB BD CD DE KL PB
AC BE CE DK KM PC
BC BK CK HK LM PD
Exp:
AB => ABC => ABCD (- ABCDE, ABCDK)
BE => BCE => BCDE
Example
• Maximal frequent sequences after the first expansion
step:
AB => ABC => ABCD
BE => BCE => BCDE
BK => BDK => BCDK
KL => KLM
PD => PBD => PBCD
HK
Example
• 3-grams after join:
ABC ABD ABE ABK ACD ACE
ACK BCD BCE BCK BDE BDK
CDE CDK PBC PBD PBE PBK
PCD PCE PCK PDE PDK
BKL BKM CKL CKM DKL DKM
KLM
(italics + underlined = already found maximal phrase)
• New maximal frequent sequences:
PBE => PBCE
PBK => PBCK
Example
• 3-grams after the second expansion step:
ABC ABD ACD BCD BCE BCK
BDE BDK CDE CDK PBC PBD
PBE PBK PCD PCE PCK
• 4-grams after join:
ABCD ABCE ABCK ABDE ABDK ACDE ACDK BCDE
BCDK PBCD PBCE PBCK PBDE PBDK PCDE PCDK
Algorithm: Pruning Step
• After expansion step, every gram is a subsequence of some
maximal sequence
• For any other maximal sequence m not found yet: m has to
contain grams from two or more other maximal sequences,
or from one sequence m' in a different order than in m'
• For each gram g: check if g can join grams of maximal
sequences in a new way
=> extract sequences that are frequent and not yet
included in any maximal sequence; mark the grams
• Remove grams that are not marked
Pruning After the 1st Exp. Step
• BC: ABCD, BCDE, BCDK, PBCD
• Prefixes: A, P
• Suffixes: D, DE, DK
• Check the strings ABCDE, ABCDK, PBCDE, PBCDK
• Is there a subsequence that is frequent and not included in any
maximal sequence?
ABCDE - ABC - ABCD (maximal)
            - ABCE (not frequent)
      - BCD - BCDE (maximal)
            - ABCD (known)
      - BCE - ABCE (known)
Pruning After the 1st Exp. Step
PBCDE - PBC - PBCD (maximal)
            - PBCE (frequent, not in maximal)
      - BCD - BCDE (maximal)
            - PBCD (known)
      - BCE - PBCE (known)
PBCDK - PBC - PBCD (maximal)
            - PBCK (frequent, not in maximal)
...
Marked: PB, BC, CE, CK
All the other grams are removed.
Algorithm: Implementation
Data structures:
• Table: for each pair its exact occurrences in text
• Table: for each prefix the grams that have this prefix
• Table: for each suffix the grams that have this suffix
• Table: for each pair the indexes of maximal
sequences within which it is a subsequence
• An array of maximal sequences
• Document identifiers are attached to the grams and
occurrences
Testing Frequency
• The occurrences of frequent pairs are stored:
AB: [11-12][31-32]
AC: [11-13][31-33]
BC: [12-13][22-23][32-33][42-43][52-53]
• The occurrences of longer sequences are computed
from the occurrences of pairs
• All the occurrences computed are stored
– The computation for ABC may help to compute later
the frequency for ABCD
Testing Frequency
– ABCD can only occur in places where ABC has
occurred
• NOTE:
– Already calculated occurrences can be used while
adding elements to the front or to the tail
– ABCD may occur in more documents than ABD, since
the distance of B and D might be greater than the
maximal gap
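Computing the occurrences of a longer sequence from stored pair occurrences can be sketched as a join on the shared word's position. The AB and BC lists are the ones shown above; the CD list is derived from the same six example documents. The function names are illustrative, and mapping a position to its document with `pos // 10` is an assumption that only holds for the numbering used in this example.

```python
def extend_tail(seq_occ, pair_occ):
    """Occurrences of a sequence extended by one word at the tail.

    seq_occ:  (start, end) positions of a sequence whose last word is X
    pair_occ: (x_pos, y_pos) positions of the pair (X, Y)
    """
    y_by_x = {}
    for x_pos, y_pos in pair_occ:
        y_by_x.setdefault(x_pos, []).append(y_pos)
    # The pair must start exactly where the sequence ends.
    return sorted((s, y) for s, e in seq_occ for y in y_by_x.get(e, []))


def doc_count(occurrences, doc_of=lambda pos: pos // 10):
    """Number of distinct documents among the occurrences."""
    return len({doc_of(start) for start, _ in occurrences})


AB = [(11, 12), (31, 32)]
BC = [(12, 13), (22, 23), (32, 33), (42, 43), (52, 53)]
CD = [(13, 14), (23, 24), (33, 35), (43, 44)]

ABC = extend_tail(AB, BC)
ABCD = extend_tail(ABC, CD)   # reuses the stored result for ABC
print(ABC, doc_count(ABC))    # [(11, 13), (31, 33)] 2
print(ABCD, doc_count(ABCD))  # [(11, 14), (31, 35)] 2
```

Extending at the front works the same way with the roles of start and end swapped; inserting a word in the middle needs the positions of the inner words as well, which this compact (start, end) form does not keep.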
Text Mining
Background
What is Text Mining?
MaxFreq Sequences
MaxFreq Algorithms
MaxFreq Experiments
Experiments
• Data: Reuters-21578 newswire collection (year 1987)
• Around 19000 documents (average length 135 words)
• Originally 2.5 million words, after stopword pruning (400
stopwords) 1.3 million words
– Stopwords: single letters, pronouns, prepositions, some
abbreviations (e.g., pct, dlr, cts, shr), etc.
• 50,000 distinct words (stemming was not used)
• Frequency threshold 15, max gap 2 (stopwords pruned)
• Prototype implementation in Perl
• Sun Enterprise 450, with 1 GB of main memory
Experiments
• Amounts of maximal frequent sequences of different
lengths:
Len     2      3      4    5    6   7   8   9   10  11  12
f:15    7,664  1,320  353  146  65  17  8   4   13  12  13

Len     13  14  15  16  17  18  19  20  21  22  23
f:15    5   -   1   1   -   1   -   -   -   2   -
Examples of MaxFreq Sequences
• Solid, established phrases:
bundesbank president karl otto poehl
european monetary system ems
• Verb phrases:
bank england provided money market assistance
board declared stock split payable april
boost domestic demand
• Short phrases:
expects higher
expects complete
Phrases Extracted from "Doc A"
• The following phrases are extracted from one document
belonging to the Reuters data set
• The phrases contain both maximal phrases and subphrases
that are more frequent than the maximal ones
• The document describes a situation where the persons
monitoring the nuclear power plant operation were caught
asleep during their shift and the Nuclear Regulatory
Commission ordered the power plant to be closed
• As you can see, the phrases do not actually reveal what
happened, they just tell about the subject matter
Phrases Extracted from "Doc A"
power station                    11
immediately after                26
co operations                    11
effective april                  63
company's operations             20
unit nuclear                     12
unit power                       16
early week                       42
senior management                28
nuclear regulatory commission    14
-regulatory commission           34
nuclear power plant              26
-power plant                     55
-nuclear power                   42
-nuclear plant                   42
electric co                      143
Phrases Extracted from "Doc B"
• Maximal frequent sequence (frequency = 15):
federal reserve entered u.s. government
securities market arrange repurchase agreements
fed dealers federal funds trading fed began
temporary supply reserves banking system
• One occurrence of the phrase:
The Federal Reserve entered the U.S. Government
securities market to arrange 1.5 billion dlrs
of customer repurchase agreements, a Fed
spokesman said. Dealers said Federal funds were
trading at 6-3/16 pct when the Fed began its
temporary and indirect supply of reserves to
the banking system.
Phrases Extracted from "Doc B"
• The frequency of the sequence is 13, and it contains the
following subsequences that are more frequent:
arrange repurchase    23     banking system            66
fed federal           25     trading fed               22
fed funds             23     trading system            25
fed temporary         23     reserve u.s.              43
market arrange        23     supply reserves           36
market trading        41     supply system             25
u.s. government       160    dealers federal           30
u.s. dealers          32     dealers funds             27
u.s. trading          35     dealers trading           33
u.s. supply           26     federal u.s.              28
reserves system       36     federal trading           30
securities arrange    23     funds trading             43
securities trading    32     reserve u.s. government   31
government arrange    23     reserves banking system   25
Use of Frequent Phrases
• Goal: rich computational representation for documents
– Feature sets for analysis
– Human-readable description
• Applications
– Key phrases in information retrieval
– Overview to the collection: clustering
– Summary of the content
– Automatic generation of hypertext links
– Associations between documents
– Browsing of document collection
Use of Frequent Phrases
• Example: suppose that a query "agricultur*" has
been made
• The user has been given a "middle-level list" of phrases
that tell something more about the context around the
words in the query
Use of Frequent Phrases
agricultural exports
agricultural production
agricultural products
agricultural stabilization conservation service
agricultural subsidies
agricultural trade
u.s. agriculture
agriculture department usda
agriculture department wheat
agriculture minister
agriculture officials
agriculture undersecretary daniel amstutz
common agricultural policy
ec agriculture ministers
european community agriculture
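A list like the one above could be produced by matching the wildcard query against the words of each discovered phrase. The sketch below uses a handful of the phrases listed, plus one unrelated phrase, as a stand-in for a real phrase index; the function name is an assumption.

```python
from fnmatch import fnmatchcase

# A small stand-in for the set of discovered phrases.
PHRASES = [
    "agricultural exports",
    "agricultural subsidies",
    "u.s. agriculture",
    "common agricultural policy",
    "federal reserve",
]


def phrases_for_query(query, phrases):
    """Phrases in which at least one word matches the wildcard query."""
    return [p for p in phrases
            if any(fnmatchcase(word, query) for word in p.split())]


print(phrases_for_query("agricultur*", PHRASES))
# ['agricultural exports', 'agricultural subsidies',
#  'u.s. agriculture', 'common agricultural policy']
```

The query word does not have to be the first word of the phrase, which is why phrases such as "common agricultural policy" are also returned.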
Use of Frequent Phrases
• Suppose that the user is interested in subject
"agricultural subsidies" and selects it from the
list
• As an answer to the query, one might now return all the
sentences containing the phrase "agricultural
subsidies" (e.g., the ones on the next pages)
• Alternatively, the user might want to see directly the whole
documents in which the phrase appears, or the other
phrases that occur together with the phrase
"agricultural subsidies" in the documents
Summary
• Text mining:
– The "roots" are in text databases and information
retrieval
– Data mining techniques might complement or help the
existing database/information retrieval techniques
• In this lecture, only a few methods based on association
and episode style algorithms were given:
– Naïve approaches applicable to some extent, maximal
frequent phrases might be useful in some cases
– Many clustering, classification and similarity
techniques that will be presented in the next lectures
are useful to go a few steps further
References
• Helena Ahonen-Myka: Finding All Frequent Maximal Sequences in
Text. In ICML-99 Workshop on Machine Learning in Text Data
Analysis, p. 11-17, J. Stefan Institute, Ljubljana 1999. See electronic
version at http://www.cs.helsinki.fi/u/hahonen/ham_icml99.ps
• Han, J., Kamber, M.: Data Mining: Concepts and Techniques (also
available at "http://www.cs.sfu.ca/~han/DM_Book.html"), Section 9.5
of the book.
• Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and Inkeri
Verkamo. Applying Data Mining Techniques for Descriptive Phrase
Extraction in Digital Document Collections. In Advances in Digital
Libraries'98, April 1998. See electronic version at http://wwwdb.informatik.uni-tuebingen.de/forschung/papers/adl98.ps
Course Organization
Next Week
• Lecture 14.11.: Clustering, Classification, Similarity
– Pirjo gives the lecture
• Exercise 15.11.: Text mining
– Pirjo takes care of you! :-)
• Seminar 9.11.: Text mining
– Mika gives the lecture
– 2 group presentations (groups 5-6)
Seminar Presentations/Groups 5-6
Feldman et al.
R. Feldman et al.: "Knowledge Management: A Text Mining Approach", PAKM 1998.
Lent, Agrawal, Srikant
B. Lent, R. Agrawal, R. Srikant: "Discovering Trends in Text Databases", KDD 1997.
Seminar Presentations
• Requirements:
– Articles are given on previous week's Wed
– Presentation in an HTML page (around 3-5 printed pages) due
by the start of the seminar:
• Can be either an HTML page or a printable document in
PostScript/PDF format
– 30 minutes of presentation
– 5-15 minutes of discussion
– Active participation
• Remember:
– Try to understand the "message" in the article
– Try to present the basic ideas as clearly as possible, use
examples
– Do not present detailed mathematics or algorithms
– Test: do you understand your own presentation?
– In the presentation, use PowerPoint or conventional slides
Text Mining
Thank you for
your attention!
Thanks to Helena Ahonen-Myka and Jiawei Han for their slides
which greatly helped in preparing this lecture!