Survey							
                            
		                
		                * Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Mining Text and Web Data
Contents of this Chapter
Introduction
Data Preprocessing
Text and Web Clustering
Text and Web Classification
[Han & Kamber 2006, Sections 10.4 and 10.5]
References to research papers at the end of this chapter
SFU, CMPT 741, Fall 2009, Martin Ester
302
The Web and Web Search
Repository
Web Server
Clustering
Classification
Indexer
Inverted
Index
engine
jaguar
cat
Topic Hierarchy
Root
News
Plants
Science
Animals
Storage
Server
Crawler
The jaguar has a 4 liter engine
The jaguar, a cat, can run at
speeds reaching 50 mph
Documents in repository
Business
Computers
jaguar
Automobiles
SFU, CMPT 741, Fall 2009, Martin Ester
Search
Query
303
Web Search Engines
Keyword Search
“data mining”
519,000 results
SFU, CMPT 741, Fall 2009, Martin Ester
304
Web Search Engines
Boolean search
• care AND NOT old
My0 care1 is loss of care
with old care done
D1
Phrases and proximity
• “new care”
• loss NEAR/5 care
• <SENTENCE>
Your care is gain of
care with new care won
care
D2
D1: 1, 5, 8
D2: 1, 5, 8
Indexing
Inverted Lists
How to rank the result lists?
new
D2: 7
old
D1: 7
loss
D1: 3
SFU, CMPT 741, Fall 2009, Martin Ester
305
Ranking Web Pages
Page Rank [Brin & Page 98]
• Idea: the more high ranked pages link to a web page, the higher its rank.
PageRank (u )
PageRank (v)  p  (1  p) 
u v OutDegree (u )
• Interpretation by random walk:
PageRank is the probability that a “random surfer” visits a page
» Parameter p is probability that the surfer gets bored and starts on a new
random page.
» (1-p) is the probability that the random surfer follows a link on current
page.
• PageRanks correspond to principal eigenvector of the normalized link matrix.
• Can be calculated using an efficient iterative algorithm.
SFU, CMPT 741, Fall 2009, Martin Ester
306
Ranking Web Pages
Hyperlink-Induced Topic Search (HITS) [Kleinberg 98]
Definitions
• Authorities: highly-referenced pages on a topic.
• Hubs: pages that “point” to authorities.
• A good authority is pointed to by many good hubs;
a good hub points to many good authorities.
Hubs
Authorities
SFU, CMPT 741, Fall 2009, Martin Ester
307
Ranking Web Pages
HITS
Method
• Collect seed set of pages S (e.g., returned by search engine).
• Expand seed set to contain pages that point to or are pointed to by pages in
seed set.
• Initialize all hub/authority weights to 1.
• Iteratively update hub weight h(p) and authority weight a(p) for each page:
a( p) 
 h( q )
h( p ) 
q p
 a(q)
p q
• Stop, when hub/authority weights converge.
SFU, CMPT 741, Fall 2009, Martin Ester
308
Ranking Web Pages
Comparison
• PageRanks computed initially for web pages independent of search query.
• PageRank is used by web search engines such as Google.
• HITS: Hub and authority weights computed for different root sets in the
context of a particular search query.
• HITS is applicable, e.g., for ranking the crawl front of a focused web
crawler.
• Can use the relevance of a web page for the given topic to weight the edges
of the web (sub) graph.
SFU, CMPT 741, Fall 2009, Martin Ester
309
Directory Services
Browsing
“data mining”
~ 200 results
SFU, CMPT 741, Fall 2009, Martin Ester
310
Directory Services
Topic Hierarchy
• Provides a hierarchical classification of documents / web pages (e.g., Yahoo!)
Yahoo home page
Recreation
Travel
Sports
Business
Companies
Science
Finance
News
Jobs
• Topic hierarchy can be browsed interactively to find relevant web pages.
• Keyword searches performed in the context of a topic: return only a subset of
web pages related to the topic.
SFU, CMPT 741, Fall 2009, Martin Ester
311
Mining Text and Web Data
Shortcomings of the Current Web Search Methods
Low precision
• Thousands of irrelevant documents returned by web search engine
99% of information of no interest to 99% of people.
Low recall
• In particular, for directory services (due to manual acquisition).
• Even largest crawlers cover less than 50% of all web pages.
Low quality
• Many results are out-dated, broken links etc.
SFU, CMPT 741, Fall 2009, Martin Ester
312
Mining Text and Web Data
Data Mining Tasks
• Classification of text documents / web pages
To insert into topic hierarchy, as feedback for focused web crawlers, . . .
• Clustering of text documents / web pages
To create topic hierarchy, as front-end for web search engines . . .
• Resource discovery
Discovery of relevant web pages using a focused web crawler,
in particular ranking of the links at the crawl front
SFU, CMPT 741, Fall 2009, Martin Ester
313
Mining Text and Web Data
Challenges
• Ambiguity of natural language
synonyms, homonyms, . . .
• Different natural languages
English, Chinese, . . .
• Very high dimensionality of the feature spaces
thousands of relevant terms
• Multiple datatypes
text, images, videos, . . .
• Web is extremely dynamic
> 1 million pages added each day
• Web is a very large distributed database
huge number of objects, very high access costs
SFU, CMPT 741, Fall 2009, Martin Ester
314
Text Representation
Preprocessing
• Remove HTML tags, punctuation etc.
• Define terms (single-word / multi-word terms)
• Remove stopwords
• Perform stemming
• Count term frequencies
• Some words are more important than others
smooth the frequencies,
e.g. weight by inverse document frequency
SFU, CMPT 741, Fall 2009, Martin Ester
315
Text Representation
Transformation
• Different definitions of inverse document frequency
n(d,t): number of occurrences of term t in document d
n( d , t )
,
d n(d , t )
n( d , t )
n( d , t )
,
t n(d , t ) max t n(d , t )
• Select “significant” subset of all occurring terms
• Vocabulary V, term ti, document d represented as
Vector Space Model
rep(d )  n(d , ti )ti V
 Most n’s are zeroes for a single document
SFU, CMPT 741, Fall 2009, Martin Ester
316
Text Representation
Similarity Function
similarity (d 1, d 2) 
mining
rep (d 1), rep (d 2)
 ,   inner product
rep (d 1)  rep (d 2)
similar
Cosine Similarity
data
document
SFU, CMPT 741, Fall 2009, Martin Ester
317
Text Representation
Semantic Similarity
Where can I fix my scooter?
A great garage to repair your 2-wheeler is …
 A scooter is a 2-wheeler.
 Fix and repair are synonyms.
Two basic approaches for semantic similarity
• Hand-made thesaurus (e.g., WordNet)
• Co-occurrence and associations
SFU, CMPT 741, Fall 2009, Martin Ester
… auto …car
… auto …car
… car
… auto
… auto
…car
… car … auto
… car … auto
car  auto
… auto …
… car …
318
Text and Web Clustering
Overview
Standard (Text) Clustering Methods
 Bisecting k-means
 Agglomerative Hierarchical Clustering
Specialised Text Clustering Methods
 Suffix Tree Clustering
 Frequent-Termset-Based Clustering
Joint Cluster Analysis
• Attribute data: text content
• Relationship data: hyperlinks
SFU, CMPT 741, Fall 2009, Martin Ester
319
Text and Web Clustering
Bisecting k-means [Steinbach, Karypis & Kumar 2000]
K-means
10
10
10
9
9
9
8
8
8
7
7
7
6
6
6
5
5
5
4
4
4
3
3
2
2
1
1
0
0
0
1
2
3
4
5
6
7
8
9
10
3
2
1
0
0
0
1
2
3
4
5
6
7
8
9
1
2
3
4
5
6
7
8
9
10
Bisecting k-means
• Partition the database into 2 clusters
• Repeat: partition the largest cluster into 2 clusters . . .
• Until k clusters have been discovered
SFU, CMPT 741, Fall 2009, Martin Ester
320
10
Text and Web Clustering
Bisecting k-means
Two types of clusterings
• Hierarchical clustering
• Flat clustering: any cut of this hierarchy
Distance Function
• Cosine measure (similarity measure)
s ( ,  ) 
g (c( )), g (c(  ))
g (c( ))  g (c(  ))
SFU, CMPT 741, Fall 2009, Martin Ester
321
Text and Web Clustering
Agglomerative Hierarchical Clustering
1.
Form initial clusters consisting of a singleton object, and compute
the distance between each pair of clusters.
2. Merge the two clusters having minimum distance.
3. Calculate the distance between the new cluster and all other clusters.
4. If there is only one cluster containing all objects:
Stop, otherwise go to step 2.
Representation of a Cluster C
rep(C )   n(d , ti )
dC
ti V
SFU, CMPT 741, Fall 2009, Martin Ester
322
Text and Web Clustering
Experimental Comparison [Steinbach, Karypis & Kumar 2000]
Clustering Quality
•
•
•
•
Measured as entropy on a prelabeled test data set
Using several text and web data sets
Bisecting k-means outperforms k-means.
Bisecting k-means outperforms agglomerative hierarchical clustering.
Efficiency
•
•
Bisecting k-means is much more efficient than agglomerative
hierarchical clustering.
O(n) vs. O(n2)
SFU, CMPT 741, Fall 2009, Martin Ester
323
Text and Web Clustering
Suffix Tree Clustering [Zamir & Etzioni 1998]
Forming Clusters
Not by similar feature vectors
But by common terms
Strengths of Suffix Tree Clustering (STC)
Efficiency: runtime O(n) for n text documents
Overlapping clusters
Method
1. Identification of „basic clusters“
2. Combination of basic clusters
SFU, CMPT 741, Fall 2009, Martin Ester
324
Text and Web Clustering
Identification of Basic Clusters
• Basic Cluster: set of documents sharing one specific phrase
• Phrase: multi-word term
• Efficient identification of basic clusters using a suffix-tree
cat ate cheese
ate cheese
cheese
Insertion of (1) „cat ate cheese“
2, 4
too
cat ate cheese
1, 1
1, 2
ate cheese
1, 3
cheese
too
1, 1
1, 2
mouse ate cheese too
too
1, 3
2, 1
Insertion of (2) „mouse ate cheese too“
2, 2
SFU, CMPT 741, Fall 2009, Martin Ester
2, 3
325
Text and Web Clustering
Combination of Basic Clusters
• Basic clusters are highly overlapping
• Merge basic clusters having too much overlap
• Basic clusters graph: nodes represent basic clusters
Edge between A and B iff |A  B| / |A| > 0,5 and |A  B| / |B| > 0,5
• Composite cluster:
a component of the basic clusters graph
• Drawback of this approach:
Distant members of the same component need not be similar
No evaluation on standard test data
SFU, CMPT 741, Fall 2009, Martin Ester
326
Text and Web Clustering
Example from the Grouper System
SFU, CMPT 741, Fall 2009, Martin Ester
327
Text and Web Clustering
Frequent-Term-Based
Clustering [Beil, Ester & Xu 2002]
• Frequent term set:
description of a cluster
• Set of documents containing
all terms of the frequent term
set: cluster
• Clustering: subset of set of all
frequent term sets covering
the DB with a low mutual
overlap
{} {D1, . . ., D16}
{sun}
{fun}
{beach}
{D1, D2, D4, D5,
{D1, D3, D4, D6,
{D2, D7, D8,
D6, D8, D9, D10,
D7, D8, D10, D11,
D9, D10, D12,
D11, D13, D15}
D14, D15, D16}
D13,D14, D15}
{sun, fun}
{sun, beach}
{D1, D4, D6, D8, {D2, D8, D9,
D10, D11, D15}
{fun, surf}
{beach, surf}
...
D10, D11, D15}
{sun, fun, surf}
{sun, beach, fun}
{D1, D6, D10, D11}
{D8, D10, D11, D15}
SFU, CMPT 741, Fall 2009, Martin Ester
...
328
Text and Web Clustering
Method
• Task: efficient calculation of the overlap of a given
(description)
Fi
with
the
union
of
the
other
(description)s
• F: set of all frequent term sets in D
• f j : the number of all frequent term sets supported by document Dj
cluster
cluster
f j | {Fi  F | Fi  D j } |
• Standard overlap of a cluster Ci:
SO(Ci ) 
( f
DjCi
j
 1)
| Ci |
SFU, CMPT 741, Fall 2009, Martin Ester
329
Text and Web Clustering
Algorithm FTC
FTC(database D, float minsup)
SelectedTermSets:= {};
n:= |D|;
RemainingTermSets:= DetermineFrequentTermsets(D, minsup);
while |cov(SelectedTermSets)| ≠ n do
for each set in RemainingTermSets do
Calculate overlap for set;
BestCandidate:=element of Remaining TermSets with minimum overlap;
SelectedTermSets:= SelectedTermSets  {BestCandidate};
RemainingTermSets:= RemainingTermSets - {BestCandidate};
Remove all documents in cov(BestCandidate) from D and from the
coverage of all of the RemainingTermSets;
return SelectedTermSets;
SFU, CMPT 741, Fall 2009, Martin Ester
330
Text and Web Clustering
Experimental Evaluation
Classic Dataset
SFU, CMPT 741, Fall 2009, Martin Ester
331
Text and Web Clustering
Experimental Evaluation
Reuters Dataset
FTC achieves comparable clustering quality more efficiently
Better cluster description
SFU, CMPT 741, Fall 2009, Martin Ester
332
Text and Web Clustering
Joint Cluster Analysis [Ester, Ge, Gao et al 2006]
Attribute data: intrinsic properties of entities
Relationship data: extrinsic properties of entities
Existing clustering algorithms use either attribute or relationship data
Often: attribute and relationship data somewhat related, but contain
complementary information
Informative
Graph
Edges = relationships
2D location = attributes
 Joint cluster analysis of attribute and relationship data
SFU, CMPT 741, Fall 2009, Martin Ester
333
Text and Web Clustering
The Connected k-Center Problem
Given
• Dataset D as informative graph
• k - # of clusters
Connectivity constraint
Clusters need to be connected graph components
Objective function
maximum (attribute) distance (radius) between nodes and cluster center
Task
Find k centers (nodes) and a corresponding partitioning of D into k clusters
such that
• each clusters satisfies the connectivity constraint and
• the objective function is minimized
SFU, CMPT 741, Fall 2009, Martin Ester
334
Text and Web Clustering
The Hardness of the Connected k-Center Problem
Given k centers and a radius threshold r, finding the optimal assignment for the
remaining nodes is NP-hard
Because of articulation nodes (e.g., a)
Assignment of a implies
the assignment of b
b
C1
C2
a
Assignment of a to C1
minimizes the radius
SFU, CMPT 741, Fall 2009, Martin Ester
335
Text and Web Clustering
Example: CS Publications
1073: Kempe, Dobra, Gehrke: “Gossip-based computation of aggregate information”, FOCS'03
1483: Fagin, Kleinberg, Raghavan: “Query strategies for priced information”, STOC'02
True cluster labels based on conference
SFU, CMPT 741, Fall 2009, Martin Ester
336
Text and Web Clustering
Application for Web Pages
• Attribute data
text content
• Relationship data
hyperlinks
• Connectivity constraint
clusters must be connected
• New challenges
clusters can be overlapping
not all web pages must belong to a cluster (noise)
...
SFU, CMPT 741, Fall 2009, Martin Ester
337
Text and Web Classification
Overview
Standard Text Classification Methods
• Naive Bayes Classifier
• Bag of Words Model
Advanced Methods
• Use EM to exploit non-labeled training documents
• Exploit hyperlinks
SFU, CMPT 741, Fall 2009, Martin Ester
338
Text and Web Classification
Naive Bayes Classifier
• Decision rule of the Bayes Classifier: argmax P(d | c j )  P(c j )
c j C
• Assumptions of the Naive Bayes Classifier:
– d = (d1, . . ., dd)
– the attribute values di are independent from each other
• Decision rule of the Naive Bayes Classifier :
d
argmax P(c j )   P(di | c j )
c j C
i 1
SFU, CMPT 741, Fall 2009, Martin Ester
339
Text and Web Classification
Bag of Words Model
• Allows to estimate the P(dj| c) from the training documents
• Neglects position (order) of terms in document (bag!)
• Model for generating a document d of class c consisting of n
terms
• P(d i | c j )
conditional probability of observing term ti,
given that the document belongs to class c
 n( d , t )
• Naive approach : P(d i | c j ) 
d c j
i
 n( d , t
d c j ,t k V
k
)
SFU, CMPT 741, Fall 2009, Martin Ester
340
Text and Web Classification
Bag of Words Model
Problem
• Term ti appears in no training document of class cj
• ti appears in a document d to be classified
 n( d , t )  0
d c j
i
• Document d also contains terms which strongly indicate class cj,
but: P(dj| c) = 0 and P(d| c) = 0
Solution
• Smoothing of the relative frequencies in the training documents
 n( d , t )  1
P(d i | c j ) 
d c j
 n( d , t
d c j ,t k V
i
k
) | V |
SFU, CMPT 741, Fall 2009, Martin Ester
341
Text and Web Classification
Experimental Evaluation [Craven et al. 1998]
Training data
• 4127 web pages of CS departments
• Classes: department, faculty, staff, student, course, . . ., other
Major Results
• Classification accuracy of 70% to 80 % for most classes
• Only 9% accuracy for class staff, but 80% correct in superclass person
• Low classification accuracy for class other
Web pages are harder to classify than „professional“ text documents
SFU, CMPT 741, Fall 2009, Martin Ester
342
Text and Web Classification
Exploiting Unlabeled Documents
[Nigam et al. 1999]
• Labeling is laborious, but many unlabeled documents available.
• Exploit unlabeled documents to improve classification accuracy:
Initial training
Labeled Docs
Assign weighted class
labels (Estimation)
Classifier
re-train
(Maximization)
Unlabeled Docs
SFU, CMPT 741, Fall 2009, Martin Ester
343
Text and Web Classification
Exploiting Unlabeled Documents
• Let training documents d belong to classes in a graded manner
Pr(c|d).
• Initially labeled documents have 0/1 membership.
• Expectation Maximization (EM)
repeat
calculate class model parameters P(di|c);
determine membership probabilities P(c|d);
until classification accuracy does no longer improve.
SFU, CMPT 741, Fall 2009, Martin Ester
344
Text and Web Classification
Exploiting Unlabeled Documents
Small labeled set  large accuracy boost
SFU, CMPT 741, Fall 2009, Martin Ester
345
Text and Web Classification
Exploiting Hyperlinks [Chakrabarti, Dom & Indyk 1998]
• Logical “documents” are often fragmented into several web pages.
• Neighboring web pages often belong to the same class.
• Web documents are extremely diverse.
• Standard text classifiers perform poorly on web pages.
• For classification of web pages, use
 text of the page itself
 text and class labels of neighboring pages.
SFU, CMPT 741, Fall 2009, Martin Ester
346
Text and Web Classification
Exploiting Hyperlinks
• c = class, t = text, N = neighbors
• Text-only model: P(t|c)
• Using neighbors’ text to judge
my topic: P(t, t(N) | c)
• Using neighbors’ class label
to judge my topic: P(t, c(N) | c)
• Most of the neighbor’s class
labels are unknown.
• Method: iterative relaxation labeling.
SFU, CMPT 741, Fall 2009, Martin Ester
?
347
Text and Web Classification
• Using neighbors’ text increases
classification error.
• Even when tagging non-local
texts.
• Using neighbors’ class labels
reduces classification error.
• Using text and class of
neighbors yields the lowest
classification error.
%Error
Experimental Evaluation
40
35
30
25
20
15
10
5
0
0
50
100
%Neighborhood known
Text
SFU, CMPT 741, Fall 2009, Martin Ester
Class
Text+Class
348
References
Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", Proc. KDD ’02, 2002.
Brin S., Page L.: ”The anatomy of a large-scale hypertextual Web search engine”, Proc.
WWW7, 1998.
Chakrabarti S., Dom B., Indyk P.: “Enhanced hypertext categorization using hyperlinks”,
Proc. ACM SIGMOD 1998.
Craven M., DiPasquo D., Freitag D., McCallum A., Mitchell T., Nigam K., Slattery S.:
“Learning to Extract Symbolic Knowledge from the World Wide Web”, Proc. AAAI 1998.
Deerwater S., Dumais S. T., Furnas G. W., Landauer T. K., Harshman R.: “Indexing by
latent semantic analysis”, Journal of the Society for Information Science, 41(6), 1990.
Ester M., Ge R., Gao B. J., Hu Z., Ben-Moshe B.: “Joint Cluster Analysis of Attribute Data
and Relationship Data: the Connected k-Center Problem”, Proc. SDM 2006.
Kleinberg J.: “Authoritative sources in a hyperlinked environment”, Proc. SODA 1998.
Nigam K., McCallum A., Thrun S., Mitchell T.: “Text Classification from Labeled and
Unlabeled Documents using EM”, Machine Learning, 39(2/3), pp. 103-134, 2000.
Steinbach M., Karypis G., Kumar V.: “A comparison of document clustering techniques”,
Proc. KDD Workshop on Text Mining, 2000.
Zamir O., Etzioni O.: “Web Document Clustering: A Feasibility Demonstration”, Proc.
SIGIR 1998.
SFU, CMPT 741, Fall 2009, Martin Ester
349