Part II: Web Content Mining
Chapter 3: Clustering
• Introduction
• Hierarchical Agglomerative Clustering
• K-Means Clustering
• Probability-Based Clustering
• Collaborative Filtering
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007.
Slides for Chapter 1: Information Retrieval and Web Search
Two Class Mixture

Class   Value
A       0
B       0.961
A       0
A       0
B       0.780
A       0
B       0
A       0
B       0.980
A       0
B       0.135
A       0.490
B       0.928
B       0
B       0.658
A       0
A       0
A       0.387
A       0.570
B       0
Normal (Gaussian) Distribution
[Figure: probability density curves for classes A and B; x-axis from -1 to 2, y-axis from 0 to 2]

Class   Mean              Standard deviation   Probability of sampling
A       $\mu_A = 0.132$   $\sigma_A = 0.229$   $P(A) = 0.55$
B       $\mu_B = 0.494$   $\sigma_B = 0.449$   $P(B) = 0.45$
Finite Mixture Problem
Given the labeled data, for each class C compute:
• mean: $\mu_C = \frac{1}{|C|} \sum_{x \in C} x$
• standard deviation: $\sigma_C = \sqrt{\frac{1}{|C|} \sum_{x \in C} (x - \mu_C)^2}$
• probability of sampling: $P(C)$
Generative document model: $\mu_C$, $\sigma_C$, $P(C)$
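The generative reading of the model can be sketched in a few lines: pick a class according to P(C), then draw a value from that class's normal distribution. The sketch below is only an illustration under stated assumptions: the parameter values are the ones reported on the slides for classes A and B, while the names `PARAMS` and `sample_mixture` and the fixed seed are hypothetical choices made here, not part of the original material.

```python
import random

# Parameters as reported on the slides for classes A and B.
PARAMS = {
    "A": {"mean": 0.132, "std": 0.229, "prior": 0.55},
    "B": {"mean": 0.494, "std": 0.449, "prior": 0.45},
}


def sample_mixture(n, params=PARAMS, seed=0):
    """Draw n (label, value) pairs from the two-class mixture.

    Each draw first picks a class C with probability P(C), then samples
    x from a normal distribution with that class's mean and std.
    """
    rng = random.Random(seed)  # fixed seed (assumption) for repeatable draws
    labels = list(params)
    priors = [params[c]["prior"] for c in labels]
    samples = []
    for _ in range(n):
        c = rng.choices(labels, weights=priors)[0]
        x = rng.gauss(params[c]["mean"], params[c]["std"])
        samples.append((c, x))
    return samples


if __name__ == "__main__":
    draws = sample_mixture(10_000)
    share_a = sum(1 for c, _ in draws if c == "A") / len(draws)
    print(f"share of class A: {share_a:.3f}")
```

With enough draws the share of class A should settle near P(A) = 0.55. Note that a normal distribution can produce negative values, which the labeled data on the slides never contains, so this is a model of the data rather than a replica of it.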
Finite Mixture Problem (2)

Class   Value
A       0
B       0.961
A       0
A       0
B       0.780
A       0
B       0
A       0
B       0.980
A       0
B       0.135
A       0.490
B       0.928
B       0
B       0.658
A       0
A       0
A       0.387
A       0.570
B       0

$\mu_A = \frac{1}{11}(0 + 0 + 0 + 0 + 0 + 0 + 0.49 + 0 + 0 + 0.387 + 0.57) = 0.132$
$\mu_B = \frac{1}{9}(0.961 + 0.780 + 0 + 0.980 + 0.135 + 0.928 + 0 + 0.658 + 0) = 0.494$
$\sigma_A = 0.229$
$\sigma_B = 0.449$
$P(A) = \frac{11}{20} = 0.55$
$P(B) = \frac{9}{20} = 0.45$
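The numbers above can be checked with a short script. This is a minimal sketch, not the authors' code: `DATA` is the list of 20 labeled values from the slide, and `estimate` is a hypothetical helper name. One assumption is worth flagging: the reported standard deviations (0.229 and 0.449) are reproduced when the sum of squared deviations is divided by |C| - 1 (the sample standard deviation); dividing by |C| gives slightly smaller values.

```python
import math

# The 20 labeled values from the slide, as (class, value) pairs.
DATA = [
    ("A", 0), ("B", 0.961), ("A", 0), ("A", 0), ("B", 0.780),
    ("A", 0), ("B", 0), ("A", 0), ("B", 0.980), ("A", 0),
    ("B", 0.135), ("A", 0.490), ("B", 0.928), ("B", 0), ("B", 0.658),
    ("A", 0), ("A", 0), ("A", 0.387), ("A", 0.570), ("B", 0),
]


def estimate(data):
    """Return {class: (mean, std, prior)} estimated from labeled pairs."""
    by_class = {}
    for label, x in data:
        by_class.setdefault(label, []).append(x)
    result = {}
    for label, xs in by_class.items():
        n = len(xs)
        mean = sum(xs) / n
        # Assumption: n - 1 denominator, which matches the slide's values.
        std = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
        prior = n / len(data)  # P(C) as the class share of the sample
        result[label] = (mean, std, prior)
    return result


if __name__ == "__main__":
    for label, (mean, std, prior) in sorted(estimate(DATA).items()):
        print(f"{label}: mean={mean:.3f} std={std:.3f} P={prior:.2f}")
```

Run as a script, this prints the mean, standard deviation, and sampling probability per class, rounded to the same precision as the slide.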
Classification Problem
Given $\mu_C$, $\sigma_C$, $P(C)$ for classes A and B, compute $P(A|x)$ and $P(B|x)$.
Use
$P(C|x) = \frac{P(x|C)\,P(C)}{P(x)}$ if x is a discrete variable
$P(C|x) = \frac{f_C(x)\,P(C)}{P(x)}$ if x is a continuous variable
where $f_C(x)$ is the probability density function
$f_C(x) = \frac{1}{\sqrt{2\pi}\,\sigma_C} e^{-\frac{(x - \mu_C)^2}{2\sigma_C^2}}$
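For the continuous case, the classification rule can be sketched as follows. The parameter values are those reported on the slides; the function names `density` and `posterior` are hypothetical. The slides do not spell out how P(x) is obtained, so this sketch assumes the usual normalization over the two classes, P(x) = f_A(x) P(A) + f_B(x) P(B), which makes the two posteriors sum to 1.

```python
import math

# Parameters as reported on the slides for classes A and B.
PARAMS = {
    "A": {"mean": 0.132, "std": 0.229, "prior": 0.55},
    "B": {"mean": 0.494, "std": 0.449, "prior": 0.45},
}


def density(x, mean, std):
    """Normal probability density f_C(x) with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * std)


def posterior(x, params=PARAMS):
    """Return {class: P(C | x)} using Bayes' rule with Gaussian densities.

    Assumption: P(x) is the sum of f_C(x) * P(C) over all classes.
    """
    joint = {
        c: density(x, p["mean"], p["std"]) * p["prior"]
        for c, p in params.items()
    }
    evidence = sum(joint.values())
    return {c: v / evidence for c, v in joint.items()}


if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0):
        post = posterior(x)
        print(f"x={x}: P(A|x)={post['A']:.3f} P(B|x)={post['B']:.3f}")
```

With these parameters, values near 0 come out more probable under class A, while values near 1 come out more probable under class B, since class A's density is narrow and centered close to 0.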