Part II: Web Content Mining
Chapter 3: Clustering

• Introduction
• Hierarchical Agglomerative Clustering
• K-Means Clustering
• Probability-Based Clustering
• Collaborative Filtering

Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007. Slides for Chapter 1: Information Retrieval and Web Search

Two Class Mixture

Labeled sample (class, value):

A 0       B 0.961   A 0       A 0       B 0.780
A 0       B 0       A 0       B 0.980   A 0
B 0.135   A 0.490   B 0.928   B 0       B 0.658
A 0       A 0       A 0.387   A 0.570   B 0

[Figure: Normal (Gaussian) Distribution. Probability density curves for classes A and B; vertical axis: Probability Density.]

Class   Mean    Standard deviation   Probability of sampling
A       0.132   0.229                P(A) = 0.55
B       0.494   0.449                P(B) = 0.45

Finite Mixture Problem

Given the labeled data, for each class C compute:

• mean: $\mu_C = \frac{1}{|C|} \sum_{x \in C} x$
• standard deviation: $\sigma_C = \sqrt{\frac{1}{|C| - 1} \sum_{x \in C} (x - \mu_C)^2}$
• probability of sampling: $P(C)$

Generative document model: $\mu_C$, $\sigma_C$, $P(C)$

Finite Mixture Problem (2)

For the labeled sample listed under Two Class Mixture:

$\mu_A = \frac{1}{11}(0 + 0 + 0 + 0 + 0 + 0 + 0.49 + 0 + 0 + 0.387 + 0.57) = 0.132$
$\mu_B = \frac{1}{9}(0.961 + 0.780 + 0 + 0.980 + 0.135 + 0.928 + 0 + 0.658 + 0) = 0.494$
$\sigma_A = 0.229$, $\sigma_B = 0.449$
$P(A) = \frac{11}{20} = 0.55$, $P(B) = \frac{9}{20} = 0.45$

Classification Problem

Given $\mu_C$, $\sigma_C$, $P(C)$ for classes A and B, compute $P(A|x)$ and $P(B|x)$. Use

• $P(C|x) = \frac{P(x|C)\,P(C)}{P(x)}$ if x is a discrete variable
• $P(C|x) = \frac{f_C(x)\,P(C)}{P(x)}$ if x is a continuous variable

where $f_C(x)$ is the probability density function

$f_C(x) = \frac{1}{\sqrt{2\pi}\,\sigma_C} e^{-\frac{(x - \mu_C)^2}{2\sigma_C^2}}$
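The parameter estimates on the Finite Mixture Problem slides can be reproduced with a short script. The following is a minimal sketch, not the authors' code: the names `SAMPLE` and `estimate_parameters` are hypothetical, and the divisor |C| - 1 in the standard deviation is an assumption, chosen because it reproduces the slide values 0.229 and 0.449.

```python
from math import sqrt

# Labeled sample from the "Two Class Mixture" slide: (class, value) pairs.
SAMPLE = [
    ("A", 0.0), ("B", 0.961), ("A", 0.0), ("A", 0.0), ("B", 0.780),
    ("A", 0.0), ("B", 0.0), ("A", 0.0), ("B", 0.980), ("A", 0.0),
    ("B", 0.135), ("A", 0.490), ("B", 0.928), ("B", 0.0), ("B", 0.658),
    ("A", 0.0), ("A", 0.0), ("A", 0.387), ("A", 0.570), ("B", 0.0),
]


def estimate_parameters(sample):
    """Return {class: (mean, std, prior)} estimated from labeled data."""
    total = len(sample)
    params = {}
    for label in sorted({c for c, _ in sample}):
        xs = [x for c, x in sample if c == label]
        n = len(xs)
        mean = sum(xs) / n
        # Assumption: divisor n - 1 (sample standard deviation), because
        # it matches the values shown on the slides.
        std = sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
        # Probability of sampling: fraction of the sample in this class.
        params[label] = (mean, std, n / total)
    return params


if __name__ == "__main__":
    for label, (mean, std, prior) in estimate_parameters(SAMPLE).items():
        print(f"{label}: mean={mean:.3f} std={std:.3f} prior={prior:.2f}")
```

Rounded to three decimals, the script yields the means, standard deviations, and sampling probabilities listed in the slide table for classes A and B.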
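The continuous case of the Classification Problem can be illustrated in code. This is a hedged sketch rather than a definitive implementation: `PARAMS`, `gaussian_density`, and `posteriors` are hypothetical names, the parameter values are copied from the slides, and P(x) is assumed to be the sum of f_C(x) P(C) over both classes (the slides do not spell out how P(x) is obtained).

```python
from math import exp, pi, sqrt

# Parameters as listed on the slides: class -> (mean, std, prior).
PARAMS = {"A": (0.132, 0.229, 0.55), "B": (0.494, 0.449, 0.45)}


def gaussian_density(x, mean, std):
    """Normal probability density f_C(x) with the given mean and std."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (sqrt(2 * pi) * std)


def posteriors(x, params=PARAMS):
    """Return {class: P(class | x)} for a continuous value x."""
    # Numerator of Bayes' rule for each class: f_C(x) * P(C).
    joint = {c: gaussian_density(x, m, s) * p for c, (m, s, p) in params.items()}
    # Assumption: P(x) is the sum of f_C(x) * P(C) over all classes,
    # so the returned posteriors sum to 1.
    evidence = sum(joint.values())
    return {c: j / evidence for c, j in joint.items()}


if __name__ == "__main__":
    for x in (0.0, 0.5, 0.9):
        post = posteriors(x)
        print(x, {c: round(p, 3) for c, p in post.items()})
```

With these parameters, a value near 0 is assigned mostly to class A and a value near 0.9 mostly to class B, which agrees with where the labeled values of each class concentrate.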