Download Part I: Web Structure Mining Chapter 1: Information

Part II: Web Content Mining
Chapter 3: Clustering

• Introduction
• Hierarchical Agglomerative Clustering
• K-Means Clustering
• Probability-Based Clustering
• Collaborative Filtering

Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007. Slides for Chapter 1: Information Retrieval and Web Search

Two Class Mixture

Class   Value
A       0
B       0.961
A       0
A       0
B       0.780
A       0
B       0
A       0
B       0.980
A       0
B       0.135
A       0.490
B       0.928
B       0
B       0.658
A       0
A       0
A       0.387
A       0.570
B       0

Normal (Gaussian) Distribution

[Figure: probability density curves for classes A and B. Vertical axis: Probability Density, 0 to 2. Horizontal axis: -1 to 2.]

Class                     A                    B
Mean                      $\mu_A = 0.132$      $\mu_B = 0.494$
Standard deviation        $\sigma_A = 0.229$   $\sigma_B = 0.449$
Probability of sampling   $P(A) = 0.55$        $P(B) = 0.45$

Finite Mixture Problem

Given the labeled data, for each class $C$ compute:

• mean: $$\mu_C = \frac{1}{|C|} \sum_{x \in C} x$$
• standard deviation: $$\sigma_C = \sqrt{\frac{1}{|C|} \sum_{x \in C} (x - \mu_C)^2}$$
• probability of sampling: $P(C)$

Generative document model: $\mu_C$, $\sigma_C$, $P(C)$

Finite Mixture Problem (2)

For the two-class sample listed under Two Class Mixture (11 values in class A, 9 in class B):

$$\mu_A = \frac{1}{11}(0 + 0 + 0 + 0 + 0 + 0 + 0.49 + 0 + 0 + 0.387 + 0.57) = 0.132$$

$$\mu_B = \frac{1}{9}(0.961 + 0.780 + 0 + 0.980 + 0.135 + 0.928 + 0 + 0.658 + 0) = 0.494$$

$$\sigma_A = 0.229 \qquad \sigma_B = 0.449$$

$$P(A) = \frac{11}{20} = 0.55 \qquad P(B) = \frac{9}{20} = 0.45$$

The reported standard deviations match the sample standard deviation computed with divisor $|C| - 1$; dividing by $|C|$ gives roughly 0.218 and 0.423 instead.

Classification Problem

Given $\mu_C$, $\sigma_C$, $P(C)$ for classes A and B, compute $P(A|x)$ and $P(B|x)$.

If $x$ is a discrete variable, use

$$P(C|x) = \frac{P(x|C)\,P(C)}{P(x)}$$

If $x$ is a continuous variable, use

$$P(C|x) = \frac{f_C(x)\,P(C)}{P(x)}$$

where $f_C$ is the probability density function

$$f_C(x) = \frac{1}{\sqrt{2\pi}\,\sigma_C}\, e^{-\frac{(x - \mu_C)^2}{2\sigma_C^2}}$$
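The estimation and classification steps on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. The function names `estimate`, `density`, and `posterior` are hypothetical. Two assumptions go beyond the slides: the standard deviation uses the divisor |C| - 1, because that reproduces the reported 0.229 and 0.449, and P(x) is computed as the sum of f_C(x) P(C) over both classes.

```python
import math

# Labeled sample from the slides, as (class, value) pairs in slide order.
data = [
    ("A", 0), ("B", 0.961), ("A", 0), ("A", 0), ("B", 0.780),
    ("A", 0), ("B", 0), ("A", 0), ("B", 0.980), ("A", 0),
    ("B", 0.135), ("A", 0.490), ("B", 0.928), ("B", 0), ("B", 0.658),
    ("A", 0), ("A", 0), ("A", 0.387), ("A", 0.570), ("B", 0),
]


def estimate(pairs):
    """Return {class: (mean, std, prior)} estimated from labeled pairs."""
    params = {}
    for c in sorted({label for label, _ in pairs}):
        xs = [x for label, x in pairs if label == c]
        mean = sum(xs) / len(xs)
        # Assumption: divisor len(xs) - 1, which matches the slide values.
        var = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
        params[c] = (mean, math.sqrt(var), len(xs) / len(pairs))
    return params


def density(x, mean, std):
    """Normal probability density f_C(x) for the given mean and std."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * std)


def posterior(x, params):
    """Return {class: P(C|x)} for a continuous x via Bayes' rule."""
    joint = {c: density(x, m, s) * p for c, (m, s, p) in params.items()}
    # Assumption: P(x) is the sum of f_C(x) * P(C) over all classes.
    evidence = sum(joint.values())
    return {c: j / evidence for c, j in joint.items()}


params = estimate(data)
for c, (mean, std, prior) in params.items():
    print(f"{c}: mean={mean:.3f} std={std:.3f} prior={prior:.2f}")

# Class posteriors for two example values of x.
print(posterior(0.0, params))
print(posterior(0.5, params))
```

Under these estimates, a value near 0 is assigned mostly to class A and a value near 0.5 mostly to class B, which is what the two density curves suggest.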
