Web Document Clustering
Department of Electrical and Computer Engineering
Seyed HamidReza Mohammadi
Fall 2013

Outline
  What is Clustering?
  Applications
  Challenges
  Document Clustering Techniques
  Web Search Results Clustering
  References

What is [Document] Clustering?
  The act of grouping similar objects (documents) into sets, where each set is called a CLUSTER (automatic classification).
  Maximum similarity within each set.
  Minimum similarity between sets.

Applications
  Finding similar documents
  Organizing large document collections
  Duplicate content detection
  Recommendation systems
  Social networks
  Search optimization

Challenges
  Selecting appropriate features
  Selecting an appropriate similarity measure
  Selecting an appropriate clustering method
  Implementing the clustering algorithm efficiently
  Assessing the quality of the clustering

Document Clustering Techniques
  1) Hierarchical
  2) Partitional
  3) Graph Based
  4) Neural Network
  5) Fuzzy
  6) Probabilistic

1-Hierarchical
  i. Agglomerative
  ii. Divisive

2-Partitional
  K-Means

3-Graph Based

Document Clustering Techniques (cont.)
  4) Neural Network
     n document inputs, k cluster outputs of the network
  5) Fuzzy (c-means)
     Uses a membership function for clustering
  6) Probabilistic
     Calculates the probability that a document belongs to each cluster

Web Search Results Clustering

Web Search Results Clustering: Why?
  A flat ranked list is not enough
  It ignores relationships between the results (cluster hypothesis)
  Irrelevant pages are returned
  Query limitations (few keywords)
  The phenomena of synonymy and polysemy
  Spam

Web Search Results Clustering: Benefits
  Information is easier to find
  A poorly formulated query is recognized faster
  Fewer users give up before reaching the desired result

Web Search Results Clustering: Main issues
  Speed
     Immediate response to the query
  Flexibility
     Web content changes constantly
  User-oriented
     The main goal is to aid the user in finding the sought information
  Online or offline clustering

Web Search Results Clustering: Main issues
  What to use as input?
     Entire documents
     Snippets
     Structure information (links)
     Other data (e.g. click-through)
  How to define similarity?
     Content (e.g. vector-space model)
     Link analysis
     Usage statistics

Web Search Results Clustering: Systems
  Scatter/Gather
  Grouper
  Carrot2
  Vivisimo
  Mapuccino
  (Su et al. 2001)
  SHOC

Web Search Results Clustering: Grouper

References
  [1] O. Zamir and O. Etzioni: Web Document Clustering. University of Washington, 2004.
  [2] N. Oikonomakou: A Review of Web Document Clustering Approaches. Athens University, 2005.
  [3] M. Steinbach, G. Karypis: A Comparison of Document Clustering Techniques. University of Minnesota, 2007.
  [4] H. Yang: A Document Clustering Algorithm for Web Search Engine Retrieval System. Yunnan University, 2010.
  [5] O. Zamir and O. Etzioni: Grouper: A Dynamic Clustering Interface to Web Search Results. University of Washington, 200.
  [6] P. T. Santhiya, R. Tamizharasi: A Min-Hash Algorithm for Clustering Web Documents. S.A. Engineering College, Chennai-77, India, 2013.
  [7] D. Húsek, J. Pokorný: Data Clustering: From Documents to the Web. Academy of Sciences of the Czech Republic, 2008.

Thank You
?
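
Example: clustering search snippets (sketch)

The slides name snippets as a possible input, the vector-space model as a content similarity, and K-Means as the partitional technique. The sketch below combines the three on a polysemous query ("jaguar"). It is a minimal illustration, not the method of any system listed above. The function names (tf_vector, cosine, centroid, kmeans), the raw term-frequency weighting, the fixed seed indices, and the four sample snippets are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector for one snippet (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Mean vector of a non-empty group of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, seeds, max_iter=20):
    """K-means using cosine similarity; `seeds` are indices of the initial centroids."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each snippet joins its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
    return labels


# Hypothetical snippets returned for the ambiguous query "jaguar".
snippets = [
    "jaguar car dealer price",
    "jaguar big cat habitat",
    "jaguar car engine review",
    "big cat jaguar rainforest habitat",
]
vectors = [tf_vector(s) for s in snippets]
labels = kmeans(vectors, seeds=[0, 1])
print(labels)  # [0, 1, 0, 1]
```

The two senses of the query end up in separate clusters (cars versus animals), which is the benefit the "Why?" slide attributes to clustering over a flat ranked list. A real system would likely add stop-word removal, TF-IDF weighting, and a better choice of k and of the initial centroids.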