Web Document Clustering
Department of Electrical and Computer Engineering
Seyed HamidReza Mohammadi
Fall 2013
1
Outline
What is Clustering?
Applications
Challenges
Document Clustering Techniques
Web Search Results Clustering
References
2
What is [Document] Clustering?
The act of grouping similar objects (documents) into sets;
each set is called a cluster (automatic classification).
Maximum similarity within each set.
Minimum similarity between sets.
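The two goals above can be checked numerically on a toy example. The sketch below is only an illustration: the bag-of-words vectors, the cosine measure, and the sample documents are assumptions, not taken from the slides.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a text into a term-frequency vector (illustrative bag of words)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_similarity(group_a, group_b):
    """Average similarity over all pairs, skipping a document paired with itself."""
    sims = [cosine(x, y) for x in group_a for y in group_b if x is not y]
    return sum(sims) / len(sims) if sims else 0.0

# Hypothetical toy clusters
sports = [vectorize("football match goal team"), vectorize("team wins football cup")]
cooking = [vectorize("recipe pasta sauce tomato"), vectorize("tomato soup recipe")]

intra = mean_similarity(sports, sports)    # similarity inside one cluster
inter = mean_similarity(sports, cooking)   # similarity between clusters
print(intra, inter)
```

A good clustering should give a high intra-cluster value and a low inter-cluster value, as it does for these toy sets.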
3
Applications
Finding Similar Documents
Organizing Large Document Collections
Duplicate Content Detection
Recommendation Systems
Social Networks
Search Optimization
4
Challenges
Selecting appropriate features
Selecting an appropriate similarity measure
Selecting an appropriate clustering method
Implementing the clustering algorithm efficiently
Assessing the quality of the clustering
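For the last challenge, one common external measure is purity. The slides do not name a specific measure, so purity is an assumption here, and it requires reference labels that may not exist for real web data. A minimal sketch:

```python
from collections import Counter

def purity(cluster_ids, true_labels):
    """Fraction of documents that carry the majority label of their cluster."""
    if not true_labels or len(cluster_ids) != len(true_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    # Count, per cluster, the documents that agree with the most frequent label
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)

# Hypothetical clustering of six documents with known topics
cluster_ids = [0, 0, 0, 1, 1, 1]
true_labels = ["a", "a", "b", "b", "b", "a"]
score = purity(cluster_ids, true_labels)
print(score)
```

Purity rises trivially as the number of clusters grows, so it is best compared between clusterings with the same number of clusters.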
5
Document Clustering Techniques
1) Hierarchical
2) Partitional
3) Graph Based
4) Neural Network
5) Fuzzy
6) Probabilistic
6
1-Hierarchical
i. Agglomerative
ii. Divisive
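The agglomerative (bottom-up) variant can be sketched as follows. The single-link merge criterion, Euclidean distance, and the stopping rule of k clusters are assumptions chosen for brevity; the slides do not specify them.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2, points):
    """Distance between two clusters = distance of their closest pair of members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points, k):
    """Start with one cluster per point and merge the closest pair until k remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

# Hypothetical 2-D points standing in for document vectors
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.2, 0.1)]
result = agglomerative(points, 2)
print(result)
```

A divisive method works in the opposite direction: it starts with all documents in one cluster and splits repeatedly.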
7
2-Partitional
K-Means
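A minimal K-Means sketch is shown below. To keep it deterministic, the first k points are used as initial centroids; that choice, the squared Euclidean distance, and the sample points are assumptions for illustration only.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = [list(p) for p in points[:k]]  # assumed initialization
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        # Update step: each centroid becomes the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids

# Hypothetical 2-D points standing in for document vectors
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1)]
labels, centroids = kmeans(points, 2)
print(labels, centroids)
```

In practice the result depends on the initial centroids, so the algorithm is usually run several times with different starting points.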
8
3-Graph Based
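One simple graph-based scheme, sketched here as an assumption since the slide gives no details, treats documents as nodes, connects two documents when their similarity reaches a threshold, and reports connected components as clusters. The Jaccard measure and the threshold value are illustrative choices.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def graph_clusters(docs, threshold):
    """Cluster documents as connected components of a similarity graph."""
    token_sets = [set(d.lower().split()) for d in docs]
    n = len(docs)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(token_sets[i], token_sets[j]) >= threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search collects one connected component
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters

# Hypothetical documents for an ambiguous query
docs = [
    "jaguar car speed engine",
    "jaguar car engine price",
    "jaguar cat jungle habitat",
    "jungle cat habitat prey",
]
components = graph_clusters(docs, 0.5)
print(components)
```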
9
Techniques (cont.)
4) Neural Network
n document inputs, k cluster outputs of the network
5) Fuzzy (c-means)
Uses a membership function to give each document a degree of
membership in every cluster
6) Probabilistic
Calculates the probability that a document belongs to each
cluster
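The membership function of fuzzy c-means can be illustrated with its standard update rule, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), where d_i is the distance to center i and m > 1 is the fuzzifier. The sketch below covers only this membership step for fixed centers; the centers, the point, and m = 2 are assumed values.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def memberships(point, centers, m=2.0):
    """Degrees of membership of one point in every cluster (they sum to 1)."""
    dists = [euclidean(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

# Hypothetical cluster centers and a point close to the first one
centers = [(0.0, 0.0), (4.0, 0.0)]
u = memberships((1.0, 0.0), centers)
print(u)
```

Unlike K-Means, which assigns each document to exactly one cluster, the result is a soft assignment: the point above belongs mostly, but not entirely, to the first cluster.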
10
Web Search Results Clustering
11
Web Search Results Clustering
Why?
- A flat ranked list is not enough
- It ignores relationships between the results (cluster hypothesis)
- Irrelevant pages are returned
- Queries are limited (few keywords)
- Synonymy and polysemy
- Spam
12
Web Search Results Clustering
Benefits:
- Information is easier to find
- A poorly formulated query is recognized faster
- Fewer users give up before reaching the desired result
13
Web Search Results Clustering
Main issues:
Speed: immediate response to the query
Flexibility: Web content changes constantly
User-oriented: the main goal is to aid the user in finding the
sought information
Online or offline clustering
14
Web Search Results Clustering
Main issues:
What to use as input?
- Entire documents
- Snippets
- Structure information (links)
- Other data (e.g. click-through)
How to define similarity?
- Content (e.g. vector-space model)
- Link analysis
- Usage statistics
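Content similarity in the vector-space model can be sketched on snippets as follows. The TF-IDF weighting (raw term count times log(N/df)), whitespace tokenization, and the sample snippets are assumptions; real systems typically add stemming and stop-word removal.

```python
import math
from collections import Counter

def tfidf_vectors(snippets):
    """Build one sparse TF-IDF vector (dict term -> weight) per snippet."""
    tokenized = [s.lower().split() for s in snippets]
    n = len(tokenized)
    # Document frequency: in how many snippets each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * math.log(n / df[t]) for t, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical snippets returned for the ambiguous query "python"
snippets = [
    "python snake habitat",
    "python snake food",
    "python programming language",
]
vecs = tfidf_vectors(snippets)
sim_same_sense = cosine(vecs[0], vecs[1])
sim_other_sense = cosine(vecs[0], vecs[2])
print(sim_same_sense, sim_other_sense)
```

Because the query term occurs in every snippet, its weight is zero, so similarity is driven by the remaining terms; this is one reason snippet clustering can separate different senses of the same query.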
15
Web Search Results Clustering
Systems:
Scatter/Gather
Grouper
Carrot2
Vivisimo
Mapuccino
(Su et al., 2001)
SHOC
16
Web Search Results Clustering
Grouper
17
References
[1] O. Zamir and O. Etzioni: Web Document Clustering. University of
Washington, 2004.
[2] N. Oikonomakou: A Review of Web Document Clustering
Approaches. Athens University, 2005.
[3] M. Steinbach, G. Karypis: A Comparison of Document Clustering
Techniques. University of Minnesota, 2007.
[4] H. Yang: A Document Clustering Algorithm for Web Search
Engine Retrieval System. Yunnan University, 2010.
[5] O. Zamir and O. Etzioni: Grouper: A Dynamic Clustering Interface to
Web Search Results. University of Washington, 200.
[6] P. T. Santhiya, R. Tamizharasi: A Min-Hash Algorithm for Clustering
Web Documents. S.A. Engineering College, Chennai-77, India, 2013.
[7] D. Húsek, J. Pokorný: Data Clustering: From Documents to
the Web. Academy of Sciences of the Czech Republic, 2008.
18
Thank You
?
19