Proceedings of National Conference on New Horizons in IT - NCNHIT 2013
Clustering and its Applications
L.V. Bijuraj
Abstract--- Cluster analysis, or clustering, groups a collection of
objects into clusters of similar objects. It is used in many
real-world applications, such as data and text mining, voice mining,
image processing, and web mining. This paper presents how and why
clustering is important in these fields and how its techniques are
implemented in several applications.
L.V. Bijuraj, Dept. of BCA and SS, Sri Krishna Arts and Science College,
Coimbatore, India.
ISBN 978-93-82338-79-6
I. CLUSTERING
Cluster analysis or clustering is the task of grouping a set of
objects in such a way that objects in the same group (called a
cluster) are more similar (in some sense or another) to each other
than to those in other groups (clusters). It is a main task of
exploratory data mining, and a common technique for statistical data
analysis used in many fields, including machine learning, pattern
recognition, image analysis, information retrieval, and
bioinformatics.
A. Types of Clustering
A cluster is a “collection of data objects”. Two kinds of similarity
characterize a clustering:
• Intraclass similarity - Objects are similar to objects in the same
  cluster
• Interclass dissimilarity - Objects are dissimilar to objects in
  other clusters
B. Methods of Clustering
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods
• Partitioning Methods
The partitioning methods generally result in a set of M clusters,
each object belonging to one cluster. Each cluster may be represented
by a centroid or a cluster representative; this is some sort of
summary description of all the objects contained in a cluster. The
precise form of this description will depend on the type of object
being clustered. Where real-valued data is available, the arithmetic
mean of the attribute vectors for all objects within a cluster
provides an appropriate representative; alternative types of centroid
may be required in other cases, e.g., a cluster of documents can be
represented by a list of those keywords that occur in some minimum
number of documents within a cluster. If the number of clusters is
large, the centroids can be further clustered to produce a hierarchy
within a dataset.
Single Pass: A very simple partition method, the single pass method
creates a partitioned dataset as follows:
1. Make the first object the centroid for the first cluster.
2. For the next object, calculate the similarity, S, with each
   existing cluster centroid, using some similarity coefficient.
3. If the highest calculated S is greater than some specified
   threshold value, add the object to the corresponding cluster and
   redetermine the centroid; otherwise, use the object to initiate a
   new cluster. If any objects remain to be clustered, return to
   step 2.
• Hierarchical Methods
Connectivity based clustering, also known as hierarchical clustering,
is based on the core idea of objects being more related to nearby
objects than to objects farther away. As such, these algorithms
connect "objects" to form "clusters" based on their distance. A
cluster can be described largely by the maximum distance needed to
connect parts of the cluster. At different distances, different
clusters will form, which can be represented using a dendrogram,
which explains where the common name "hierarchical clustering" comes
from: these algorithms do not provide a single partitioning of the
data set, but instead provide an extensive hierarchy of clusters that
merge with each other at certain distances.
Hierarchical clustering follows two main approaches:
• Agglomerative approach
• Divisive approach
• Hierarchical Agglomerative Methods
The hierarchical agglomerative clustering methods are the most
commonly used. A hierarchical agglomerative classification can be
constructed by the following general algorithm:
1. Find the two closest objects and merge them into a cluster.
2. Find and merge the next two closest points, where a point is
   either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2.
Individual methods are characterized by the definition used to
identify the closest pair of points, and by the means used to
describe the new cluster when two clusters are merged.
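The general agglomerative algorithm can be sketched in a few lines of
Python. This is a minimal illustration under stated assumptions, not
a definitive implementation: the function names (agglomerate,
single_link, complete_link) and the sample points are hypothetical,
and single and complete linkage are used only as two common ways of
defining the "closest pair" that the paper leaves open.

```python
from itertools import combinations
from math import dist


def single_link(a, b):
    """Distance between two clusters: their closest pair of members."""
    return min(dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Distance between two clusters: their farthest pair of members."""
    return max(dist(p, q) for p in a for q in b)


def agglomerate(points, linkage=single_link):
    """Merge the two closest clusters until one remains.

    Returns the merge history as (left, right, distance) tuples,
    which is the information a dendrogram would display.
    """
    clusters = [[tuple(p)] for p in points]  # every object starts alone
    history = []
    while len(clusters) > 1:  # step 3: repeat while several clusters remain
        # steps 1 and 2: find the closest pair of current clusters
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        d = linkage(clusters[i], clusters[j])
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


# Hypothetical sample data: two tight pairs that are far from each other
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
for left, right, d in agglomerate(points):
    print(left, "+", right, "at distance", round(d, 2))
```

Passing linkage=complete_link changes only how the distance between
two clusters is measured; the merge loop itself stays the same, which
is the sense in which individual methods differ by their definition
of closeness.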
• Density Based Clustering
In density-based clustering, clusters are defined as areas of higher
density than the remainder of the data set. Objects in the sparse
areas that separate clusters are usually considered to be noise and
border points.
Density Reachability - A point "p" is said to be density reachable
from a point "q" if "p" is within ε distance of "q" and "q" has a
sufficient number of points in its neighborhood, i.e., within
distance ε.
Density Connectivity - Points "p" and "q" are said to be density
connected if there exists a point "r" which has a sufficient number
of points in its neighborhood and both "p" and "q" are within ε
distance of it. This is a chaining process. So, if "q" is a neighbor
of "r", "r" is a neighbor of "s", and "s" is a neighbor of "t", which
in turn is a neighbor of "p", then "q" is treated as a neighbor of
"p".
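Density reachability and the chaining behind density connectivity can
be made concrete with a short sketch of DBSCAN, the density-based
algorithm this paper describes, with its two parameters ε (eps) and
minPts. The sketch is illustrative only: the function names, the use
of -1 as the noise label, and the sample points are assumptions, and
the neighborhood search is a plain linear scan rather than an
optimized index.

```python
from math import dist

NOISE = -1  # assumed label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is dense enough to start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:  # chain through density-reachable points
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is dense too: keep expanding
    return labels


# Hypothetical sample data: two dense groups and one isolated point
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
labels = dbscan(sample, eps=1.5, min_pts=3)
print(labels)
```

In this sample the first four points form one cluster, the next three
form a second, and the isolated point is labeled as noise, without
the number of clusters being specified in advance.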
Algorithmic steps for DBSCAN Clustering
Let X = {x1, x2, x3, ..., xn} be the set of data points. DBSCAN
requires two parameters: ε (eps) and the minimum number of points
required to form a cluster (minPts).
1) Start with an arbitrary starting point that has not been visited.
2) Extract the neighborhood of this point using ε (all points which
are within ε distance form the neighborhood).
3) If there are sufficient neighbors around this point, then the
clustering process starts and the point is marked as visited;
otherwise this point is labeled as noise (later this point can become
part of a cluster).
4) If a point is found to be part of the cluster, then its ε
neighborhood is also part of the cluster, and the above procedure
from step 2 is repeated for all ε neighborhood points. This is
repeated until all points in the cluster are determined.
5) A new unvisited point is retrieved and processed, leading to the
discovery of a further cluster or noise.
6) This process continues until all points are marked as visited.
Advantages
1) Does not require a-priori specification of the number of clusters.
2) Able to identify noise data while clustering.
3) The DBSCAN algorithm is able to find clusters of arbitrary size
and arbitrary shape.
Disadvantages
1) The DBSCAN algorithm fails in case of clusters of varying density.
2) Fails in case of neck type of dataset.
3) Does not work well in case of high dimensional data.
K-means Algorithm
The K-Means algorithm is a type of partitioning method.
• Group instances based on attributes into k groups
• High intra-cluster similarity; low inter-cluster similarity
• Cluster similarity is measured with regard to the mean value of
  objects in the cluster.
• First, select K random instances from the data as initial cluster
  centers
• Second, each instance is assigned to its closest (most similar)
  cluster center
• Third, each cluster center is updated to the mean of its
  constituent instances
• Repeat steps two and three till there is no further change in the
  assignment of instances to clusters
K-Means Algorithm Properties
• There are always K clusters.
• There is always at least one item in each cluster.
• The clusters are non-hierarchical and they do not overlap.
• Every member of a cluster is closer to its cluster than any other
  cluster because closeness does not always involve the 'center' of
  clusters.
The K-Means Algorithm Process
• The dataset is partitioned into K clusters and the data points are
  randomly assigned to the clusters, resulting in clusters that have
  roughly the same number of data points.
• For each data point:
  • Calculate the distance from the data point to each cluster.
  • If the data point is closest to its own cluster, leave it where
    it is. If the data point is not closest to its own cluster, move
    it into the closest cluster.
• Repeat the above step until a complete pass through all the data
  points results in no data point moving from one cluster to another.
  At this point the clusters are stable and the clustering process
  ends.
• The choice of initial partition can greatly affect the final
  clusters that result, in terms of inter-cluster and intracluster
  distances and cohesion.
Example
Let us apply the k-Means clustering algorithm to the same example as
in the previous page and obtain four clusters.
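The k-means process (assign each instance to its closest center,
update each center to the mean of its members, repeat until nothing
moves) can also be sketched in code. The block below is a minimal
illustration, not a definitive implementation. It uses the seven food
items of this paper's example as (protein, fat) pairs and starts from
food items #1 to #4 as the four initial centers, mirroring the
example's initial choice; the function names mean and k_means and the
max_iter parameter are hypothetical.

```python
from math import dist

# (protein, fat) for food items #1..#7, as tabulated in the example
foods = [(1.1, 60), (8.2, 20), (4.2, 35), (1.5, 21),
         (7.6, 15), (2.0, 55), (3.9, 39)]


def mean(points):
    """Component-wise arithmetic mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centers, max_iter=100):
    """Alternate assignment and centroid update until assignments are stable."""
    centers = list(centers)
    assignment = None
    for _ in range(max_iter):
        # assign each point to its closest center (Euclidean distance)
        new_assignment = [
            min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point moved, so the clusters are stable
        assignment = new_assignment
        # move each center to the mean of its members (unchanged if empty)
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean(members)
    return assignment, centers


# Assumption: food items #1-#4 serve as the four initial centers
assignment, centers = k_means(foods, foods[:4])
for c, center in enumerate(centers):
    items = [i + 1 for i, a in enumerate(assignment) if a == c]
    print("cluster", c + 1, "items", items,
          "centroid", tuple(round(v, 2) for v in center))
```

With these inputs the sketch groups the items as {1, 6}, {2, 5},
{3, 7} and {4}, with centroids (1.55, 57.5), (7.9, 17.5), (4.05, 37)
and (1.5, 21), which agrees with the centroids worked out by hand in
the example.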
Food item #     Protein content, P    Fat content, F
Food item #1    1.1                   60
Food item #2    8.2                   20
Food item #3    4.2                   35
Food item #4    1.5                   21
Food item #5    7.6                   15
Food item #6    2.0                   55
Food item #7    3.9                   39
Let us plot these points so that we can have a better understanding
of the problem. Also, we can select the points which are farthest
apart.
We see from the graph that the distance between the points 1 and 2,
1 and 3, 1 and 4, 1 and 5, 2 and 3, 2 and 4, 3 and 4 is maximum.
Thus, the four clusters chosen are:
Cluster number  Protein content, P    Fat content, F
C1              1.1                   60
C2              8.2                   20
C3              4.2                   35
C4              1.5                   21
Also, we observe that point 1 is close to point 6. So, both can be
taken as one cluster. The resulting cluster is called the C16
cluster. The value of P for the C16 centroid is (1.1 + 2.0)/2 = 1.55
and F for the C16 centroid is (60 + 55)/2 = 57.50.
Upon closer observation, point 2 can be merged with point 5. The
resulting cluster is called the C25 cluster. The value of P for the
C25 centroid is (8.2 + 7.6)/2 = 7.9 and F for the C25 centroid is
(20 + 15)/2 = 17.50.
Point 3 is close to point 7. They can be merged into the C37 cluster.
The value of P for the C37 centroid is (4.2 + 3.9)/2 = 4.05 and F for
the C37 centroid is (35 + 39)/2 = 37.
Point 4 is not close to any point. So, it is assigned to cluster
number 4, i.e., C4, with the value of P for the C4 centroid as 1.5
and F for the C4 centroid as 21.
Finally, four clusters with the following centroids have been
obtained.
Cluster number  Protein content, P    Fat content, F
C16             1.55                  57.50
C25             7.9                   17.5
C37             4.05                  37
C4              1.5                   21
In the above example it was quite easy to estimate the distance
between the points. In cases in which it is more difficult to
estimate the distance, one has to use the Euclidean metric to measure
the distance between two points to assign a point to a cluster.
II. APPLICATIONS OF CLUSTERING
Clustering has been applied in various fields. Some of the
applications are:
Use of Clustering in Data Mining: Clustering is often one of the
first steps in data mining analysis. It identifies groups of related
records that can be used as a starting point for exploring further
relationships. This technique supports the development of population
segmentation models, such as demographic-based customer segmentation.
Additional analyses using standard analytical and other data mining
techniques can determine the characteristics of these segments with
respect to some desired outcome. For example, the buying habits of
multiple population segments might be compared to determine which
segments to target for a new sales campaign.
For example, a company that sells a variety of products may need to
know about the sales of all of its products in order to check which
products sell extensively and which are lacking. This is done by data
mining techniques. But if the system clusters the products that sell
less, then only the cluster of such products has to be checked,
rather than comparing the sales values of all the products. This
facilitates the mining process.
A. Application of Clustering in Text Mining
Text mining, also referred to as text data mining, roughly equivalent
to text analytics, refers to the process of deriving high-quality
information from text. High-quality information is typically derived
through the devising of patterns and trends through means such as
statistical pattern learning. Text mining usually involves the
process of structuring the input text (usually parsing, along with
the addition of some derived linguistic features and the removal of
others, and subsequent insertion into a database), deriving patterns
within the structured data, and finally evaluation and interpretation
of the output. 'High quality' in text mining usually refers to some
combination of relevance, novelty, and interestingness. Typical text
mining tasks include text categorization, text clustering,
concept/entity extraction, production of granular taxonomies,
sentiment analysis, document summarization, and entity relation
modeling.
Text mining consists of extracting information from hidden patterns
in large text-data collections.
It is the mining of the data in the web page…in the database
websites.
B. Working of Cluster in the Search Engines
An information retrieval system works on web documents. The document
source is the set of documents of the web pages. The query is the
search submitted to the search engine. Using clustering, the
documents are classified based on the query in the information
retrieval system. The ranked documents represent the relevant details
present in the documents which are relevant to the query.
The query is given to the system and is resolved using the search
navigation system. The documents retrieved for the query are shown in
the diagram. These are extracted using the name extractor. From the
authorisation list, the ranking details are viewed.
C. Some other Applications of Clustering
Clustering is used in the following fields of application:
• Data Mining
• Pattern recognition
• Image analysis
• Bioinformatics
• Machine Learning
• Voice mining
• Image processing
• Text mining
• Web cluster engines
• Weather report analysis
III. CONCLUSION
Clustering is used in various fields of real life. This paper
described the clustering process and explained a few of its methods
and applications. Clustering is also used in various other fields
that are not covered here.