Subject Name: Data Warehousing and Data Mining
Subject Code: 10MCA542 & IS 74
Prepared By: V. Srikanth & Harkiranpreet
Department: MCA & IS
Date: 31/10/2014
UNIT 7 CLUSTERING TECHNIQUES
Clustering techniques Overview
Features of Cluster Analysis
Types of Data and computing distance
Types of Cluster analysis methods
Partitional Methods
Density based methods
Quality and validity of cluster analysis
CLUSTERING TECHNIQUES OVERVIEW
• Clustering: finding groups of objects such that the objects in a group are
  similar (or related) to one another and different from (or unrelated to) the
  objects in other groups.
• A clustering is a set of clusters.
• An important distinction is between hierarchical and partitional sets of
  clusters.
• Partitional clustering
  – A division of the data objects into non-overlapping subsets (clusters)
    such that each data object is in exactly one subset.
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree.
FEATURES OF CLUSTER ANALYSIS
• Given the large number of cluster analysis methods on offer, we list below
  the desirable features that an ideal cluster analysis method should have.
1. Scalability: Data mining problems can be large, and it is therefore
   desirable that a cluster analysis method be able to deal with small as well
   as large problems gracefully.
2. Only one scan of the data set: For large problems, the data must be
   stored on disk, and the cost of I/O from the disk can then become
   significant in solving the problem. A single scan keeps this cost minimal.
3. Ability to stop and resume: When the data set is very large, cluster
   analysis may require considerable processor time to complete the task. In
   such cases, it is desirable that the task be able to be stopped and then
   resumed when convenient.
4. Minimal input parameters: The cluster analysis method should not
   expect too much guidance from the user.
5. Robustness: Most data obtained from a variety of sources has errors. It is
   therefore desirable that a cluster analysis method be able to deal with
   noise, outliers and missing values gracefully.
6. Ability to discover different cluster shapes: Clusters come in different
   shapes, and not all clusters are spherical. It is therefore desirable that a
   cluster analysis method be able to discover cluster shapes other than
   spherical.
7. Different data types: Many problems have a mixture of data types, for
   example numerical, categorical and even textual.
8. Result independent of data input order: It is desirable that a cluster
   analysis method not be sensitive to data input order. Whatever the order,
   the result of cluster analysis of the same data should be the same.
TYPES OF DATA AND COMPUTING DISTANCE
• Data sets come in a number of different forms. The data may be
  quantitative, binary, nominal or ordinal.
1. Quantitative data is quite common, for example weight, marks, height,
   price, salary and count.
2. Binary data involves attributes with two values, for example gender or
   marital status.
3. Qualitative nominal data is similar to binary data but may take more
   than two values and has no natural order, for example religion, foods or
   colors.
4. Qualitative ordinal data is similar to nominal data except that the data
   has an order associated with it, for example grades A, B, C, D.
COMPUTING DISTANCE
• Let the distance between two points x and y be D(x, y). We now define a
  number of distance measures.
1. Euclidean distance: The most commonly used distance measure. The
   formula is
   D(x, y) = (Σi (xi − yi)²)^(1/2)
2. Manhattan distance: The formula for Manhattan distance is
   D(x, y) = Σi |xi − yi|
3. Chebychev distance: The formula to be applied is
   D(x, y) = maxi |xi − yi|
4. Categorical data distance: For N categorical attributes, the formula is
   D(x, y) = (number of attributes in which xi ≠ yi) / N
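The four distance measures above can be sketched directly in Python. This is a minimal illustration; the point tuples and attribute values are invented for the example:

```python
import math

def euclidean(x, y):
    # D(x, y) = (sum_i (x_i - y_i)^2)^(1/2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # D(x, y) = sum_i |x_i - y_i|
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    # D(x, y) = max_i |x_i - y_i|
    return max(abs(a - b) for a, b in zip(x, y))

def categorical(x, y):
    # D(x, y) = (number of attributes where x_i != y_i) / N
    return sum(a != b for a, b in zip(x, y)) / len(x)

print(euclidean((0, 0), (3, 4)))    # 5.0
print(manhattan((0, 0), (3, 4)))    # 7
print(chebychev((0, 0), (3, 4)))    # 4
print(categorical(("red", "A"), ("red", "B")))  # 0.5
```

Note that for the same pair of points the three numerical measures generally give different values; the choice of measure affects which clusters a method discovers.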
TYPES OF CLUSTER ANALYSIS METHODS
• The cluster analysis methods may be divided into the following categories:
  1. Partitional methods  2. Hierarchical methods  3. Density-based methods
  4. Grid-based methods  5. Model-based methods.
1. Partitional methods: Partitional methods obtain a single-level partition of
   the objects. These methods are usually based on greedy heuristics that are
   applied iteratively to obtain a local optimum solution. Given n objects,
   these methods make k < n clusters of the data and use an iterative
   relocation method.
2. Hierarchical methods: Hierarchical methods obtain a nested partition of
   the objects, resulting in a tree of clusters.
3. Density-based methods: In this class of methods, typically for each data
   point in a cluster at least a minimum number of points must exist within a
   given radius.
4. Grid-based methods: In this class of methods, the object space, rather
   than the data, is divided into a grid. Grid-based methods are not affected
   by data ordering.
5. Model-based methods: A model is assumed, perhaps based on a
   probability distribution. Similarity measurement is based on the mean
   values.
PARTITIONAL METHODS
• Partitional clustering involves two methods: 1. the K-means method and
  2. the expectation maximization (EM) method.
• K-Means Clustering:
  – A partitional clustering approach.
  – Each cluster is associated with a centroid (center point).
  – Each point is assigned to the cluster with the closest centroid.
  – The number of clusters, K, must be specified.
  – The basic algorithm is very simple.
K-MEANS CLUSTERING ALGORITHM
• Initial centroids are often chosen randomly.
  – Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity,
  correlation, etc.
• K-means will converge for the common similarity measures mentioned
  above.
• Most of the convergence happens in the first few iterations.
  – Often the stopping condition is changed to ‘until relatively few
    points change clusters’.
• Complexity is O(n × K × I × d), where
  n = number of points, K = number of clusters,
  I = number of iterations, d = number of attributes.
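The basic algorithm described above (random initial centroids, assign each point to the closest centroid, recompute centroids as cluster means, stop when nothing moves) can be sketched as follows. This is a minimal, unoptimized sketch using Euclidean closeness; the sample points are invented for illustration:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Basic K-means: random initial centroids, Euclidean closeness."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # initial centroids chosen randomly
    for _ in range(max_iters):
        # assignment step: each point goes to the cluster with closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # update step: the centroid is the mean of the points in the cluster
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:     # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
centroids, clusters = kmeans(pts, k=2)
```

Because the initial centroids are random, different runs (different seeds) can produce different clusterings, which is exactly the initialization problem discussed on the next slides.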
IMPORTANCE OF CHOOSING INITIAL CENTROIDS
[Figure: scatter plot of the clustering after iteration 6, points plotted in the x–y plane.]
SOLUTIONS TO INITIAL CENTROIDS
• Multiple runs
  – Helps, but probability is not on your side.
• Sample the data and use hierarchical clustering to determine initial
  centroids.
• Select more than k initial centroids and then choose among these initial
  centroids.
  – Select the most widely separated.
• Post-processing.
• Bisecting K-means
  – Not as susceptible to initialization issues.
BISECTING K-MEANS ALGORITHM
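The slide body for this algorithm did not survive extraction. A minimal sketch, assuming the standard formulation of bisecting K-means: start with all points in one cluster, then repeatedly split the largest cluster into two with 2-means until K clusters remain. All function names and the fallback for degenerate splits are this sketch's own choices:

```python
import random

def two_means(points, rng, iters=20):
    """Split one cluster into two with basic 2-means (Euclidean)."""
    c = rng.sample(points, 2)              # two random initial centroids
    for _ in range(iters):
        halves = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, ci)) for ci in c]
            halves[d[1] < d[0]].append(p)  # index 1 if closer to c[1]
        c = [tuple(sum(v) / len(h) for v in zip(*h)) if h else c[i]
             for i, h in enumerate(halves)]
    return [h for h in halves if h]        # drop an empty half, if any

def bisecting_kmeans(points, k, seed=0):
    """Start with all points in one cluster; repeatedly bisect the
    largest cluster until k clusters remain."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()           # choose the largest cluster to split
        parts = two_means(biggest, rng)
        if len(parts) == 1:                # degenerate split: peel off one point
            parts = [biggest[:1], biggest[1:]]
        clusters.extend(parts)
    return clusters
```

Because each split only runs 2-means on one cluster, a bad random initialization affects a single bisection rather than the whole clustering, which is why the method is less susceptible to initialization issues.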
2. EXPECTATION MAXIMIZATION METHOD
• In contrast to the K-means method, the expectation maximization (EM)
  method is based on the assumption that the objects in the data set have
  attributes whose values are distributed according to some unknown linear
  combination (or mixture) of simple probability distributions.
• While the K-means method assigns objects to clusters to minimize
  within-group variation, the EM method assigns objects to different
  clusters with certain probabilities, in an attempt to maximize the
  expectation of the assignment.
• The EM method involves two steps. The first step, called the estimation
  (expectation) step, involves estimating the probability distribution of the
  clusters given the data. The second step, called the maximization step,
  involves finding the model parameters that maximize the likelihood of the
  solution.
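The two steps described above can be sketched for the simplest interesting case: a one-dimensional mixture of two Gaussian distributions. This is a minimal illustration under that assumption, not a general EM implementation; the crude initialization and the variance floor are this sketch's own choices:

```python
import math

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture.
    E-step: per-point cluster membership probabilities (responsibilities).
    M-step: parameters that maximize the expected likelihood."""
    mu = [min(data), max(data)]            # crude initial means
    var = [1.0, 1.0]
    w = [0.5, 0.5]                         # mixing weights
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w[j] / math.sqrt(2 * math.pi * var[j])
                 * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                 for j in range(2)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means and variances from responsibilities
        for j in range(2):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(data)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2
                         for r, x in zip(resp, data)) / nj
            var[j] = max(var[j], 1e-6)     # avoid collapsing variance
    return mu, var, w
```

Unlike K-means, each point retains a probability of belonging to every cluster (the responsibilities), rather than a hard assignment to the single closest centroid.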
HIERARCHICAL METHODS
• Hierarchical clustering involves two approaches:
  1. the agglomerative method and 2. the divisive method.
1. AGGLOMERATIVE METHOD - the more popular hierarchical clustering
   technique. The basic algorithm is straightforward:
   1. Compute the proximity matrix.
   2. Let each data point be a cluster.
   3. Repeat:
   4.   Merge the two closest clusters.
   5.   Update the proximity matrix.
   6. Until only a single cluster remains.
• The key operation is the computation of the proximity of two clusters.
  – Different approaches to defining the distance between clusters
    distinguish the different algorithms.
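The merge loop above can be sketched in Python. This minimal version uses single-linkage proximity (the distance between the closest pair of points in the two clusters), which is one possible way to define inter-cluster distance; it also stops at k clusters rather than merging all the way to one, so the result is usable directly:

```python
def agglomerative(points, k=1):
    """Agglomerative clustering: start with each point as its own cluster,
    repeatedly merge the two closest clusters (single linkage) until only
    k clusters remain."""
    clusters = [[p] for p in points]         # each data point is a cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: squared distance of the closest point pair
                d = min(sum((a - b) ** 2 for a, b in zip(p, q))
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the two closest clusters
    return clusters
```

Swapping the `min` for a `max` over point pairs would give complete linkage; this choice of proximity definition is exactly what distinguishes the different agglomerative algorithms.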
2. DIVISIVE METHOD
DENSITY-BASED METHODS
• A representative density-based method is DBSCAN, which is a
  density-based algorithm.
  – Density = number of points within a specified radius (Eps).
  – A point is a core point if it has more than a specified number of
    points (MinPts) within Eps.
    • These are points that are in the interior of a cluster.
  – A border point has fewer than MinPts within Eps, but is in the
    neighborhood of a core point.
  – A noise point is any point that is neither a core point nor a border
    point.
DBSCAN ALGORITHM
• Eliminate noise points.
• Perform clustering on the remaining points.
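The two steps above, combined with the core/border/noise definitions from the previous slide, can be sketched as follows. This sketch treats a point with at least MinPts neighbours (counting itself) within Eps as a core point; the Eps and MinPts values in the test data are arbitrary:

```python
def dbscan(points, eps, min_pts):
    """DBSCAN sketch: classify core points, then grow a cluster outward
    from each unvisited core point. Returns clusters as lists of point
    indices; noise points are left out of every cluster."""
    n = len(points)
    d2 = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
    # neighborhood of each point: all points within Eps (including itself)
    nbrs = [[j for j in range(n) if d2(points[i], points[j]) <= eps ** 2]
            for i in range(n)]
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}
    labels = [None] * n                    # None = noise (so far)
    clusters = []
    for i in core:
        if labels[i] is not None:
            continue
        cluster, frontier = [], [i]
        while frontier:
            j = frontier.pop()
            if labels[j] is not None:
                continue
            labels[j] = len(clusters)
            cluster.append(j)
            if j in core:                  # only core points expand the cluster
                frontier.extend(nbrs[j])
        clusters.append(cluster)
    return clusters
```

Border points are pulled in when a neighbouring core point expands, but never expand the cluster themselves; points reachable from no core point keep the label `None`, which is how the "eliminate noise points" step falls out of the definitions.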
QUALITY AND VALIDITY OF CLUSTER ANALYSIS METHODS
• The quality of a method involves a number of criteria:
  1. Efficiency of the method.
  2. Ability of the method to deal with noisy and missing data.
  3. Ability of the method to deal with large problems.
  4. Ability of the method to deal with a variety of attribute types and
     magnitudes.