INTRODUCTION
An image is a 2-D function f(x,y) that maps the spatial coordinates x and y to the
intensity or gray value at that point in space. Each point (x,y) in an image is called a
pixel. If the image is colored, the intensity at each point has three components:
red, green and blue. In a monochromatic image the three components are equal at
every point. If the domain and the range of the image function are discrete, the image
is called a digital image.
Digital image processing is the field that deals with processing digital
images with the help of a computer. It comprises procedures such as noise
reduction, contrast enhancement, image sharpening and smoothing, segmentation,
description of objects, classification of objects, etc.
The quality of a satellite image depends heavily on natural phenomena such as
atmospheric conditions, on illumination from the sun and from various artificial
sources, and on conditions such as the position of the satellite when the image was
taken.
Image classification is the process of making quantitative decisions from image
data, grouping pixels or regions of the image into classes intended to represent different
physical objects or types. The output of the classification process may be regarded as a
thematic map rather than an image. The majority of the classification techniques use
mainly the radiometric data (pixel value) present in the image with little or no reference
to the spatial variation.
Suppose we have an n-band image, and the pixel value in each band can take
k different values. The number of possible coordinates in the n-dimensional pixel
value space is k^n, a number that can very easily exceed a million. However, it is very
unlikely that the image represents a million or more different classes of data, or that we
could make use of information if we did. What we require is some simplification of the
data in the n-dimensional pixel value space, identifying a volume within this space as
representing a single class of data.
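To give a sense of scale for k^n, the short C sketch below evaluates it for a given number of bands. The function name `pixel_space_size` and the choice of 8-bit data (k = 256) in the usage note are assumptions made for illustration; the report does not state the bit depth of its images.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct coordinates in an n-band pixel value space when each
   band can take k different values, i.e. k^n. The result must fit in 64 bits. */
uint64_t pixel_space_size(uint64_t k, unsigned n)
{
    uint64_t total = 1;
    for (unsigned band = 0; band < n; band++) {
        /* Guard against 64-bit overflow before multiplying. */
        assert(k == 0 || total <= UINT64_MAX / k);
        total *= k;
    }
    return total;
}
```

Under the 8-bit assumption, `pixel_space_size(256, 3)` gives 16,777,216 and `pixel_space_size(256, 4)` gives 4,294,967,296, so even three bands are enough to pass the million mark.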
In our project we take up the first step of unsupervised classification, namely
clustering of the image data, in which the entire image is analyzed without reference to
any training data. The aim of the analysis is to identify distinguishable clusters of data in
the n-dimensional pixel value space. The resulting clusters can then be used in image
classification.
Multi Spectral Classification:
In this we mainly have:
• Supervised Classification
• Unsupervised Classification
• Hybrid Classification
Supervised Classification:
In this type of classification the image analyst
supervises the pixel categorization process by
specifying to the computer algorithm, numerical
descriptors of the various land cover types
present in a scene.
Unsupervised Classification:
In this type of classification the image data are
first classified by aggregating them into the
natural spectral groupings, or clusters, present in
the scene. The image analyst then determines
the land cover identity of these spectral groups
by comparing the classified image data to
ground reference data.
Hybrid Classification:
This type of classification involves aspects of
both supervised and unsupervised
classification and is aimed at improving the
accuracy or efficiency (or both) of the
classification process.
Unsupervised Classification
This family of classifiers involves algorithms that examine the unknown pixels in
an image and aggregate them into a number of classes based on the natural groupings or
clusters present in the image values. The basic premise is that values within a given cover
type should be close together in the measurement space, whereas data in different classes
should be comparatively well separated. The classes that result from unsupervised
classification are spectral classes. Because they are based solely on natural groupings in
the image values, the identity of the spectral classes is not known initially. The
analyst must compare the classified data with some form of reference data to
determine the identity and informational value of each spectral class.
Thus, in the unsupervised approach we first determine spectrally separable classes and
then define their informational utility.
There are numerous clustering algorithms that can be used to determine the natural
spectral groupings present in a data set. One common form of clustering is a process in
which the program reads through the entire data set and builds clusters, with a mean
vector associated with each cluster. A minimum distance to means classification
algorithm is then applied on a pixel-by-pixel basis, where each pixel is assigned to one of
the clusters initially created. We therefore create cluster structures to be used by the
classifier. The first step in an unsupervised classification is to cluster the image data, and
we implemented this basic clustering approach in Turbo C.
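The minimum distance to means step described above can be sketched in C as follows. This is an illustrative sketch, not the project's Turbo C source: the function names `spectral_dist2` and `nearest_mean`, the use of Euclidean distance as the spectral distance, and the flat row-major layout of the mean vectors are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Euclidean distance between two points in an nbands-dimensional
   pixel value space. The square root is skipped because it does not change
   which mean is nearest. */
double spectral_dist2(const double *a, const double *b, size_t nbands)
{
    double sum = 0.0;
    for (size_t i = 0; i < nbands; i++) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

/* Minimum distance to means: return the index of the cluster whose mean
   vector is closest to the pixel. `means` holds nclusters vectors of
   nbands values each, stored one after another. */
size_t nearest_mean(const double *pixel, const double *means,
                    size_t nclusters, size_t nbands)
{
    assert(nclusters > 0);
    size_t best = 0;
    double best_d2 = spectral_dist2(pixel, means, nbands);
    for (size_t c = 1; c < nclusters; c++) {
        double d2 = spectral_dist2(pixel, means + c * nbands, nbands);
        if (d2 < best_d2) {
            best_d2 = d2;
            best = c;
        }
    }
    return best;
}
```

With two cluster means (0, 0) and (10, 10), for instance, the pixel (8, 9) is assigned to the second cluster and the pixel (1, 2) to the first.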
Clustering Techniques
There are many clustering algorithms available:
a) Clustering Method
b) Non-Hierarchical Clustering Methods:
Nearest Centroid Sorting, fixed number of clusters
• Forgy's Method and Jancey's variant
• MacQueen's K-Means Method and variant
Nearest Centroid Sorting, variable number of clusters
• MacQueen's K-Means Method with coarsening and refining parameters
• Wishart's variant on K-Means
• ISODATA Method
c) Hierarchical Clustering Methods:
• The central agglomerative procedure
• Stored Matrix Approach
• Stored Data Approach
• Sorted Matrix Approach
• Parks Clustering Program
Cluster Analysis
a) Need for Cluster Analysis Algorithm
Even though little or nothing about the category structure can be
stated in advance, one frequently has at least some latent notions of the desirable
and unacceptable features of a classification scheme. In operational terms, the
analyst is usually sufficiently informed about the problem to
distinguish between good and bad category structures when confronted
with them. The number of ways of sorting n observations into m groups is a
Stirling number of the second kind:
\[
S_n^{(m)} = \frac{1}{m!} \sum_{k=0}^{m} (-1)^{m-k} \binom{m}{k} k^n
\]
It would take an inordinately long period of time to examine so many
alternatives and the ability to make meaningful distinctions between cases
would diminish rapidly. It is generally the intent of the cluster analysis
algorithm to emulate some human efficiency and find an acceptable solution
while considering only a small number of the alternatives.
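To illustrate how quickly the number of alternatives grows, the sketch below computes the number of ways of sorting n observations into m non-empty groups (a Stirling number of the second kind). It uses the standard recurrence S(n, m) = m S(n-1, m) + S(n-1, m-1) instead of an alternating sum, since the recurrence involves only additions and avoids cancellation. The function name `stirling2` and the limit `MAX_GROUPS` are assumptions made for this example, and overflow is not checked, so it is only suitable for small n.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_GROUPS 64   /* assumed upper limit on m for this sketch */

/* Number of ways to sort n observations into m non-empty groups, using
   S(n, m) = m * S(n-1, m) + S(n-1, m-1). Overflow is not checked. */
uint64_t stirling2(unsigned n, unsigned m)
{
    assert(m <= MAX_GROUPS);
    if (m > n)
        return 0;

    uint64_t row[MAX_GROUPS + 1] = {0};
    row[0] = 1;                          /* S(0, 0) = 1 */
    for (unsigned i = 1; i <= n; i++) {
        /* Update in place from high j to low j so that row[j - 1]
           still holds S(i-1, j-1) when row[j] is computed. */
        for (unsigned j = (i < m ? i : m); j >= 1; j--)
            row[j] = (uint64_t)j * row[j] + row[j - 1];
        row[0] = 0;                      /* S(i, 0) = 0 for i >= 1 */
    }
    return row[m];
}
```

Sorting only 10 observations into 3 groups already gives `stirling2(10, 3)` = 9330 possibilities, which shows why exhaustive examination is out of the question for image data.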
b) Uses of Cluster Analysis
Cluster analysis has been employed as an effective tool in scientific
inquiry. One of its most useful roles is to generate hypotheses about category
structures. An algorithm can assemble observations into groups which prior
misconceptions and ignorance would otherwise preclude.
The results of cluster analysis can contribute directly to the development of
classification schemes. In a more theoretical vein, cluster analysis can be used to
develop inductive generalizations.
c) Clustering Criteria
The term cluster is often left undefined and taken as a primitive notion
in much the same manner as “point” is treated in geometry. But when it comes
to finding clusters in real image data, the term bears a definite meaning. The
choice of the clustering criterion is tantamount to defining a cluster. It may not
be possible to say what a cluster is in abstract terms but it can always be defined
constructively through statement of the criterion and implementing algorithm.
Many criteria for clustering have been proposed and used. In some
problems there is a natural choice, while in others almost any criterion might
qualify as a candidate.
Problem Statement
Given:
Samples of multi-spectral satellite images from IRS satellites.
Problem: To identify distinguishable clusters of data in an n-dimensional pixel
value image.
Result:
Different Clusters obtained
Clustering Algorithm
• The analyst may be required to supply four types of information:
  - R, a radius in spectral space used to determine when a new cluster
    should be formed.
  - C, a spectral space distance parameter used when merging clusters.
  - N, the number of pixels to be evaluated between each merging of the
    clusters.
  - Cmax, the maximum number of clusters to be identified by the
    algorithm.
• The multispectral data set is evaluated sequentially, pixel by
  pixel, from left to right.
• First, we let the brightness values associated with the first pixel
  represent the mean data vector of a cluster. It is an n-dimensional
  mean data vector, with n being the number of bands
  used in the unsupervised classification.
• Pixel 1 is taken as cluster 1 and pixel 2 is provisionally taken as cluster 2.
  The spectral distance (D) between cluster 1 and cluster 2 is calculated.
• If the spectral distance between cluster 1 and cluster 2 is more
  than R, cluster 2 remains a separate cluster.
• If the spectral distance between cluster 1 and cluster 2 is less than
  R, pixel 2 is absorbed into cluster 1: the mean data vector of cluster 1
  becomes the average of the first and second pixel brightness values,
  and the weight of cluster 1 becomes 2.
• This cluster accumulation continues until the number of pixels
  evaluated is greater than N.
• At that point, the program stops evaluating individual pixels
  and examines the clusters obtained so far. It
  calculates the distance between each cluster and every other
  cluster. Any two clusters separated by a distance less than C are
  merged.
• Merging yields a new cluster whose mean vector is the
  weighted average of the two original clusters and whose weight is
  the sum of the two individual weights. This process continues
  until no two clusters are separated by a distance of less than
  C.
• It is necessary to evaluate the location of the clusters and
  combine some clusters.
• If the number of clusters formed is greater than Cmax, the program does not
  form new clusters but uses the minimum distance to means algorithm
  to classify all the pixels into one of the Cmax clusters.
• The analyst usually produces a display depicting to which cluster
  each pixel was assigned.
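The steps above can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not the project's Turbo C code: the names `Cluster`, `dist2`, `merge_close` and `build_clusters`, the fixed `NBANDS`, and the use of squared Euclidean distance are all hypothetical choices. The sketch also assumes that each pixel is compared with its nearest existing cluster, that once Cmax clusters exist every further pixel is absorbed into the nearest one, and that a final merge pass runs after the last pixel.

```c
#include <assert.h>
#include <stddef.h>

#define NBANDS 2   /* assumed number of bands for this sketch */

typedef struct {
    double mean[NBANDS];   /* mean data vector of the cluster */
    double weight;         /* number of pixels accumulated so far */
} Cluster;

/* Squared Euclidean distance in spectral space (an assumed metric). */
double dist2(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < NBANDS; i++)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

/* Merge any two clusters closer than C until none remain; returns the
   new cluster count. The merged mean is the weighted average. */
size_t merge_close(Cluster *cl, size_t n, double C)
{
    int merged = 1;
    while (merged) {
        merged = 0;
        for (size_t i = 0; i < n && !merged; i++)
            for (size_t j = i + 1; j < n && !merged; j++)
                if (dist2(cl[i].mean, cl[j].mean) < C * C) {
                    double w = cl[i].weight + cl[j].weight;
                    for (int b = 0; b < NBANDS; b++)
                        cl[i].mean[b] = (cl[i].mean[b] * cl[i].weight +
                                         cl[j].mean[b] * cl[j].weight) / w;
                    cl[i].weight = w;
                    cl[j] = cl[n - 1];   /* fill the gap with the last cluster */
                    n--;
                    merged = 1;          /* restart the pairwise scan */
                }
    }
    return n;
}

/* One sequential pass over npixels pixels (NBANDS values each, stored one
   after another). `cl` must have room for cmax clusters. Returns the
   number of clusters found. */
size_t build_clusters(const double *pixels, size_t npixels, double R,
                      double C, size_t N, size_t cmax, Cluster *cl)
{
    assert(cmax > 0 && N > 0);
    size_t n = 0, since_merge = 0;
    for (size_t p = 0; p < npixels; p++) {
        const double *px = pixels + p * NBANDS;
        size_t best = 0;
        double best_d2 = -1.0;           /* -1 means "no cluster seen yet" */
        for (size_t c = 0; c < n; c++) {
            double d2 = dist2(px, cl[c].mean);
            if (best_d2 < 0.0 || d2 < best_d2) {
                best_d2 = d2;
                best = c;
            }
        }
        if (n > 0 && (best_d2 < R * R || n == cmax)) {
            /* Within R (or no room for new clusters): absorb the pixel. */
            for (int b = 0; b < NBANDS; b++)
                cl[best].mean[b] = (cl[best].mean[b] * cl[best].weight +
                                    px[b]) / (cl[best].weight + 1.0);
            cl[best].weight += 1.0;
        } else {
            /* Farther than R from every cluster: start a new one. */
            for (int b = 0; b < NBANDS; b++)
                cl[n].mean[b] = px[b];
            cl[n].weight = 1.0;
            n++;
        }
        if (++since_merge >= N) {        /* periodic merge every N pixels */
            n = merge_close(cl, n, C);
            since_merge = 0;
        }
    }
    return merge_close(cl, n, C);        /* assumed final merge pass */
}
```

For example, with R = 5 and C = 3 the five two-band pixels (10,10), (11,11), (50,50), (51,49), (10,12) yield two clusters with weights 3 and 2. A classification pass would then assign each pixel to the nearest of the final cluster means to produce the display mentioned in the last step.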