PATTERN CLASSIFICATION
By: Dr. Rajeev Srivastava

Deals with:
• Concept of classifiers
• Evaluation of classifiers
• Structural and syntactic recognition methods
• Clustering algorithms
INTRODUCTION
• The process of comparing an unknown object with stored patterns in order to recognize it is called classification.
• It is the process of applying a label or pattern class to an unknown instance.
• It is the study of how machines can observe the environment, learn to distinguish patterns of interest, and make reasonable decisions about the categories of the patterns.
PATTERN CLASSIFICATION DESIGN CYCLE
[Flow diagram: image acquisition → image preprocessing → extraction of features → feature data collection and preprocessing → learning (main program / algorithm) → evaluation of results; if satisfactory, stop, if not, repeat the whole process.]
• One of the important components of pattern recognition is the ability of the system to learn from the data.
• Learning means the development of algorithms by acquiring knowledge from the given empirical data.
• The various learning approaches are:
1. Supervised learning
2. Unsupervised learning
3. Reinforced learning
SUPERVISED LEARNING
• It needs explicit supervision of the system.
• A cost/label is provided for each pattern in the training set, based on which the system learns to generate a concept to classify the patterns.
• Once the system becomes a learnt system, test data is supplied to test it.
UNSUPERVISED LEARNING
• No explicit supervision is required for an unsupervised system; the system learns by itself through a trial-and-error method.
• The instances form groups or clusters based on similarity measures.
• The goal of clustering is similar to that of classification; however, it is performed where a domain model is not available.
• The user has to provide the number of clusters they desire.
REINFORCED LEARNING
• Here the learning system produces binary decision outputs.
• The binary feedback of right or wrong is sent back to the input and is used to reinforce learning from the data.
• The learning continues until the learning system is right, given only the two binary assessments of right or wrong.
STAGES OF PATTERN RECOGNITION DESIGN CYCLE
Stages in the pattern recognition design cycle include:
1) Feature data collection and preprocessing.
2) Choosing the pattern recognition model.
3) Testing and evaluation of the performance of the pattern recognition task.
FEATURE DATA COLLECTION AND PREPROCESSING
• This is one of the important phases in pattern recognition because the quality of the pattern recognition task depends on the quality of the input feature data.
• The procedures in this phase are related to:
1. Collection of training data
2. Noise removal
3. Identifying missing values
4. Performing data transformations to normalize and condition the data
TRAINING DATASET
• The training dataset comprises vectors, patterns, cases, samples, or observations of an object.
• The collection of these data is called an image dataset or a feature dataset, and it is stored in a feature database.
• Some of the characteristics of the dataset are high dimensionality and sparseness.
COMPRESSION
• Data objects with a large number of bands increase the computational complexity, and the sparseness of the dataset also poses problems such as poor quality.
• Compression can be applied in these cases to keep the objects at a reasonable size.
PROBLEMS IN FEATURE DATA COLLECTION
• Some of the factors that may affect the quality and reliability of the results are noise, artefacts, bias, imprecision, and inaccuracy of the input data.
• Some common data collection problems are the presence of outliers, missing and inconsistent values, and duplicate data.
• Some of the qualities of good data for training the classifier are timeliness, relevance, and self-sufficiency.
PATTERN CLASSIFICATION MODELS
• Template matching approach
• Classification-based approach: statistical and syntactic
• Artificial Neural Network (ANN) approach
TEMPLATE MATCHING
• Also known as MATCHED FILTERING.
• This technique compares portions of images against one another.
• The target object to be identified is defined as a template.
• The template is then superimposed on and correlated with the image.
• The correlation is high if there is a perfect match between the template and the image.
• The degree of match can be determined from the highest correlation value.
TEMPLATE MATCHING METHODS
• The matching process moves the template image to all possible positions in a larger source image and computes a numerical index that indicates how well the template matches the image in that position.
• The correlation between the template and the image replaces the center pixel of the mask in the resultant image.
• The match is done on a pixel-by-pixel basis.
• The maximum value indicates the best match.
[Figure: the template image is correlated with the input image I(x,y) to produce the output image O(x,y).]
TYPES OF TEMPLATE MATCHING
There are basically two types of template matching implementations:
1. Bi-level image template matching
2. Grey-level image template matching
BI-LEVEL IMAGE TM
• The template is a small image, usually a bi-level image.
• Find the template in the source image with a yes/no approach at each position.
[Figure: template and source image.]
GREY-LEVEL IMAGE TM
• When using a template-matching scheme on a grey-level image, it is unreasonable to expect a perfect match of the grey levels.
• Instead of a yes/no match at each pixel, the difference in level should be used.
[Figure: template and source image.]
EUCLIDEAN DISTANCE
• Let I be a grey-level image and g be a grey-value template of size n × m.
• The distance is defined as:

d(I, g, r, c) = \sqrt{ \sum_{i=1}^{n} \sum_{j=1}^{m} \big( I(r+i, c+j) - g(i, j) \big)^2 }

• In this formula, (r, c) denotes the top-left corner of the template g.
CORRELATION
• Correlation is a measure of the degree to which two variables agree, not necessarily in actual value but in general behavior.
• The two variables are the corresponding pixel values in the two images, template and source.
• If we take f(x, y) as the given image and w(x, y) as the template, then the correlation of the image and the template is given by:

C(x, y) = \sum_{\alpha} \sum_{\beta} w(\alpha, \beta) \, f(x + \alpha, \, y + \beta)
GREY-LEVEL CORRELATION FORMULA

cor = \frac{ \sum_{i=0}^{N-1} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=0}^{N-1} (x_i - \bar{x})^2 } \; \sqrt{ \sum_{i=0}^{N-1} (y_i - \bar{y})^2 } }

where
• x_i is the template grey-level image
• x̄ is the average grey level in the template image
• y_i is the source image section
• ȳ is the average grey level in the source image section
• N is the number of pixels in the section image (N = template image size = columns × rows)

The value 'cor' lies between −1 and +1, with larger values representing a stronger relationship between the two images.
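As a rough illustration (not from the original slides), the Python sketch below slides a template over a grey-level image and evaluates the normalized correlation coefficient defined above at every position; the image, the template and their sizes are made-up assumptions for the example.

```python
import numpy as np

def normalized_correlation(patch, template):
    """Correlation coefficient between a patch and a template of the same shape."""
    x = patch.astype(float) - patch.mean()
    y = template.astype(float) - template.mean()
    denom = np.sqrt((x ** 2).sum()) * np.sqrt((y ** 2).sum())
    return 0.0 if denom == 0 else float((x * y).sum() / denom)

def match_template(image, template):
    """Return the correlation map and the (row, col) of the best match."""
    H, W = image.shape
    h, w = template.shape
    cor = np.zeros((H - h + 1, W - w + 1))
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            cor[r, c] = normalized_correlation(image[r:r + h, c:c + w], template)
    best = np.unravel_index(np.argmax(cor), cor.shape)
    return cor, best

# Toy example: a bright 3x3 square hidden in a noisy image
rng = np.random.default_rng(0)
image = rng.random((20, 20))
image[5:8, 10:13] += 2.0
template = image[5:8, 10:13].copy()
cor_map, best_pos = match_template(image, template)
print("best match at", best_pos)   # expected: (5, 10)
```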
DISADVANTAGES OF TEMPLATE MATCHING
• No variation in scale or orientation is permitted.
• It involves large amounts of computation when used for higher dimensions; hence feature-based schemes are preferred.
CLASSIFICATION
• It is a supervised learning method.
• Classification involves two phases:
1. Training phase: The classifier first needs to be trained; that is, it should learn the complex relationship between the input image features using the training data.
2. Testing phase: After the learning process is over, the classifier is called a 'learnt system' and produces a classification model; the classifier then assigns a label, which may be correct or incorrect.
CLASSIFICATION SCHEME
[Diagram: the unknown image or object yields test features; known object features feed a learning algorithm that builds the classification model, which assigns a label to the test features.]
TRAINING PHASE
• In this phase, the classifier algorithm is fed with a large set of known data, called training data or labelled data.
• A dataset is required to train the classifier to classify the input.
• The attributes are called input features, attributes, or independent variables, and they should be large in number and representative in nature.
• Once the training phase is over, the data-driven classification model is created.
TESTING PHASE
• In this phase the constructed model is tested and evaluated with unknown test data.
• The model can be either:
1. Descriptive: it can explain its classification decision, e.g. decision-tree-based classifiers.
2. Predictive: it cannot explain its decision, e.g. neural-network-based classifiers.
TYPES OF CLASSIFIERS (Based on input)
• The difference between these classifiers lies only in the nature of the input data.
• There can be two types of classifiers:
1. PIXEL BASED: The input to the classifier is raw pixel data; the classifier in this case takes images that contain several pixels of the required regions.
2. FEATURE BASED: This technique extracts features of the image such as size, shape, location, and texture, which are then used for classification.
FACTORS AFFECTING PERFORMANCE OF A CLASSIFIER
Generally the performance of the classifier depends on these factors:
• Nature of data: A classification model depends on the availability of good-quality training data; another problem is that of missing data, which may be unintentional or deliberate.
• Nature of learning ('over-fitting of the model'): The learning process should not take more than necessary, as this leads to a generalization error.
CLASSIFIER DESIGN
[Taxonomy diagram: classification algorithms are divided into statistical techniques (parametric techniques, which include decision-theoretic and probabilistic techniques, and non-parametric techniques), non-statistical techniques (syntactic and structural techniques), and hybrid techniques.]
STATISTICAL CLASSIFIERS
• Statistical classifiers use statistical principles for deriving models from a given training dataset using statistical learning techniques.
• These are of two types:
1. Parametric classifiers
2. Non-parametric classifiers
PARAMETRIC CLASSIFIER
• These classifiers take a set of training data and construct a classification model.
• The parameters are estimated by assuming a probability distribution or density for each dataset.
• Then statistical parameters such as the mean and variance are found.
• These are of two types, based on the techniques they use:
1. Decision-theoretic techniques
2. Probabilistic techniques
DECISION THEORETIC METHODS
• Often called DISCRIMINANT FUNCTION ANALYSIS.
• The idea used here is to classify the object by designing a decision boundary or discriminating functions to separate the feature-vector clusters in the feature space.
• The decision function is designed so as to give different responses to different classes.
• E.g. LDA (linear discriminant analysis).
LDA (LINEAR DISCRIMINANT ANALYSIS)
• The idea here is to use decision functions to discriminate the input features.
• Let x = (x_1, x_2, ..., x_n)^T represent an n-dimensional feature vector.
• Let the number of classes be k.
• Here we design k decision functions d_1(x), d_2(x), d_3(x), ..., d_k(x).
• The instance is classified as class i and not j if:
  d_i(x) > d_j(x) for all j ≠ i, with i, j = 1, 2, ..., k
• The decision boundary between classes i and j is then given as:
  d_i(x) − d_j(x) = 0
• The decision rule can be designed as: with d_ij(x) = d_i(x) − d_j(x), assign the instance to class i if d_ij(x) > 0 and to class j if d_ij(x) < 0.
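To make the rule concrete, here is a minimal Python sketch of classification with k = 3 linear decision functions d_i(x); the weight vectors and biases are invented purely for illustration and are not from the slides.

```python
import numpy as np

# Hypothetical linear decision functions d_i(x) = w_i . x + b_i for k = 3 classes;
# the weights and biases below are made-up numbers used only for illustration.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])        # one weight vector per class (k x n)
b = np.array([0.0, 0.0, 0.5])       # one bias per class

def classify(x):
    d = W @ x + b                    # d_i(x) for every class i
    return int(np.argmax(d))         # assign x to the class with the largest d_i(x)

print(classify(np.array([ 2.0,  0.1])))   # dominated by d_1 -> class 0
print(classify(np.array([ 0.1,  2.0])))   # dominated by d_2 -> class 1
print(classify(np.array([-2.0, -2.0])))   # dominated by d_3 -> class 2
```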
PROBABILISTIC TECHNIQUES
• These use probabilistic techniques for classification.
• These are based on two probability concepts: prior probability and conditional probability.
• One of the most popular classifiers based on this is the Bayesian classifier.
• Bayesian principle: one can find the inverse probability P(i|x) from P(x|i) and P(i) using Bayes' theorem, given by:

P(i|x) = \frac{P(x|i) \, P(i)}{P(x)}
BAYESIAN CLASSIFIER
The Bayesian classifier requires three pieces of information:
• P(C_i): the prior probability of class i.
• P(x|i): the conditional probability of observing x given class i; this can be calculated from the training data table.
• P(x): the sum of P(x|i) P(i) over all classes. This is not class information but serves as a normalization factor.

There are four types of Bayesian classifier, all based on the Bayesian principle:
1. Maximum likelihood classifier
2. Minimum distance classifier
3. Minimum risk classifier
4. Bayesian classifier for multiple features
BAYESIAN CLASSIFIER: ALGORITHM
The algorithm for the Bayesian classifier:
1. Train the classifier with the training images or labelled feature data.
2. Compute the prior probability P(i) using intuition based on experts' opinion, or using histogram-based estimation.
3. Compute P(i|x).
4. Find the maximum P(i|x) and assign the unknown instance to that class.
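A minimal sketch of this algorithm, assuming 1-D labelled data and histogram-based estimation of P(x|i); the class samples, bin edges and priors are all invented for the example.

```python
import numpy as np

# Hypothetical labelled 1-D feature data for two classes (illustrative only).
rng = np.random.default_rng(1)
x_class0 = rng.normal(2.0, 1.0, 500)    # samples of class 0
x_class1 = rng.normal(6.0, 1.0, 300)    # samples of class 1

bins = np.linspace(-2, 10, 25)          # shared histogram bins

def histogram_likelihood(samples, bins):
    """Histogram-based estimate of P(x | class)."""
    counts, _ = np.histogram(samples, bins=bins)
    return counts / counts.sum()

p_x_given = [histogram_likelihood(x_class0, bins),
             histogram_likelihood(x_class1, bins)]
priors = np.array([len(x_class0), len(x_class1)], dtype=float)
priors /= priors.sum()                   # P(i) estimated from class frequencies

def classify(x):
    """Assign x to the class with the largest posterior P(i|x) ∝ P(x|i) P(i)."""
    b = int(np.clip(np.digitize(x, bins) - 1, 0, len(bins) - 2))
    scores = [p_x_given[i][b] * priors[i] for i in range(2)]
    return int(np.argmax(scores))

print(classify(1.5))   # expected: class 0
print(classify(6.5))   # expected: class 1
```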
PROS AND CONS OF BAYESIAN CLASSIFIERS
The Bayesian classifiers have advantages because:
1. They are easy to use.
2. They require only one scan of the training set.
3. They are not affected much by missing values.
4. They produce good results for datasets with simple relationships.

The only disadvantage of Bayesian classifiers is that they cannot be used directly for continuous data.
MAXIMUM LIKELIHOOD CLASSIFIER
• According to the Bayesian maximum likelihood classifier, the instance is assigned to the class i for which P(i|x) is maximum.
• Suppose there are many (m) independent attributes; then the class-conditional probability is given as:

P(x|i) = \prod_{k=1}^{m} P(x_{ik} | i)

• In other words, an instance having many attributes is assigned to class i and not to class j if:
  P(i|x) > P(j|x)
• This resultant algorithm is called the maximum likelihood classifier.
• If the attributes are assumed to be independent, the same classifier is then called the naive Bayesian classifier.
MINIMUM DISTANCE CLASSIFIER
• When the training set consists of many images, it is easier to approximate P(x|i) as a function with fewer parameters.
• This approximation of the input data is in the form of a Gaussian distribution.
• This kind of approximation is called parametric approximation.
PARAMETRIC APPROXIMATION
• The parametric (Gaussian) approximation is given by:

P(x|i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left( -\frac{(x - m_i)^2}{2\sigma_i^2} \right)

where m_i and σ_i are the mean and the standard deviation of class i.
• Since the class is multi-dimensional, the mean becomes a mean vector and the variance becomes a covariance matrix Σ_i.
• So the resultant expression (the class-conditional density weighted by the prior) is:

P(i) \, P(x|i) = P(i) \, \frac{1}{\sqrt{(2\pi)^d \det \Sigma_i}} \exp\!\left( -\tfrac{1}{2} (x - m_i)^T \Sigma_i^{-1} (x - m_i) \right)

• The term (x − m_i)^T Σ_i^{-1} (x − m_i) is called the MAHALANOBIS DISTANCE.
• The term 1/√((2π)^d) can be ignored as it is a scaling factor common to all classes.
• Similarly, by taking the logarithm and simplifying, this expression yields the discriminant (up to a constant):

\log P(i|x) = \log_e P(i) - \tfrac{1}{2} \log_e \det \Sigma_i - \tfrac{1}{2} (x - m_i)^T \Sigma_i^{-1} (x - m_i)
PARAMETRIC APPROXIMATION (CONT.)
Therefore, based on the distance used, there are variations of the Bayesian distance classifier:
1. Mahalanobis distance
2. Euclidean distance
3. City-block distance

The Mahalanobis distance is the most reliable but is computationally intensive compared to the others.
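For reference, a small numpy sketch of the Mahalanobis term used above; the class mean and covariance values are assumptions made for the example.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - m)^T Sigma^{-1} (x - m)."""
    diff = x - mean
    return float(diff @ np.linalg.inv(cov) @ diff)

# Illustrative class statistics (made-up numbers, not from the slides)
mean = np.array([2.0, 3.0])
cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])

x = np.array([3.0, 5.0])
print(mahalanobis_sq(x, mean, cov))        # distance that accounts for covariance
print(float((x - mean) @ (x - mean)))      # squared Euclidean distance, for comparison
```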
DECISION FUNCTIONS FOR MINIMUM DISTANCE CLASSIFIERS
• The mean vector of pattern class i is:

m_i = \frac{1}{N_i} \sum_{x \in w_i} x ,  i = 1, 2, ..., k

• The approach used here is to assign the instance to the class for which the distance between the unknown sample and the class mean vector is minimum. The Euclidean distance between the unknown instance x and the mean vector is:

D_i(x) = \| x - m_i \| ,  where the norm is defined as \|a\| = \sqrt{a^T a}

• Minimizing this distance is equivalent to maximizing the decision function for class i:

d_i(x) = x^T m_i - \tfrac{1}{2} m_i^T m_i ,  i = 1, 2, ..., k

and similarly for class j: d_j(x) = x^T m_j − ½ m_j^T m_j.
• The decision boundary between classes i and j is calculated as d_i(x) − d_j(x) = 0, which is equivalent to:

x^T (m_i - m_j) - \tfrac{1}{2} \big( m_i^T m_i - m_j^T m_j \big) = 0

• For n = 2 the dividing function is a line, for n = 3 it is a plane, and for n > 3 it is a hyperplane.
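A small numpy sketch, assuming made-up class means, showing that the nearest-mean rule and the decision functions d_i(x) = x^T m_i − ½ m_i^T m_i pick the same class.

```python
import numpy as np

# Hypothetical class mean vectors (illustrative values only)
means = np.array([[1.0, 1.0],     # m_1
                  [5.0, 5.0],     # m_2
                  [1.0, 6.0]])    # m_3

def nearest_mean(x):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(means - x, axis=1)))

def decision_function(x):
    """Equivalent rule: maximize d_i(x) = x^T m_i - 0.5 * m_i^T m_i."""
    d = means @ x - 0.5 * np.sum(means ** 2, axis=1)
    return int(np.argmax(d))

x = np.array([4.5, 4.0])
print(nearest_mean(x), decision_function(x))   # both give the same class index
```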
MINIMUM RISK CLASSIFIER
• A cost function, called a loss function, is assigned to the classification.
• In case of any error of misclassification, i.e. a risk, a penalty is assigned so that the risk can be minimized or avoided in future.
• The cost of the decision is based on the nature of the application in which the classifier is used.
• The estimated cost or loss function is multiplied with the posterior probabilities for taking the final decision of assigning a label to the unknown instance.
• The decision rule can be designed as follows, where α_1 and α_2 are the costs of the decisions:

If Loss(α_2) · P(i|x) > Loss(α_1) · P(j|x): assign the instance x to class i.
If Loss(α_2) · P(i|x) < Loss(α_1) · P(j|x): assign the instance x to class j.
BAYESIAN CLASSIFIER FOR MULTIPLE FEATURES
• Real-world problems involve objects having multiple attributes.
• In this case a set of features is used as a feature vector x.
• So, for k classes:

P(i|x) = \frac{ P(x|i) \, P(i) }{ \sum_{j=1}^{k} P(x|j) \, P(j) }

• With P(x|i) being a Gaussian distribution given by:

P(x|i) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\!\left( -\tfrac{1}{2} (x - m)^T \Sigma^{-1} (x - m) \right)

• If more features are involved, the mean becomes a mean vector and the variance becomes a covariance matrix Σ.
NON-PARAMETRIC STATISTICAL METHODS
• In this method a representative of every class is selected.
• The classification is performed by assigning each tuple to the class to which it is most similar.
• Let the classes be {c_1, c_2, ..., c_n} and let the training dataset D be {t_1, t_2, ..., t_n}.
• The k-nearest-neighbours problem is to assign t_i to the class c_j such that the similarity measure of (t, c_j) is greater than or equal to the similarity measure of (t, c_i), where i ≠ j.
• The similarity measure can be obtained by using distance measures.

ALGORITHM:
1. Choose the representative of the class. Normally, the centre or the centroid of the class is chosen as the representative.
2. Compare the test tuple with the centre of each class.
3. Classify the test tuple to the appropriate class.
REGRESSION METHODS
• Regression is one of the methods used for numerical prediction.
• Regression analysis models the relationship between one or more independent variables (input attributes) and a dependent variable (result).
• E.g. fitting a line to a set of points. The line can be described as y = W_0 + W_1 x, where W_0 and W_1 are the regression coefficients (weights).
• The coefficients can be found using the method of least squares, fitting the line that minimizes the error between the actual data and the estimate.
• If D is the training set:

W_1 = \frac{ \sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y}) }{ \sum_{i=1}^{|D|} (x_i - \bar{x})^2 } ,   W_0 = \bar{y} - W_1 \bar{x}

where x̄ and ȳ are the mean values of the data x and y.
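A short numpy sketch applying the least-squares formulas above to a few made-up points (the data values are purely illustrative).

```python
import numpy as np

# Made-up training points roughly following y = 1 + 2x (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # slope W_1
w0 = y_bar - w1 * x_bar                                             # intercept W_0

print(f"y = {w0:.3f} + {w1:.3f} x")   # should come out close to y = 1 + 2x
```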
STRUCTURAL AND SYNTACTIC CLASSIFIER ALGORITHMS
• Structural methods exploit the relationships that exist among the basic elements of the objects.
• They use techniques such as graphs to encode the objects, and the problem of recognition becomes a matching problem.
• Syntactic methods (grammar-based or linguistic approach) use strings or small sets of pattern primitives and grammatical rules for recognizing the object.
SYNTACTIC CLASSIFIERS
• The idea is to decompose the object in terms of basic primitives.
• The process of decomposing an object into a set of primitives is called parsing.
• The basic primitives can then be reconstructed into the original object using formal languages to check whether the recognized pattern is obtained.
• Hence formal language theory plays an important role in syntactic classification.
STAGES OF SYNTACTIC CLASSIFIER
• 1st phase (training phase): The syntactic classifier is given a training dataset of valid strings of known objects. The patterns are decomposed into basic primitives, and the grammar necessary for combining the primitives to reconstruct the original object is identified in this phase.
• 2nd phase (testing phase): Here unknown patterns are given to the grammar of the syntactic classification system. Each unknown pattern is decomposed into the basic primitives and checked using a parser.
SHAPE MATCHING ALGORITHMS
• Assume the shapes A and B have shape numbers in the form of strings of chain codes, and let the strings represent the shape characteristics of the boundary of an object.
• By this assumption, the shapes have a degree of similarity k if:

S_j(A) = S_j(B) for j = 4, 6, 8, ..., k
S_j(A) ≠ S_j(B) for j = k + 2, k + 4, ...

here j is the order of the shape number.
• The similarities are recorded in a matrix called the similarity matrix.
• Another way is to use a distance measure for shape matching.
• The distance measure is given as the reciprocal of the similarity measure, D(A, B) = 1/k, where D(A, B) is the distance between the two shapes A and B and k is the degree of similarity.
STRING MATCHING ALGORITHMS
• Let there be two regions, a and b. Assume that they are coded into two strings:
a = {a_1, a_2, ..., a_n}
b = {b_1, b_2, ..., b_m}
• A match is said to occur at position k if a_k = b_k; let the number of matches between the two strings be α.
• Then the following two measures can be defined:
1. The number of symbols that do not match:

β = max(|a|, |b|) − α

where |a| and |b| are the lengths of the strings a and b, and α is the number of matches between them. β = 0 only when the two strings are identical.
2. The degree of similarity:

R = α / β = α / ( max(|a|, |b|) − α )

• When the strings are the same, R = ∞. The value of R is high when there is a good match between the strings.
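A few lines of Python computing α, β and R for two chain-code strings; the strings themselves are invented for illustration.

```python
# Illustrative chain-code strings (made-up values)
a = "00330123"
b = "00320122"

# alpha: number of positions where the symbols match
alpha = sum(1 for x, y in zip(a, b) if x == y)

# beta: number of symbols that do not match
beta = max(len(a), len(b)) - alpha

# R: degree of similarity (infinite when the strings are identical)
R = float("inf") if beta == 0 else alpha / beta

print(alpha, beta, R)   # here: 6 matches, 2 mismatches, R = 3.0
```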
STRUCTURAL METHODS: RULE BASED ALGORITHMS
• Tree search is a popular approach that uses rules for classification.
• The simplest rules have the form IF (condition) THEN (conclusion).
• The IF part is called the antecedent or precondition.
• The THEN part is called the rule consequent.
• Decision rules are generated using a technique called a covering algorithm, where the best attribute is chosen to perform the classification based on the training data.
• The algorithm chooses the best attribute that minimizes the error and uses that attribute in generating a rule.
RULE BASED ALGORITHMS
• In a decision tree, every node can have only two children.
• The root is a specially designated node, and all the other intermediate nodes of the tree represent the rule conditions.
• The leaves of the tree are the classes that are assigned to the instances.
• The unknown object or instance features are taken and their values are compared and validated with the conditions represented sequentially in the internal nodes of the tree.
• Tracing the path from the root to the assigned class gives the conditions that led to the classification of that instance.
• For any tree classifier, the required feature is searched, and the search is continued till the instance is assigned to a class.
• Some of the algorithms that are used:
1. Top-down search
2. DFS
3. BFS
4. A* algorithms
GRAPH BASED APPROACH
• The graph-based approach is an extension of the tree-based approach.
• Initially an object is modelled as a graph.
• Graph matching is then used to give the similarity measure of the objects.
• Two graphs can be similar even if they are structurally different.
• If there is a complete match, the match is declared isomorphic; otherwise the graphs are dissimilar.
EVALUATION OF CLASSIFIER ALGORITHMS
Some of the techniques used are:
1. Separate training sets
• This is one of the simplest methods for testing a classifier.
• The dataset is separated into two sets: one of them is called the training dataset, and the other is the test dataset, which is used for testing the performance of the classifier.
EVALUATION OF CLASSIFIER ALGORITHMS (cont.)
2. k-fold cross validation
• It is an improvement over the previous method.
• The dataset is divided into k subsets.
• Each time a classifier is tested, k − 1 subsets together are considered as the training dataset and the remaining subset is the test dataset.
• The process is then repeated for k trials.
• A typical value of k is 10.
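A minimal Python sketch of how the k folds can be formed and rotated; the dataset size, the value of k and the random seed are arbitrary choices for the example.

```python
import numpy as np

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k trials."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# Toy usage: 20 samples, 5 folds -> each trial trains on 16 and tests on 4
for fold, (train, test) in enumerate(k_fold_splits(20, 5)):
    print(f"fold {fold}: train={len(train)} samples, test={len(test)} samples")
```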
EVALUATION OF CLASSIFIER ALGORITHMS (cont.)
3. Leave-one-out cross validation
• Also called the N-fold or jack-knifing technique.
• In this method every single instance in turn is treated as the test dataset.
• Then N classifiers are generated, and each of them is used to classify its single left-out instance.
• This method is unsuitable for large real-world problems because the computation is intensive.
EVALUATION OF CLASSIFIER ALGORITHMS (cont.)
Method and performance measure:
• Separate training sets: predictive accuracy C/N, where C is the number of instances correctly classified and N is the total number of instances.
• k-fold cross validation: overall performance is the average misclassification error of the classifier across all k trials.
• Leave-one-out cross validation: predictive accuracy = correctly classified samples / total number of instances.
Metrics of qualitative quantification:
• Classification time: time for constructing the model + time for classifying unknown instances.
• Robustness: immunity of the classifier to noise or missing data.
• Scalability: ability to handle large datasets.
• Goodness of fit: quality of the model generated, as described by the confusion matrix.
• True positive rate (TP rate), the sensitivity of the classifier: TP / P, where P = TP + FN.
• False positive rate (FP rate), the proportion of negative instances wrongly classified as positive: FP / N, where N = FP + TN.
• False negative rate (FN rate), the probability of producing an erroneous result for positive instances: FN / P, where P = TP + FN.
• True negative rate (TN rate), the specificity of the classifier: TN / N, where N = FP + TN.
• Positive predictive value (precision): TP / (TP + FP).
• Accuracy, the ability of a classifier to classify instances correctly: (TP + TN) / (TP + TN + FP + FN).
• Negative predictive value, the probability that an object predicted as negative is truly negative: TN / (TN + FN).
• Error rate, the probability of an object not being classified correctly: (FP + FN) / (TP + TN + FP + FN).
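A tiny Python sketch computing the metrics above from raw confusion-matrix counts; the counts are made up for illustration.

```python
# Made-up confusion-matrix counts (illustrative only)
TP, FP, FN, TN = 40, 5, 10, 45

P = TP + FN          # actual positives
N = FP + TN          # actual negatives

tp_rate   = TP / P                            # sensitivity
fp_rate   = FP / N
fn_rate   = FN / P
tn_rate   = TN / N                            # specificity
precision = TP / (TP + FP)
accuracy  = (TP + TN) / (TP + TN + FP + FN)
error     = (FP + FN) / (TP + TN + FP + FN)

print(tp_rate, fp_rate, precision, accuracy, error)
```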
Graphical method for performance evaluation
• The receiver operating characteristic (ROC) graph is an effective tool for visualizing a classifier's performance as well as for comparing the performance of many classifiers.
• It is a 2-D graph where the x-axis is the FP rate and the y-axis is the TP rate.
• For any classifier the FP and TP rates can be plotted as an (x, y) point on the graph.
• To compare any two classifiers, their points are compared.
• The ROC curve is helpful in understanding the tuning process that results in the best way of classification.
• The area under the curve indicates the accuracy of the model: if the area is one, the classifier is perfect.
• A classifier's performance can be crudely compared with the best classifier, represented by the point (0, 1), using the Euclidean distance formula:

Euclidean distance = \sqrt{ (\text{FP rate})^2 + (1 - \text{TP rate})^2 }

• This Euclidean distance ranges from 0 (best classifier) to √2 (worst classifier).
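A small Python helper, assuming invented (FP rate, TP rate) points, that ranks classifiers by their distance to the ideal corner (0, 1) as defined above.

```python
import math

# Hypothetical (FP rate, TP rate) points for three classifiers (illustrative only)
classifiers = {"A": (0.10, 0.85), "B": (0.30, 0.95), "C": (0.05, 0.60)}

def distance_to_ideal(fp_rate, tp_rate):
    """Euclidean distance to the perfect classifier at (FP = 0, TP = 1)."""
    return math.sqrt(fp_rate ** 2 + (1.0 - tp_rate) ** 2)

for name, (fp, tp) in sorted(classifiers.items(),
                             key=lambda kv: distance_to_ideal(*kv[1])):
    print(name, round(distance_to_ideal(fp, tp), 3))   # smaller is better
```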
UNSUPERVISED LEARNING: CLUSTERING
• Clustering is a technique for partitioning a group of images/data into meaningful disjoint subgroups.
• Images that are similar to each other group themselves into a single cluster.
• All the images in a subgroup are similar to each other, and images across clusters are different.
• Clustering is an example of unsupervised learning, where there is no idea about the classes or clusters prior to clustering.
METHODS FOR FINDING THE SIMILARITY AND DISSIMILARITY OF IMAGES
• Image clustering algorithms are based on the notion of similarity or dissimilarity between images.
• Proximity can be used to denote similarity and dissimilarity together.
• Similarity measures are indicated by distance functions.
DISTANCE MEASURES
• A distance function characterizes how close one image is to another.
• For a distance function to be called a metric, it should fulfil the following conditions (including the triangle inequality):
1. D(i, j) ≥ 0 for all i and j
2. D(i, j) = 0 if i = j
3. D(i, j) = D(j, i) for all i and j
4. D(i, k) ≤ D(i, j) + D(j, k) for all i, j, and k
• The distance measures depend on the data type of the objects involved in the clustering process.
Distance measures by data type:
• Nominal (categorical) variables, e.g. identification numbers or label numbers:
  D(x, y) = (n − m) / n, where n is the number of attributes and m is the number of matches between the attributes of x and y.
• Binary variables, e.g. variables indicating the occurrence or non-occurrence of an event:
  D(x, y) = (n − m) / (n − s), where m is the number of matches and s is the number of features absent in both images.
• Quantitative measures, e.g. size, centroid, area:
  Euclidean distance: D(O_i, O_j) = \sqrt{ \sum_k (O_{ik} - O_{jk})^2 }
  Manhattan average distance: D(O_i, O_j) = \frac{1}{N} \sum_{k=1}^{n} |O_{ik} - O_{jk}|
  Minkowski distance: D(x, y) = \big( |x_1 - y_1|^q + |x_2 - y_2|^q + \dots + |x_n - y_n|^q \big)^{1/q}
• Ordinal or ranked variables, e.g. grades {S, A, B} with an inherent order S > A > B:
  Z_i = (r_i − 1) / (M − 1), where r_i is the rank and M is the maximum rank.
• Qualitative measures, e.g. shape numbers: number of matches.
• Interval and ratio variables, used when the difference measure is meaningful: Minkowski distance, as above.
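The quantitative distance measures above, written out as a short numpy sketch; the example vectors are arbitrary.

```python
import numpy as np

def euclidean(a, b):
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan_average(a, b):
    return float(np.mean(np.abs(a - b)))

def minkowski(a, b, q):
    return float(np.sum(np.abs(a - b) ** q) ** (1.0 / q))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(euclidean(a, b))           # same value as minkowski(a, b, 2)
print(manhattan_average(a, b))
print(minkowski(a, b, 2))
```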
CLUSTERING ALGORITHMS
[Taxonomy diagram: clustering algorithms are divided into hierarchical clustering (agglomerative algorithms and divisive methods) and partitional methods.]
HIERARCHICAL CLUSTERING
• Hierarchical methods produce a recursive partitioning of the set of objects, and the results are shown as a dendrogram.
• These are subdivided into agglomerative methods and divisive methods.
• The advantages of this method are:
1. There is no need for a vector representation of each object.
2. These algorithms are simple and easy to understand and interpret.
3. They normally yield the correct number of clusters.
4. They are helpful in identifying outliers.
DENDROGRAM
[Figure: dendrogram for a grayscale image.]
AGGLOMERATIVE ALGORITHMS
• These treat each individual object as a cluster. Clusters are then merged with other clusters, and this process is continued until ultimately a single cluster is obtained.
• Stages:
1. Create a separate cluster for every data instance.
2. Repeat steps 3 and 4 until a single cluster is obtained:
3. Determine the two most similar clusters using similarity measures.
4. Merge the two clusters into a single cluster.
5. If no more merging is possible, choose the clustering formed by one of the results as final.
AGGLOMERATIVE ALGORITHMS (cont.)
• One of the popular algorithms is the single-linkage algorithm.
• It takes a single instance and merges it with the cluster to which it is closest.
• This process is continued till no more merging is possible.
PARTITIONAL METHODS
• These are greedy approaches that are used iteratively to obtain a single level of partition.
• These produce locally optimal or suboptimal solutions.
• One of the popular algorithms is the k-means algorithm.
K-MEANS ALGORITHM
The k-means algorithm is:
1. The user has to specify the number of clusters initially.
2. Then the algorithm generates the required number of random clusters, called the initial cluster centres.
3. It then assigns each point to the cluster to which its distance is minimum.
4. Then the centroid of each cluster is computed, and the iteration is repeated till there is no change in the centroid values; otherwise new means are chosen and the process is repeated.

K-means cluster evaluation for a sample matrix (MATLAB):
X = [randn(100,2)+ones(100,2); randn(100,2)-ones(100,2)];
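A compact numpy sketch of the k-means loop described above, run on two Gaussian blobs analogous to the MATLAB example; the function name, parameters and data are assumptions made for illustration (e.g. the sketch does not guard against empty clusters).

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign points to the nearest centre, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # initial cluster centres
    for _ in range(n_iter):
        # assign each point to the nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # recompute centroids; stop when they no longer change
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels

# Two Gaussian blobs, analogous to the MATLAB example above
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)) + 1.0,
               rng.normal(0, 1, (100, 2)) - 1.0])
centres, labels = kmeans(X, k=2)
print(centres)   # roughly (+1, +1) and (-1, -1)
```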
CHARACTERISTICS OF A GOOD CLUSTER
• Efficiency of the clustering algorithm
• Ability to handle missing data in the dataset
• Ability to handle noisy and outlier data
• Ability to handle different attribute types
• Scale-invariance
• Ability to obtain good clusters on all attribute values/methods
• Consistency
CLUSTER EFFICIENCY MEASURES
• Cluster cohesion: a measure of how similar the elements in a cluster are to each other.
• Cluster separation: a measure of how distinct a cluster is from other clusters.
METRICS OF CLUSTER EVALUATION
• Purity: (sum of the majority elements of all clusters) / (total number of elements).
• Precision and recall: measure the extent to which a class is present in a cluster.
• Similarity-based measures: computed from the contingency table, e.g.
  Jaccard coefficient = A / (A + B + C)
  Rand coefficient = (A + D) / (A + B + C + D)
  where A, B, C, and D are the entries of the contingency table.
REFERENCES
• Digital Image Processing, S. Sridhar
• Digital Image Processing, Gonzalez, Woods & Eddins
• MathWorks: http://www.mathworks.in

END