A Report
on
CLASSIFICATION
Submitted by
Sri Harsha Allamraju
ID : 23
Table of Contents
1. Definition of Classification
2. Examples for Classification
3. Classification Techniques
   3.1 Regression
   3.2 Distance
       3.2.1 K Nearest Neighbor
   3.3 Decision Trees
       3.3.1 ID3
       3.3.2 C4.5
       3.3.3 CART
4. References
1. Definition of Classification
Data mining is a new technology with great potential for extracting hidden predictive
information from large databases. Data mining tools help predict future trends and
behaviors, which allows businesses to make proactive, knowledge-driven decisions.
Classification is the process of dividing a dataset into mutually exclusive groups such
that the members of each group are as “close” as possible to one another and different
groups are as “far” as possible from one another, where distance is measured with respect
to the specific variable(s) you are trying to predict.
Classification is used to predict group membership for data instances.
It is a procedure in which individual items are placed into groups based on quantitative
information on one or more characteristics inherent in the items and based on a
training set of previously labeled items.
For example:
A typical classification problem is to divide a database of companies into groups that
are as homogeneous as possible with respect to a creditworthiness variable with values
“good” and “bad”.
A more formal way of defining classification is as follows:
“Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, C2, ..., Cm}, the
classification problem is to define a mapping f: D -> C where each ti is assigned to one
class.”
2. Examples for Classification
● A teacher classifying students' grades as A, B, C, D or F.
● Identifying mushrooms as poisonous or edible.
● Predicting when a river will flood.
● Identifying individuals who are credit risks.
● Speech recognition.
● Pattern recognition.
In the case of grading, the teacher might decide to give an A+ to students with a total
score of 95 or above, an A to students scoring between 90 and 95, and so on. By doing this,
the teacher is classifying or segregating students on the basis of total score. This is a
day-to-day example of classification.
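As a rough illustration (not from the original report), the following function assigns
letter grades by thresholding the total score. Only the A+/A boundaries come from the
text above; the remaining cut-offs are hypothetical.

def classify_grade(total_score):
    """Assign a letter grade by comparing a score against fixed thresholds.

    The A+/A boundaries follow the example in the text; the remaining
    cut-offs are hypothetical and only illustrate classification by
    thresholding a single variable.
    """
    if total_score >= 95:
        return "A+"
    elif total_score >= 90:
        return "A"
    elif total_score >= 80:
        return "B"
    elif total_score >= 70:
        return "C"
    elif total_score >= 60:
        return "D"
    else:
        return "F"

print(classify_grade(97))  # A+
print(classify_grade(72))  # C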
3. Classification techniques
There are various classification techniques, namely regression, distance, decision trees,
rules and neural networks.
3.1 Regression
Regression is one of the oldest and most well-known statistical techniques used by the
data mining community. It assumes that the data fit a pre-defined function.
The general form of regression is
    y = c0 + c1x1 + ... + cnxn + e
where e is the error term.
Regression can be linear, non-linear, logistic, and so on.
How can we apply regression techniques to classification? Regression can be used to
divide the feature space into regions, and these regions represent the classes.
In regression, future values are predicted based on past values; linear regression
assumes that a linear relationship exists.
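The following is a rough sketch, not part of the original report: it fits a logistic
regression model with scikit-learn (an assumed dependency) on an invented two-feature
dataset and uses the fitted linear boundary to assign class labels.

# A minimal sketch, assuming scikit-learn is available; the tiny dataset
# below is invented purely to illustrate regression-based classification.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two numeric features per item, labels 0 ("bad") / 1 ("good").
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [6.0, 9.0], [1.2, 0.8], [5.5, 8.5]])
y = np.array([0, 0, 1, 1, 0, 1])

model = LogisticRegression().fit(X, y)

# The fitted coefficients define the linear boundary c0 + c1*x1 + c2*x2 = 0
# that divides the feature space into class regions.
print(model.intercept_, model.coef_)
print(model.predict([[4.8, 7.9]]))  # predicted class for a new item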
3.2 Classification using distance
In this classification technique, items are placed in the class to which they are closest.
One must determine the distance between an item and a class. The algorithm commonly used
is KNN (K-nearest neighbor).
3.2.1. K Nearest Neighbor
In this classification technique, the training set includes the class label of each item.
To classify a new item, examine the K nearest items; the new item is placed in the class
with the most close items.
An object is classified by a majority vote of its neighbors, with the object being assigned
the class most common amongst its k nearest neighbors. k is a positive integer,
typically small. If k = 1, then the object is simply assigned the class of its nearest
neighbor. In binary (two class) classification problems, it is helpful to choose k to be an
odd number as this avoids difficulties with tied votes.
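As a rough sketch (not part of the original report), the following implements the
majority-vote procedure just described in plain Python; the tiny one-feature training
set mirrors the style of the heights example below but its values are illustrative only.

# A minimal K-nearest-neighbor sketch, assuming numeric feature vectors.
import math
from collections import Counter

def knn_classify(training, new_item, k=3):
    """Classify new_item by majority vote of its k nearest training items.

    training: list of (feature_vector, class_label) pairs.
    new_item: feature vector to classify.
    """
    # Euclidean distance from new_item to every training item.
    distances = [(math.dist(features, new_item), label)
                 for features, label in training]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]

training = [([1.6], "Short"), ([1.9], "Medium"), ([1.88], "Medium"),
            ([2.1], "Tall"), ([2.2], "Tall"), ([1.7], "Short")]
print(knn_classify(training, [1.85], k=3))  # -> "Medium"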
3.3 Decision Trees
A decision tree is a flowchart-like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test, and leaf
nodes represent classes or class distributions. A typical example is a decision tree
indicating whether or not a customer will buy a computer.
However, there can be different decision trees for the same dataset. This gives rise to
a couple of issues concerning the generation of decision trees, namely:
● Choosing the splitting attribute.
● Ordering of splitting attributes.
● Splits.
● Tree structure.
● Stopping criteria.
● Training data.
● Pruning.
The various algorithms used for building decision trees are ID3, C4.5 and CART.
3.3.1. ID3
ID3 computes the information gain for each attribute that can be tested, branches on the
attribute that has the highest gain, and repeats this on each branch until no further
improvement is possible, i.e. the leaf is fully classified.
Example:
Name     Gender  Height  Output1  Output2
Kristi   F       1.6     Short    Medium
Jim      M       2       Tall     Medium
Maggie   F       1.9     Medium   Tall
Matha    F       1.88    Medium   Tall
Stephny  F       1.7     Short    Medium
Bob      M       1.85    Medium   Medium
Kathy    F       1.6     Short    Medium
Dave     M       1.7     Short    Medium
Worth    M       2.2     Tall     Tall
Steven   M       2.1     Tall     Tall
Debbie   F       1.8     Medium   Medium
Todd     M       1.95    Medium   Medium
Kim      F       1.9     Medium   Tall
Amy      F       1.8     Medium   Medium
Wynette  F       1.75    Medium   Medium
From the data, with the Output1 classification, 4/15 are short, 8/15 are medium and 3/15
are tall.
Entropy of the starting set
= (4/15)log(15/4) + (8/15)log(15/8) + (3/15)log(15/3)
= 0.4384
(base-10 logarithms are used throughout this example).
Choosing gender, there are nine tuples that are F and six that are M. The entropy of the
F subset is
(3/9)log(9/3) + (6/9)log(9/6) = 0.2764
whereas for the M subset it is
(1/6)log(6/1) + (2/6)log(6/2) + (3/6)log(6/3) = 0.4392
We then calculate the weighted sum of these two entropies:
(9/15)(0.2764) + (6/15)(0.4392) = 0.34152
Gain in entropy using the gender attribute
= 0.4384 – 0.34152 = 0.09688
For the height attribute, we split the values into ranges and calculate the entropy of
each range. Only one of the resulting ranges contains tuples from more than one class
(one Tall and one Medium), so it alone contributes non-zero entropy, giving a gain of
0.4384 – (2/15)(0.301) = 0.3983 for the height attribute.
Since ID3 chooses the attribute with the higher gain, and height has the higher gain
here, the tree is first split based on the height attribute.
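As a rough sketch (not part of the original report), the following reproduces the
gender-gain calculation above in Python; base-10 logarithms are used to match the worked
numbers.

# Minimal sketch reproducing the ID3-style gain calculation for "gender"
# on the heights dataset; base-10 logs are used to match the worked numbers.
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, using base-10 logarithms."""
    total = len(labels)
    return sum((count / total) * math.log10(total / count)
               for count in Counter(labels).values())

# (gender, Output1) pairs from the table above.
data = [("F", "Short"), ("M", "Tall"), ("F", "Medium"), ("F", "Medium"),
        ("F", "Short"), ("M", "Medium"), ("F", "Short"), ("M", "Short"),
        ("M", "Tall"), ("M", "Tall"), ("F", "Medium"), ("M", "Medium"),
        ("F", "Medium"), ("F", "Medium"), ("F", "Medium")]

labels = [label for _, label in data]
start = entropy(labels)                        # ~0.4384

weighted = 0.0
for value in ("F", "M"):
    subset = [label for gender, label in data if gender == value]
    weighted += (len(subset) / len(data)) * entropy(subset)

print(start, weighted, start - weighted)       # gain ~0.0969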
3.3.2. C4.5
C4.5 is an algorithm developed by Ross Quinlan for generating decision trees. It is an
extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can
be used for classification, and for this reason C4.5 is often referred to as a statistical
classifier.
C4.5 is an improved version of ID3. When the decision tree is built, missing data are
ignored. To classify a record with a missing attribute value, the value for that attribute
can be predicted based on what is known about the attribute values of the other records.
The basic idea is to divide the data into ranges based on the attribute values found in
the training sample.
C4.5 uses Gain Ratio instead of Gain.
Looking back at the “Heights” dataset, to calculate the gain ratio for the gender split,
we first calculate the entropy associated with the split itself:
H(9/15, 6/15) = (9/15)log(15/9) + (6/15)log(15/6) = 0.292
Gain ratio for gender = 0.09688/0.292 = 0.332
Similarly, we calculate the gain ratio for height and split on the attribute having the
largest gain ratio.
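As a rough continuation of the earlier sketch (again an illustration, not the report's
own code), the gain ratio simply divides the gain by the entropy of the split sizes:

# Minimal gain-ratio sketch; the gain and split sizes are taken from the
# worked gender example above, and base-10 logs match those numbers.
import math

def split_entropy(sizes):
    """Entropy of the split itself, e.g. H(9/15, 6/15) for the gender split."""
    total = sum(sizes)
    return sum((n / total) * math.log10(total / n) for n in sizes)

gain_gender = 0.09688                 # gain computed in the ID3 example
h_split = split_entropy([9, 6])       # ~0.292
print(gain_gender / h_split)          # gain ratio ~0.332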
3.3.3. CART
CART, which stands for Classification and Regression Trees, creates binary trees (only
two children are created at each split), and splitting is performed at the best split
point.
The formula used to choose the split point s for node t (as given in Dunham) is
Phi(s|t) = 2 * P_L * P_R * sum over classes j of |P(C_j | t_L) – P(C_j | t_R)|
where L and R indicate the left and right subtrees of the current node, and P_L and P_R
are the probabilities that a tuple in the training set will be on the left or right side
of the tree, defined as (tuples in subtree)/(tuples in training set).
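As a rough sketch, under the assumption that the goodness measure above is the one
intended (it matches the symbols described in the text), the following evaluates a few
candidate split points on the height attribute from the earlier table; the candidate
thresholds themselves are illustrative.

# Minimal sketch of CART-style split-point evaluation using the goodness
# measure Phi(s|t) = 2 * P_L * P_R * sum_j |P(C_j|t_L) - P(C_j|t_R)|.
# The heights/Output1 data come from the table above; candidate split
# points are illustrative.
from collections import Counter

def class_probs(labels, classes):
    counts = Counter(labels)
    return [counts[c] / len(labels) if labels else 0.0 for c in classes]

def goodness_of_split(heights, labels, split):
    classes = sorted(set(labels))
    left = [lab for h, lab in zip(heights, labels) if h <= split]
    right = [lab for h, lab in zip(heights, labels) if h > split]
    p_l, p_r = len(left) / len(labels), len(right) / len(labels)
    diff = sum(abs(pl - pr) for pl, pr in
               zip(class_probs(left, classes), class_probs(right, classes)))
    return 2 * p_l * p_r * diff

heights = [1.6, 2, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 2.2, 2.1, 1.8, 1.95, 1.9, 1.8, 1.75]
labels = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short", "Short",
          "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]

for split in (1.7, 1.8, 1.9, 2.0):
    print(split, round(goodness_of_split(heights, labels, split), 4))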
4. References
Data Mining: Introductory and Advanced Topics by Margaret H. Dunham (both
textbook and notes)
Wikipedia
Data Mining lecture notes of Prof. Christopher Hazard of NCSU.
http://databases.about.com/od/datamining/g/classification.htm
http://en.wikipedia.org/wiki/Statistical_classification