A Report on CLASSIFICATION

Submitted by Sri Harsha Allamraju
ID: 23

Table of Contents
1. Definition of Classification
2. Examples of Classification
3. Classification Techniques
   3.1 Regression
   3.2 Distance
       3.2.1 K Nearest Neighbor
   3.3 Decision Trees
       3.3.1 ID3
       3.3.2 C4.5
       3.3.3 CART
4. References

1. Definition of Classification

Data mining is a technology with great potential for extracting hidden predictive information from large databases. Data mining tools help predict future trends and behaviors, which helps businesses make proactive, knowledge-driven decisions.

Classification is the process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another and different groups are as "far" as possible from one another, where distance is measured with respect to the specific variable(s) being predicted. Classification is used to predict group membership for data instances: it is a procedure in which individual items are placed into groups based on quantitative information about one or more characteristics inherent in the items, guided by a training set of previously labeled items. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with the values "good" and "bad".

A more formal definition of classification is as follows: "Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, C2, ..., Cm}, the classification problem is to define a mapping f: D -> C where each ti is assigned to one class."

2. Examples of Classification

● A teacher classifying students' grades as A, B, C, D, or F.
● Identifying mushrooms as poisonous or edible.
● Predicting when a river will flood.
● Identifying individuals who are credit risks.
● Speech recognition.
● Pattern recognition.
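The formal mapping f: D -> C can be sketched in a few lines of Python. The grade thresholds below are invented for illustration (the report does not fix them until the grading example later); the point is only that every item receives exactly one class.

```python
def classify_grade(score):
    """A mapping f: D -> C where D is a set of total scores and
    C = {A, B, C, D, F}. Thresholds here are hypothetical."""
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    elif score >= 60:
        return "D"
    return "F"

# Each ti in D is assigned to exactly one class in C.
scores = [95, 83, 71, 64, 40]
print([classify_grade(s) for s in scores])  # -> ['A', 'B', 'C', 'D', 'F']
```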
In the case of grading, the teacher might decide to give an A+ to students with a total score of 95 or above, an A to students with scores between 90 and 95, and so on. By doing this, the teacher is classifying, or segregating, students on the basis of total score. This is a day-to-day example of classification.

3. Classification Techniques

There are various classification techniques, namely regression, distance, decision trees, rules, and neural networks.

3.1 Regression

Regression is one of the oldest and most well-known statistical techniques used by the data mining community. It assumes that the data fit a pre-defined function. The general form of regression is

y = c0 + c1x1 + ... + cnxn + e

where e is the error term. Regression can be linear, non-linear, logistic, etc. How can regression techniques be applied to classification? Regression can be used to divide the data space into regions, and these regions represent the classes. In regression, future values are predicted based on past values; linear regression assumes that a linear relationship exists.

3.2 Classification Using Distance

In this technique, items are placed in the class to which they are closest. One must therefore define the distance between an item and a class. The most common algorithm of this kind is KNN (K Nearest Neighbor).

3.2.1 K Nearest Neighbor

In this technique, the training set includes the class labels. To classify a new item, examine the K items nearest to it; the new item is placed in the class containing the most of these close items. In other words, an object is classified by a majority vote of its neighbors, with the object being assigned the class most common amongst its k nearest neighbors. Here k is a positive integer, typically small. If k = 1, the object is simply assigned the class of its nearest neighbor. In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes.
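The KNN procedure described above can be sketched as a short Python function. The toy training set below is invented (loosely inspired by the height data used later in the report); the algorithm itself, distance plus majority vote, is exactly the one described.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Classify `query` by majority vote of its k nearest training points.
    `training` is a list of (feature_vector, label) pairs."""
    def dist(a, b):
        # Euclidean distance between two feature vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Sort the training items by distance to the query and keep the k nearest.
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical one-feature training set: (height,) -> class.
train = [((1.6,), "Short"), ((1.7,), "Short"), ((1.8,), "Medium"),
         ((1.9,), "Medium"), ((2.0,), "Tall"), ((2.1,), "Tall")]
print(knn_classify(train, (1.62,), k=3))  # -> Short
```

With k = 3 the two "Short" neighbors outvote the one "Medium" neighbor, which is why odd k is recommended for two-class problems: a majority always exists.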
3.3 Decision Trees

A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or class distribution. For example, a decision tree may indicate whether or not a customer will buy a computer. However, there can be different decision trees for the same dataset. This gives rise to several issues concerning the generation of decision trees, namely:

● Choosing the splitting attribute.
● Ordering of splitting attributes.
● Splits.
● Tree structure.
● Stopping criteria.
● Training data.
● Pruning.

The main algorithms used for computing decision trees are ID3, C4.5, and CART.

3.3.1 ID3

ID3 computes the information gain for each attribute that can be tested and branches on the attribute with the highest gain. This is repeated on each branch until there is no further improvement and each leaf is fully classified.

Example ("Heights" dataset):

Name      Gender  Height  Output1  Output2
Kristi    F       1.6     Short    Medium
Jim       M       2.0     Tall     Medium
Maggie    F       1.9     Medium   Tall
Matha     F       1.88    Medium   Tall
Stephny   F       1.7     Short    Medium
Bob       M       1.85    Medium   Medium
Kathy     F       1.6     Short    Medium
Dave      M       1.7     Short    Medium
Worth     M       2.2     Tall     Tall
Steven    M       2.1     Tall     Tall
Debbie    F       1.8     Medium   Medium
Todd      M       1.95    Medium   Medium
Kim       F       1.9     Medium   Tall
Amy       F       1.8     Medium   Medium
Wynette   F       1.75    Medium   Medium

From the data, with the Output1 classification, 4/15 are short, 8/15 are medium, and 3/15 are tall.

Entropy of the starting set = (4/15)log(15/4) + (8/15)log(15/8) + (3/15)log(15/3) = 0.4384

Choosing gender, there are nine tuples that are F and six that are M. The entropy of the F subset is

(3/9)log(9/3) + (6/9)log(9/6) = 0.2764

whereas for the M subset it is

(1/6)log(6/1) + (2/6)log(6/2) + (3/6)log(6/3) = 0.4392

We then calculate the weighted sum of these two entropies:

(9/15)(0.2764) + (6/15)(0.4392) = 0.34152

Gain in entropy using the gender attribute = 0.4384 – 0.34152 = 0.09688

In the case of height, we split the values into ranges and calculate the entropy of each range.
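The entropy and gain arithmetic above can be checked with a few lines of Python. Note the report uses base-10 logarithms (entropy is often defined with base 2, but the numbers here only match log10); small differences in the last digits come from rounding intermediate values.

```python
import math

def entropy(counts):
    """Base-10 entropy of a class-count distribution, as used in the report."""
    n = sum(counts)
    return sum((c / n) * math.log10(n / c) for c in counts if c > 0)

H_all = entropy([4, 8, 3])      # Short/Medium/Tall over all 15 tuples
H_f = entropy([3, 6])           # class counts among the 9 female tuples
H_m = entropy([1, 2, 3])        # class counts among the 6 male tuples

# Weighted entropy after splitting on gender, then the information gain.
H_split = (9 / 15) * H_f + (6 / 15) * H_m
gain_gender = H_all - H_split

# The report rounds these to 0.4384, 0.2764, 0.4392, 0.34152, and 0.09688.
print(H_all, H_f, H_m, H_split, gain_gender)
```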
The gain in entropy for the height attribute is 0.4384 – (2/15)(0.301) = 0.3983. Since ID3 chooses the attribute with the higher gain, which in this case is the height attribute, the tree is first split based on height.

3.3.2 C4.5

C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan. It is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier.

C4.5 is an improved version of ID3. When the decision tree is built, missing data are ignored. To classify a record with a missing attribute value, the value for that attribute can be predicted based on what is known about the attribute values of the other records. The basic idea is to divide the data into ranges based on the attribute values found in the training sample.

C4.5 uses gain ratio instead of gain. Looking back at the "Heights" dataset, to calculate the gain ratio for the gender split, we first calculate the entropy associated with the split itself:

H(9/15, 6/15) = (9/15)log(15/9) + (6/15)log(15/6) = 0.292

Gain ratio for gender = 0.09688 / 0.292 = 0.332

Similarly, we calculate the gain ratio for height and split based on the attribute having the largest gain ratio.

3.3.3 CART

CART, which stands for Classification And Regression Trees, uses entropy and creates binary trees (only two children are created per node). Splitting is performed based on the best split point. A formula is used to choose the best split point s for node t, where L and R indicate the left and right subtrees of the current node, and PL and PR are the probabilities that a tuple in the training set will be on the left or right side of the tree, defined as (tuples in subtree) / (tuples in training set).

4. References

● Data Mining: Introductory and Advanced Topics by Margaret H. Dunham (both textbook and notes).
● Wikipedia.
● Data Mining lecture notes of Prof. Christopher Hazard of NCSU.
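The C4.5 gain-ratio computation above is a one-step extension of the ID3 numbers: divide the gender gain by the entropy of the split itself (the "split information"). A small check, again using the report's base-10 logarithms:

```python
import math

def entropy(counts):
    """Base-10 entropy of a count distribution."""
    n = sum(counts)
    return sum((c / n) * math.log10(n / c) for c in counts if c > 0)

# Split information for the gender split: 9 of 15 tuples are F, 6 are M.
split_info = entropy([9, 6])          # the report rounds this to 0.292

# Information gain for gender from the ID3 step, as computed in the report.
gain_gender = 0.09688

gain_ratio = gain_gender / split_info # the report rounds this to 0.332
print(split_info, gain_ratio)
```

Dividing by the split information penalizes attributes with many values, which is the motivation for preferring gain ratio over plain gain in C4.5.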
● http://databases.about.com/od/datamining/g/classification.htm
● http://en.wikipedia.org/wiki/Statistical_classification