DATA MINING: CLASSIFICATION

Classification: Definition
Classification is a supervised learning task. It uses a training set in which each record carries the correct answer (a class label attribute). A model is built by running the algorithm on the training data and is then tested; if accuracy is low, the model is regenerated after changing features or reconsidering the samples. The final model assigns a class label to new, incoming data.

Applications:
- Classifying credit card transactions as legitimate or fraudulent.
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil.
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification: A Two-Step Process
1. Model construction: describing a set of predetermined classes. Each sample is assumed to belong to a predefined class, as determined by the class label attribute. The set of samples used for model construction is the training set. The model is represented as classification rules, a decision tree, or a mathematical formula.
2. Model usage: classifying future or unknown objects. First estimate the accuracy of the model: the known label of each test sample is compared with the label predicted by the model, and the accuracy rate is the percentage of test-set samples correctly classified. The test set must be independent of the training set. If the accuracy is acceptable, use the model to classify data samples whose class labels are not known.

Classification Process (1): Model Construction
Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (model) learned from this data:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction
Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      yes
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
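The two-step process above can be sketched in a few lines of Python, using the tenure example: step 1's output is the learned rule, and step 2 measures accuracy on the held-out test set before classifying the unseen record. (The `classify` function is just the rule from the slide, written as code.)

```python
def classify(rank, years):
    """The rule learned in step 1: IF rank='professor' OR years>6 THEN 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2a: estimate accuracy on a test set with known labels
# (name, rank, years, tenured)
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "yes"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(classify(rank, years) == label
              for _, rank, years, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.0%}")   # prints "accuracy = 100%"

# Step 2b: accuracy is acceptable, so classify the unseen record
print(classify("Professor", 4))       # prints "yes"
```

Note that the test set is disjoint from the training set, as the slides require; measuring accuracy on the training data itself would be overly optimistic.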
Classification Techniques
- Decision tree based methods
- Rule-based methods
- Neural networks
- Bayesian classification
- Support vector machines

Algorithm for Decision Tree Induction
Basic algorithm: the tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (continuous-valued attributes are discretized in advance).
- Examples are partitioned recursively based on selected attributes.

Example: Training Dataset

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Output: A Decision Tree for "buys_computer"

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31…40  -> yes
  >40    -> credit_rating?
              excellent -> no
              fair      -> yes

Advantages of Decision Tree Based Classification
- Inexpensive to construct.
- Extremely fast at classifying unknown records.
- Easy to interpret for small-sized trees.
- Accuracy is comparable to other classification techniques for many simple data sets.
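The slides state that examples are "partitioned recursively based on selected attributes" but do not show how an attribute is selected. A common criterion (a sketch, not necessarily the one the original lecture intended) is information gain: the reduction in class entropy achieved by splitting on an attribute. The snippet below computes the gain of splitting on "age" for the 14-record buys_computer dataset above, which is why "age" ends up at the root of the tree shown.

```python
from math import log2
from collections import Counter

# (age, buys_computer) pairs for the 14 training records above
data = [
    ("<=30", "no"), ("<=30", "no"), ("31…40", "yes"), (">40", "yes"),
    (">40", "yes"), (">40", "no"), ("31…40", "yes"), ("<=30", "no"),
    ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31…40", "yes"),
    ("31…40", "yes"), (">40", "no"),
]

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

labels = [label for _, label in data]
base = entropy(labels)            # 9 yes / 5 no -> about 0.940 bits

# Expected entropy after splitting on age, weighted by partition size
partitions = {}
for age, label in data:
    partitions.setdefault(age, []).append(label)
after = sum(len(p) / len(data) * entropy(p) for p in partitions.values())

gain = base - after
print(f"Gain(age) = {gain:.3f}")  # about 0.247 bits
```

The basic induction algorithm computes this gain for every candidate attribute, splits on the one with the highest gain, and then recurses on each partition.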
Enhancements to Basic Decision Tree Induction
- Allow continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication.

Potential Problem: Overfitting
Overfitting occurs when the generated model does not generalize to new incoming data. Causes include:
- Training data that is too small and does not cover enough cases.
- Wrong assumptions built into the model.
Overfitting results in decision trees that are more complex than necessary. The training error then no longer provides a good estimate of how well the tree will perform on previously unseen records, so new ways of estimating error are needed.

How to Avoid Overfitting
Two ways to avoid overfitting are pre-pruning and post-pruning.
- Pre-pruning: stop the algorithm before it grows a full tree. Stop if all instances belong to the same class, or if the number of instances is less than some user-specified threshold.
- Post-pruning: grow the decision tree to its entirety, then trim its nodes in a bottom-up fashion. If the generalization error improves after trimming, replace the sub-tree by a leaf node whose class label is the majority class of the instances in that sub-tree.

Bayesian Classification
Let X be a data sample whose class label is unknown, and let H be the hypothesis that X belongs to class C. For classification, we determine P(H|X): the probability that the hypothesis holds given the observed data sample X. By Bayes' theorem, P(H|X) = P(X|H) P(H) / P(X), where:
- P(H): prior probability of hypothesis H (i.e.
the initial probability before we observe any data; it reflects our background knowledge).
- P(X): probability that the sample data is observed.
- P(X|H): probability of observing sample X, given that the hypothesis holds.

Training Dataset for Bayesian Classification
Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'.

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Advantages & Disadvantages of Bayesian Classification
Advantages:
- Easy to implement.
- Good results obtained in most cases.
Disadvantages:
- The class-conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables. E.g., in hospital patient data, profile attributes (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.) are interdependent; such dependencies cannot be modeled by a naive Bayesian classifier.

Conclusion
- Training data is an important factor in building a model with supervised algorithms.
- The classification results generated by each of the algorithms (Naïve Bayes, decision trees, neural networks, …) are not considerably different from each other.
- Different classification algorithms can take different amounts of time to train and build models.
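Classifying the sample X above comes down to comparing P(X|yes)P(yes) against P(X|no)P(no), with P(X|class) approximated as a product of per-attribute conditional probabilities under the independence assumption. A minimal sketch (not from the original slides) on the 14-record table:

```python
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

def score(sample, cls):
    """P(cls) * product over attributes of P(attr_i = sample_i | cls),
    assuming class-conditional independence (the naive Bayes assumption)."""
    rows = [r for r in data if r[-1] == cls]
    p = len(rows) / len(data)                      # prior P(cls)
    for i, value in enumerate(sample):
        p *= sum(r[i] == value for r in rows) / len(rows)
    return p

X = ("<=30", "medium", "yes", "fair")
scores = {cls: score(X, cls) for cls in ("yes", "no")}
print(scores)                          # 'yes' ~ 0.028, 'no' ~ 0.007
print(max(scores, key=scores.get))     # prints "yes"
```

Since 0.028 > 0.007, the classifier predicts buys_computer = 'yes' for X. (A production implementation would also apply smoothing so that an unseen attribute value does not zero out the whole product.)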
- Mechanical (automated) classification is much faster than manual classification.

Thank you!