4、Classification and Prediction (6hrs)
4.1 What is classification? What is prediction?
4.2 Issues regarding classification and prediction
4.3 Classification by decision tree induction
4.4 Bayesian classification
4.5 Classification by back propagation
4.6 Support Vector Machines (SVM)
4.7 Prediction
4.8 Accuracy and error measures
4.9 Model selection
4.10 Summary
Key Points: Definitions; Bayesian Classification; Decision Trees
Notes: More details on related algorithms.
Q&A:
1. What is the difference between classification and prediction?
Answer:
Classification is the process of finding a set of models (or functions) that describe
and distinguish data classes or concepts, for the purpose of being able to use the model
to predict the class of objects whose class label is unknown. In prediction, rather than
predicting class labels, the main interest (usually) is missing or unavailable data values.
(Han & Kamber)
So, although classification is actually the step of finding the models, the goal of
both methods is to predict something about unknown data objects. The difference is
that in classification that “something” is the class of objects, whereas in prediction it
is the missing data values.
2. Briefly outline the major steps of decision tree classification.
Answer:
The major steps are as follows:
• The tree starts as a single root node containing all of the training tuples.
• If the tuples are all from the same class, then the node becomes a leaf, labeled with that class.
• Else, an attribute selection method is called to determine the splitting criterion. Such a method may use a heuristic or statistical measure (e.g., information gain or Gini index) to select the "best" way to separate the tuples into individual classes. The splitting criterion consists of a splitting attribute and may also indicate either a split-point or a splitting subset, as described below.
• Next, the node is labeled with the splitting criterion, which serves as a test at the node. A branch is grown from the node to each of the outcomes of the splitting criterion, and the tuples are partitioned accordingly.
• The algorithm recurses to create a decision tree for the tuples at each partition.
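To make these steps concrete, here is a minimal sketch of the recursion for categorical attributes, assuming information gain as the attribute selection measure; the function and variable names (e.g. build_tree) are illustrative only, not from the text.

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a list of class labels.
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        # Expected reduction in entropy when splitting on a categorical attribute.
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[attr], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    def build_tree(rows, labels, attrs):
        # Leaf: all tuples are from the same class.
        if len(set(labels)) == 1:
            return labels[0]
        # Leaf: no attributes left to split on; label with the majority class.
        if not attrs:
            return Counter(labels).most_common(1)[0][0]
        # Attribute selection: choose the "best" splitting attribute.
        best = max(attrs, key=lambda a: info_gain(rows, labels, a))
        # Partition the tuples and grow one branch per outcome.
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[best], ([], []))
            parts[row[best]][0].append(row)
            parts[row[best]][1].append(y)
        rest = [a for a in attrs if a != best]
        return {"attr": best,
                "branches": {v: build_tree(r, l, rest) for v, (r, l) in parts.items()}}

    rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
            {"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"}]
    labels = ["play", "stay", "play", "play"]
    print(build_tree(rows, labels, ["outlook", "windy"]))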
3. Why is tree pruning useful in decision tree induction? What is
a drawback of using a separate set of tuples to evaluate pruning?
Answer:
The decision tree built may overfit the training data. There could be too many branches, some of which may reflect anomalies in the training data due to noise or outliers. Tree pruning addresses this issue of overfitting the data by removing the least reliable branches (using statistical measures). This generally results in a more compact and reliable decision tree that is faster and more accurate in its classification of data.
The drawback of using a separate set of tuples to evaluate pruning is that it may not be representative of the training tuples used to create the original decision tree. If the separate set of tuples is skewed, then using it to evaluate the pruned tree would not be a good indicator of the pruned tree's classification accuracy. Furthermore, using a separate set of tuples to evaluate pruning means there are fewer tuples to use for creation and testing of the tree. While this is considered a drawback in machine learning, it may not be so in data mining due to the availability of larger data sets.
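As one concrete illustration of this trade-off, the sketch below uses scikit-learn's cost-complexity pruning (an assumed choice; the question does not prescribe a pruning method) together with a held-out set of tuples to decide how far to prune, which leaves fewer tuples for growing the tree.

    # Post-pruning evaluated on a separate set of tuples. Cost-complexity pruning
    # in scikit-learn is used here only as one concrete pruning method.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    # Holding out an evaluation set leaves fewer tuples for growing the tree
    # (the drawback discussed above), and the held-out set may not be representative.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    # Candidate pruning strengths: larger alpha removes more branches.
    alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

    best_alpha, best_score = 0.0, 0.0
    for alpha in alphas:
        pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
        score = pruned.score(X_val, y_val)   # evaluate the pruning on the separate tuples
        if score >= best_score:
            best_alpha, best_score = alpha, score

    print(best_alpha, best_score)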
4. Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What advantage does (a) have over (b)?
Answer:
If pruning a subtree, we would remove the subtree completely with method (b). However, with method (a), if pruning a rule, we may remove any precondition of it. The latter is less restrictive.
5. It is important to calculate the worst-case computational complexity of the decision tree algorithm. Given data set D, the number of attributes n, and the number of training tuples |D|, show that the computational cost of growing a tree is at most n × |D| × log(|D|).
Answer:
The worst-case scenario occurs when we have to use as many attributes as possible before being able to classify each group of tuples. The maximum depth of the tree is log(|D|). At each level we will have to compute the attribute selection measure O(n) times (one per attribute). The total number of tuples on each level is |D| (adding over all the partitions). Thus, the computation per level of the tree is O(n × |D|). Summing over all of the levels we obtain O(n × |D| × log(|D|)).
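The same count can be written as a single summation over the levels of the tree (O(n) selection-measure evaluations per level, |D| tuples per level):

    \text{cost} \;=\; \sum_{\ell=1}^{\log|D|} O(n)\cdot|D| \;=\; O\!\left(n \times |D| \times \log|D|\right)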
6. Given a 5 GB data set with 50 attributes (each containing 100
distinct values) and 512 MB of main memory in your
laptop,
outline an efficient method that constructs decision trees in
such large data sets.
Justify your answer by a rough calculation of your main memory usage.
Answer:
We will use the RainForest algorithm for this problem.
Assume there are C class labels. The most memory required will be for the AVC-set of the root of the tree. To compute the AVC-set of the root node, we scan the database once and construct the AVC-list for each of the 50 attributes. The size of each AVC-list is 100 × C. The total size of the AVC-set for the root is then 100 × C × 50, which will easily fit into 512 MB of memory for a reasonable C. The computation of the other AVC-sets is done in a similar way, but they will be smaller because there will be fewer attributes available. To reduce the number of scans, we can compute the AVC-sets for nodes at the same level of the tree in parallel. With such small AVC-sets per node, we can probably fit the whole level in memory.
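The rough memory calculation can be spelled out as below; the number of class labels C and the counter size are assumptions chosen purely for illustration.

    # Rough memory estimate for the root's AVC-set in RainForest.
    # C and the 4-byte counter size are assumed; the data set fixes the rest.
    n_attributes = 50
    distinct_values = 100
    C = 10                      # assumed number of class labels
    bytes_per_counter = 4       # assumed size of one count

    avc_list_entries = distinct_values * C              # one AVC-list per attribute: 100 x C
    avc_set_entries = avc_list_entries * n_attributes   # AVC-set of the root: 100 x C x 50
    avc_set_bytes = avc_set_entries * bytes_per_counter

    print(avc_set_entries)               # 50000 counters
    print(avc_set_bytes / 2**20, "MB")   # about 0.19 MB, far below 512 MB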
7. Compare the advantages and disadvantages of eager classification (e.g., decision tree, Bayesian, neural network) versus lazy classification (e.g., k-nearest neighbor, case-based reasoning).
Answer:
Eager classification is faster at classification than lazy classification because it
constructs a generalization model before receiving any new tuples to classify.
Weights can be assigned to attributes, which can improve classification accuracy.
Disadvantages of eager classification are that it must commit to a single hypothesis
that covers the entire instance space, which can decrease classification accuracy, and
that more time is needed for training.
Lazy classification uses a richer hypothesis space, which can improve
classification accuracy.
It requires less time for training than eager classification.
A disadvantage of lazy classification is that all training tuples need to be stored,
which leads to expensive storage costs and requires efficient indexing techniques.
Another disadvantage is that it is slower at classification because classifiers are not
built until new tuples need to
be classified.
Furthermore, attributes are all
equally weighted, which can decrease classification accuracy.
(Problems may arise
due to irrelevant attributes in the data.)
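A small timing sketch of this speed trade-off, using scikit-learn classifiers purely as stand-ins for "eager" and "lazy" (the exact numbers will vary):

    # Eager classifiers pay at training time, lazy classifiers at classification time.
    import time
    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
    X_train, y_train, X_new = X[:15000], y[:15000], X[15000:]

    for name, clf in [("eager (decision tree)", DecisionTreeClassifier()),
                      ("lazy (k-nearest neighbor)", KNeighborsClassifier())]:
        t0 = time.time()
        clf.fit(X_train, y_train)   # eager: generalization model built here; lazy: tuples stored
        t1 = time.time()
        clf.predict(X_new)          # eager: cheap tree traversal; lazy: work deferred to here
        t2 = time.time()
        print(f"{name}: training {t1 - t0:.3f}s, classification {t2 - t1:.3f}s")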
8. What is association-based classification? Why is association-based
classification able to achieve higher classification accuracy than a
classical decision-tree method? Explain how association-based
classification can be used for text document classification.
Answer:
Association-based classification is a method where association rules are generated and analyzed for use in classification. We first search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels. Using such strong associations, we classify new examples.
Association-based classification can achieve higher accuracy than a classical decision tree because it overcomes the constraint of decision trees, which consider only one attribute at a time, and because it uses very high-confidence rules that combine multiple attributes.
For text document classification, we can model each document as a transaction containing items that correspond to terms. (We can preprocess the data to do stemming and remove stop words.) We also add the document class to the transaction. We then find frequent patterns and output rules of the form term1, term2, ..., termk → classi [sup = 0.1, conf = 0.9]. When a new document arrives for classification, we can apply the rule with the highest support and confidence that matches the document, or apply a combination of rules as in CMAR.
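A toy sketch of the idea on a few invented documents is given below; the brute-force mining of size-1 and size-2 term sets and the thresholds are illustrative only (a real system would use an Apriori or FP-growth style miner and CMAR-style rule combination).

    # Toy association-based text classification: mine (term-set -> class) rules
    # above support/confidence thresholds, then classify a new document by the
    # highest-confidence matching rule.
    from itertools import combinations
    from collections import Counter

    docs = [({"cheap", "viagra", "offer"}, "spam"),
            ({"meeting", "schedule", "offer"}, "ham"),
            ({"cheap", "offer", "click"}, "spam"),
            ({"project", "meeting", "notes"}, "ham")]

    MIN_SUP, MIN_CONF = 0.25, 0.8
    n = len(docs)

    rules = []
    for k in (1, 2):   # frequent term-sets of size 1 and 2 (toy setting)
        counts, class_counts = Counter(), Counter()
        for terms, label in docs:
            for itemset in combinations(sorted(terms), k):
                counts[itemset] += 1
                class_counts[(itemset, label)] += 1
        for (itemset, label), c in class_counts.items():
            sup, conf = c / n, c / counts[itemset]
            if sup >= MIN_SUP and conf >= MIN_CONF:
                rules.append((set(itemset), label, sup, conf))

    def classify(new_doc_terms):
        # Apply the matching rule with the highest confidence (then support).
        matches = [r for r in rules if r[0] <= new_doc_terms]
        return max(matches, key=lambda r: (r[3], r[2]))[1] if matches else None

    print(classify({"cheap", "offer", "now"}))   # -> "spam"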
9. The support vector machine (SVM) is a highly accurate classification method. However, SVM classifiers suffer from slow processing when training with a large set of data tuples. Discuss how to overcome this difficulty and develop a scalable SVM algorithm for efficient SVM classification in large datasets.
Answer:
We can use the micro-clustering technique described in "Classifying large data sets using SVM with hierarchical clusters" by Yu, Yang, and Han, in Proc. 2003 ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'03), pages 306-315, Aug. 2003 [YYH03]. A Cluster-Based SVM (CB-SVM) method is described as follows:
1. Construct the microclusters using a CF-tree (Chapter 7).
2. Train an SVM on the centroids of the microclusters.
3. Decluster entries near the boundary.
4. Repeat the SVM training with the additional entries.
5. Repeat the above until convergence.
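A simplified sketch of the idea (not the exact CB-SVM algorithm): scikit-learn's Birch stands in for the CF-tree, and a single decluster-and-retrain pass replaces the repeat-until-convergence loop.

    # Simplified sketch of the CB-SVM idea only, not the published algorithm.
    import numpy as np
    from sklearn.cluster import Birch
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=20000, n_features=10, random_state=0)

    # 1. Build micro-clusters for each class (Birch plays the role of the CF-tree).
    centroids, centroid_labels = [], []
    for label in np.unique(y):
        birch = Birch(threshold=1.5, n_clusters=None).fit(X[y == label])
        centroids.append(birch.subcluster_centers_)
        centroid_labels.append(np.full(len(birch.subcluster_centers_), label))
    centroids = np.vstack(centroids)
    centroid_labels = np.concatenate(centroid_labels)

    # 2. Train an SVM on the micro-cluster centroids only.
    svm = SVC(kernel="linear").fit(centroids, centroid_labels)

    # 3. "Decluster": recover the original tuples that fall near the current boundary.
    near_boundary = np.abs(svm.decision_function(X)) < 1.0

    # 4. Repeat the SVM training with those additional entries.
    svm = SVC(kernel="linear").fit(
        np.vstack([centroids, X[near_boundary]]),
        np.concatenate([centroid_labels, y[near_boundary]]))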
10. What is boosting ? State why it may improve the accuracy of
decision tree induction.
Answer:
Boosting is a technique used to help improve classifier accuracy. We are given a set S of s tuples. For iteration t, where t = 1, 2, . . . , T, a training set St is sampled with replacement from S. Assign weights to the tuples within that training set. Create a classifier, Ct, from St. After Ct is created, update the weights of the tuples so that the tuples causing classification error will have a greater probability of being selected for the next classifier constructed, Ct+1. This will help improve the accuracy of the next classifier, Ct+1. Using this technique, each classifier should have greater accuracy than its predecessor.
The final boosting classifier combines the votes of each individual
classifier, where the weight of each classifier’s vote is a function
of its accuracy.
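A minimal sketch of this loop is given below; the particular weight-update rule follows AdaBoost and is an assumed choice, since the description above does not fix one.

    # Minimal boosting loop: sample with replacement by tuple weights, train a
    # weak classifier, then increase the weights of misclassified tuples.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, random_state=0)
    s, T = len(X), 10
    weights = np.full(s, 1.0 / s)            # tuple weights over S
    classifiers, alphas = [], []

    rng = np.random.default_rng(0)
    for t in range(T):
        # Sample S_t with replacement from S according to the tuple weights.
        idx = rng.choice(s, size=s, replace=True, p=weights)
        clf = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        err = np.average(clf.predict(X) != y, weights=weights)
        if err >= 0.5:                       # no better than chance; try another sample
            continue
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
        # Tuples causing classification error get a greater chance of being
        # selected for the next classifier C_{t+1}.
        weights = weights * np.exp(alpha * (clf.predict(X) != y))
        weights = weights / weights.sum()
        classifiers.append(clf)
        alphas.append(alpha)

    def boosted_predict(X_new):
        # Final classifier: weighted vote, each vote weighted as a function of accuracy.
        votes = sum(a * np.where(c.predict(X_new) == 1, 1, -1)
                    for c, a in zip(classifiers, alphas))
        return (votes > 0).astype(int)

    print((boosted_predict(X) == y).mean())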