Classification and Prediction
CS 5331, Rattikorn Hewett, Texas Tech University

Outline
- Overview: Tasks, Evaluation, Issues
- Classification techniques: Decision Tree, Bayesian-based, Neural network, Others
- Prediction techniques
- Summary

Tasks
- Classification & Prediction:
  - describe data with respect to a target class/output
  - create, from a data set, a model that describes a target class in terms of features (attributes, variables) of the class or of the system generating the data
  - use the model to classify an unknown target class or to predict an unknown output
- Example applications: credit approval, medical diagnosis, treatment effectiveness analysis
- What about image recognition or customer profiling?

Classification vs. Prediction
- Labeled data -> algorithm -> model; data with unknown class label or output -> model -> class label or predicted output
- Comparison:

                     Classification                    Prediction
  Data               labeled (by known target class)   labeled (by known output)
  Model              classifier/classification model   predictive model
  Use of the model   classify an unknown label         predict an unknown output
  Unknown values     categorical                       continuous/ordered

Three-step process
- Model construction (training/learning): build a model that describes training data labeled with known classes/outputs; a model can take various forms: rules, decision trees, formulae
- Model evaluation (testing): use a labeled testing data set to estimate model accuracy; the known label of each test sample is compared with the model's classification; accuracy = % of test cases correctly classified by the model
- Model usage (classification): use a model with acceptable accuracy to classify or predict unknowns

Example
- Training data:

  inst  Credit history  Debt  Collateral  Income  Risk
  1     Bad             H     None        L       H
  2     U               H     None        M       H
  3     U               L     None        M       M
  4     U               L     None        L       H
  5     U               L     None        H       L
  6     U               L     Ok          H       L
  7     Bad             L     None        L       H
  8     Bad             L     Ok          H       M
  9     Good            L     None        H       L
  10    Good            H     Ok          H       L
  11    Good            H     None        L       H
  12    Good            H     None        M       M
  13    Good            H     None        H       L
  14    Bad             H     None        M       H

- Model construction: a classification algorithm yields a classifier (model), e.g.:
  1. If income = L then Risk = H
  2. If income = H and Credit history = U then Risk = L
  ...
- Model evaluation on a labeled test set (Accuracy = ?):

  inst  Credit history  Debt  Collateral  Income  Risk
  1     Bad             H     Ok          L       H
  2     Good            L     None        M       H
  3     U               L     None        M       M
  4     U               M     None        L       H
  5     U               L     None        H       L

- Model usage on data with unknown risk:

  inst  Credit history  Debt  Collateral  Income  Risk
  1     U               H     Ok          L       ?
  2     Bad             L     None        M       ?

Machine learning context
- Training: an algorithm builds a model from the labeled training data
- Testing: the model is evaluated on the testing data; if it is not sufficiently accurate, training continues
- Usage: the accepted model (a classifier or predictive model) maps input data with unknown class label or output to a class label or predicted output

Supervised vs. Unsupervised
- Supervised learning:
  - class labels and the number of classes are known
  - labeled data: the training data (observations, measurements, features, etc.) are accompanied by labels indicating the class of each observation
  - goal: describe a class/concept in terms of features/observations -> classification/prediction
- Unsupervised learning:
  - class labels are unknown and the number of classes may not be known
  - unlabeled data: the training data have no class labels
  - goal: establish the existence of classes, concepts, or clusters in the data -> clustering

Terms
- Machine learning aims to abstract regularities from a given data set; mining ~ learning ~ induction (reasoning from specifics to the general)
- A training set consists of training instances; the terms instance, tuple (row), example, sample, and object are used synonymously
- A testing set is independent of the training set; if the training set is also used for testing during learning, the resulting model will overfit the training data: it may incorporate anomalies of the training data that are not present in the overall sample population, giving a model that does not represent the population (poor generalization)
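The three-step process can be traced end to end in code. The sketch below is a minimal illustration, not the lecture's algorithm: it hard-codes the two example rules from the model-construction step, and the fallback label "M" for rows no rule covers is an assumption.

```python
# Minimal sketch of the three-step process: construct a rule-based model,
# evaluate it on the labeled test set, then use it on unlabeled data.

def classify(credit, debt, collateral, income):
    if income == "L":                      # rule 1: If income = L then Risk = H
        return "H"
    if income == "H" and credit == "U":    # rule 2: If income = H and Credit history = U then Risk = L
        return "L"
    return "M"                             # assumed fallback for uncovered cases

# Labeled test set from the example: (credit, debt, collateral, income, risk)
test_set = [
    ("Bad",  "H", "Ok",   "L", "H"),
    ("Good", "L", "None", "M", "H"),
    ("U",    "L", "None", "M", "M"),
    ("U",    "M", "None", "L", "H"),
    ("U",    "L", "None", "H", "L"),
]

# Model evaluation: accuracy = % of test cases correctly classified
correct = sum(classify(*row[:4]) == row[4] for row in test_set)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.0%}")

# Model usage: classify the two instances with unknown risk
print(classify("U", "H", "Ok", "L"))      # rule 1 fires -> "H"
print(classify("Bad", "L", "None", "M"))  # no rule fires -> assumed "M"
```

With this particular fallback the toy model misses one of the five test cases, which shows how the evaluation step answers the slide's "Accuracy = ?" question for a given model.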
Evaluation
- A resulting model (classifier) is tested against a testing data set to measure its accuracy, or its error rate = # misclassified test cases / # test cases (in %)
- Experimental methods: holdout, random subsampling, cross validation, bootstrapping

Holdout method
- Randomly partition the data set into two independent sets, a training set and a test set (typically 2/3 and 1/3)
- The training set is used to derive a classifier, whose accuracy is estimated on the test set
- Use with data sets that have a large number of samples

Random subsampling
- Repeat the holdout method k times; overall accuracy = average of the accuracies over the iterations
- What happens when the data require supervised discretization (e.g., Fayyad & Irani's algorithm)?

Cross validation
- k-fold cross validation:
  - divide the data set into k approximately equal subsets of samples
  - use k-1 subsets as the training set and the remaining subset as the test set
  - obtain a classifier/accuracy estimate from each of the k training/testing pairs; average the accuracies

Cross validation & bootstrapping
- Stratified cross validation: the folds are stratified so that the class distribution of the samples in each fold is approximately the same as that in the initial data
- Stratified 10-fold cross validation is recommended when the sample set is of medium size
- When the sample set is very small -> bootstrapping: leave-one-out; sample the training set with replacement

Improving accuracy
- Use a committee of classifiers and combine the results of each classifier
- Two basic techniques: bagging and boosting

Bagging
- Generate independent training sets by random sampling with replacement from the original sample set
- Construct a classifier for each training set using the same classification algorithm
- To classify an unknown sample x, each classifier returns its class prediction, which counts as one vote
- The bagged classifier C* counts the votes and assigns x to the class with the most votes

Boosting
- Assign a weight to each training sample; a series of classifiers is learned
- After classifier Ct is learned, calculate the error and re-weight the examples: the weights are updated so that the subsequent classifier Ct+1 "pays more attention" to the examples Ct misclassified
- The final boosted classifier C* combines the votes of the classifiers, where the weight of each classifier's vote is a function of its accuracy on the training set

Issues on classification/prediction
- How do we compare different techniques?
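The bagging scheme described above (bootstrap samples, one classifier each, majority vote) can be sketched with the credit-risk data. The base learner here, a simple "one-rule" stump that predicts the majority class per value of a single attribute, is an illustrative choice of mine; the slides do not fix a base algorithm.

```python
import random
from collections import Counter

# Credit-risk training data: (credit history, debt, collateral, income) -> risk
data = [
    (("Bad","H","None","L"),"H"), (("U","H","None","M"),"H"),
    (("U","L","None","M"),"M"),   (("U","L","None","L"),"H"),
    (("U","L","None","H"),"L"),   (("U","L","Ok","H"),"L"),
    (("Bad","L","None","L"),"H"), (("Bad","L","Ok","H"),"M"),
    (("Good","L","None","H"),"L"),(("Good","H","Ok","H"),"L"),
    (("Good","H","None","L"),"H"),(("Good","H","None","M"),"M"),
    (("Good","H","None","H"),"L"),(("Bad","H","None","M"),"H"),
]

def fit_stump(sample):
    """Pick the attribute whose per-value majority class best fits the sample."""
    best = None
    for a in range(4):
        by_value = {}
        for x, y in sample:
            by_value.setdefault(x[a], Counter())[y] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        acc = sum(rule[x[a]] == y for x, y in sample)
        if best is None or acc > best[0]:
            best = (acc, a, rule)
    _, a, rule = best
    default = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: rule.get(x[a], default)

random.seed(0)
# Each classifier is trained on a bootstrap sample (drawn with replacement)
classifiers = [fit_stump(random.choices(data, k=len(data))) for _ in range(11)]

def bagged_predict(x):
    votes = Counter(c(x) for c in classifiers)   # one vote per classifier
    return votes.most_common(1)[0][0]            # class with the most votes

print(bagged_predict(("U", "L", "None", "H")))
```

Boosting would differ only in that the samples are drawn (or weighted) according to the errors of the previous classifier, and the final vote is weighted by each classifier's training accuracy.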
- No single method works well for all data sets
- Some evaluation criteria:
  - Accuracy: indicates generalization power
  - Speed: time to train/learn the model, and time to use the model
  - Robustness: tolerance to noise and incorrect data
  - Scalability: acceptable run time to learn as data size and complexity grow
  - Interpretability: comprehensibility of, and insight from, the resulting model
- Any alternatives to the accuracy measure?
  - Sensitivity = t-pos/pos
  - Specificity = t-neg/neg
  - Accuracy = sensitivity * (pos/(pos+neg)) + specificity * (neg/(pos+neg)) = (t-pos + t-neg)/(pos+neg)
  - where pos = # positive samples, neg = # negative samples, t-pos = # true positives, t-neg = # true negatives, f-pos = # false positives, f-neg = # false negatives
  - Confusion matrix (e.g., cancer diagnosis):

                      Predicted cancer   Predicted ~cancer
    Actual cancer     t-pos              f-neg
    Actual ~cancer    f-pos              t-neg

Input/Output
- Input: the credit-risk training data set (the 14 instances with attributes Credit history, Debt, Collateral, Income and class Risk shown earlier)
- Output: a decision tree model, e.g.:

  Income = L -> Risk H
  Income = M -> Credit history:
      U    -> Debt: H -> Risk H; L -> Risk M
      Bad  -> Risk H
      Good -> Risk M
  Income = H -> Credit history:
      U    -> Risk L
      Bad  -> Risk M
      Good -> Risk L

- Do the tree and the data set cover the same information?
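The identity between accuracy and the prior-weighted combination of sensitivity and specificity can be checked numerically. The confusion-matrix counts below are made-up illustration values, not data from the lecture.

```python
# Illustrative confusion-matrix counts (made-up numbers)
t_pos, f_neg = 90, 10    # actual positives: pos = t_pos + f_neg
t_neg, f_pos = 150, 50   # actual negatives: neg = t_neg + f_pos

pos = t_pos + f_neg
neg = t_neg + f_pos

sensitivity = t_pos / pos                    # true positive rate
specificity = t_neg / neg                    # true negative rate
accuracy = (t_pos + t_neg) / (pos + neg)

# Accuracy is the class-prior-weighted combination of sensitivity and specificity
combined = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
assert abs(accuracy - combined) < 1e-12

print(sensitivity, specificity, accuracy)    # 0.9 0.75 0.8
```

Note that with these counts accuracy (0.8) hides the fact that the classifier is noticeably worse on the negative class (specificity 0.75) than on the positive class (sensitivity 0.9), which is why the alternative measures matter.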
Decision Tree (DT) Classification
- Typically has two phases:
  - Tree construction (induction): initially, all training examples are at the root; recursively partition the examples based on selected attributes
  - Tree pruning: remove branches that may reflect noise/outliers in the data; gives faster or more accurate classification

Decision Tree Induction: Basic Algorithm
- Data are categorical
- The tree starts as a single node representing all the data
- Recursively:
  - select the split-attribute that best separates the sample classes, and branch on its values
  - if all samples at a node belong to the same class -> the node becomes a leaf labeled with that class
  - if no attributes remain on which the samples may be partitioned, or there are no samples for an attribute-value branch -> create a leaf labeled with the majority class of the samples

Example (credit-risk data, S = {1, ..., 14})
- Splitting the root on Income partitions S into L = {1,4,7,11}, M = {2,3,12,14}, H = {5,6,8,9,10,13}
- All samples in the Income = L branch have Risk = H -> leaf "Risk H"
- Splitting the Income = H branch on Credit history gives U = {5,6} -> Risk L; Bad = {8} -> Risk M; Good = {9,10,13} -> Risk L
- Splitting the Income = M branch on Credit history gives U = {2,3}; Bad = {14} -> Risk H; Good = {12} -> Risk M
- Splitting {2,3} on Debt gives H = {2} -> Risk H; L = {3} -> Risk M

Selecting the split-attribute
- Based on a measure called a goodness function; different algorithms use different goodness functions:
  - Information gain (e.g., in ID3/C4.5 [Quinlan, 94]): assumes categorical attribute values (modifiable to continuous-valued attributes); choose the attribute with the highest information gain
  - Gini index (e.g., in CART, IBM IntelligentMiner): assumes continuous-valued attributes (modifiable to categorical attributes); assumes several possible split values for each attribute (may need clustering to find them); choose the attribute with the smallest gini index
  - Others: gain ratio, distance-based measures, etc.

Information gain
- Choose the attribute with the highest information gain as the test attribute for the current node
- Highest information gain = greatest entropy reduction -> least randomness in the sample partitions -> minimum information needed to classify the sample partitions
- Let S be a set of labeled samples. The information needed to classify a given sample is

    Ent(S) = - SUM_{i in C} p_i log2(p_i)

  where p_i is the probability that a sample from S is in class i; here p_i = |C_i|/|S|, where C_i is the set of samples from S whose class label is i (by convention, 0*log2(0) = 0)
- The expected information needed to classify a sample if S is partitioned into subsets by attribute A is

    I(A) = SUM_{i in dom(A)} (|S_i|/|S|) * Ent(S_i)

  where S_i is the partition of S whose A-value is i, dom(A) is the set of all possible values of A, and |S_i|/|S| is the probability that a sample from S has A-value i
- Information gain: Gain(A) = Ent(S) - I(A)

Example: find the attribute A with the highest Gain(A) on S
- S = {1, 2, ..., 14}, each number representing a sample row; C = {H Risk, M Risk, L Risk} with members {1,2,4,7,11,14}, {3,8,12}, and {5,6,9,10,13}
- Ent(S) = -(6/14)log2(6/14) - (3/14)log2(3/14) - (5/14)log2(5/14) = 1.531
- For Gain(Income): dom(Income) = {H, M, L} with partitions {5,6,8,9,10,13}, {2,3,12,14}, {1,4,7,11}
  - Ent(H income) = Ent({5,6,8,9,10,13}) = -0*log2(0) - (1/6)log2(1/6) - (5/6)log2(5/6) = 0.65
  - Ent(M income) = Ent({2,3,12,14}) = -(2/4)log2(2/4) - (2/4)log2(2/4) - 0*log2(0) = 1
  - Ent(L income) = Ent({1,4,7,11}) = -(4/4)log2(4/4) - 0*log2(0) - 0*log2(0) = 0
  - I(Income) = (6/14)(0.65) + (4/14)(1) + (4/14)(0) = 0.564
  - Gain(Income) = Ent(S) - I(Income) = 1.531 - 0.564 = 0.967
- Similarly, Gain(Credit history) = 0.266, Gain(Debt) = 0.063, Gain(Collateral) = 0.207 -> select Income as the root
- Recurse on S = {2,3,12,14} to pick the next attribute (Credit history), and on S = {5,6,8,9,10,13} to pick the next attribute (Credit history)

Gini index
- For a data set T containing examples from n classes, define

    gini(T) = 1 - SUM_{j=1..n} p_j^2

  where p_j is the relative frequency of class j in T
- If a data set T with N examples is split into two subsets T1 and T2, the gini index of the split data is

    gini_split(T) = (|T1|/N) * gini(T1) + (|T2|/N) * gini(T2)

- The attribute providing the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible split points for each attribute)

DT induction algorithm characteristics
- Greedy search: makes the locally optimal choice at each step, selecting the "best" split-attribute for each tree node
- Top-down recursive divide-and-conquer: from root to leaves; split a node into several branches and, for each branch, recursively run the algorithm to build a subtree

DT construction algorithm: primary issues
- Split criterion: which goodness function selects the split-attribute (information gain, gini index, etc.)
- Branching scheme: how many sample partitions, i.e., binary branches (gini index) vs. multiple branches (information gain)
- Stopping condition: when to stop further splitting, i.e., fully-grown vs. early stopping
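The entropy and information-gain computations above can be reproduced directly with a short sketch over the 14-instance credit data (attribute order assumed: credit history, debt, collateral, income):

```python
from math import log2
from collections import Counter

# Credit-risk training data: (credit history, debt, collateral, income) -> risk
data = [
    (("Bad","H","None","L"),"H"), (("U","H","None","M"),"H"),
    (("U","L","None","M"),"M"),   (("U","L","None","L"),"H"),
    (("U","L","None","H"),"L"),   (("U","L","Ok","H"),"L"),
    (("Bad","L","None","L"),"H"), (("Bad","L","Ok","H"),"M"),
    (("Good","L","None","H"),"L"),(("Good","H","Ok","H"),"L"),
    (("Good","H","None","L"),"H"),(("Good","H","None","M"),"M"),
    (("Good","H","None","H"),"L"),(("Bad","H","None","M"),"H"),
]

def ent(labels):
    """Ent(S) = -sum_i p_i log2(p_i) over the class labels in S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(data, a):
    """Gain(A) = Ent(S) - sum_v (|S_v|/|S|) Ent(S_v) for attribute index a."""
    labels = [y for _, y in data]
    parts = {}
    for x, y in data:
        parts.setdefault(x[a], []).append(y)
    i_a = sum(len(p) / len(data) * ent(p) for p in parts.values())
    return ent(labels) - i_a

for a, name in enumerate(["credit history", "debt", "collateral", "income"]):
    print(f"Gain({name}) = {gain(data, a):.3f}")
# income has the highest gain (about 0.967), so it becomes the root
```

Replacing `ent` with `1 - sum(p_i**2)` and restricting to two-way partitions would turn the same skeleton into the gini-index criterion.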
- Labeling class: a node is labeled with the most common class among its samples

From trees to rules
- Represent the knowledge in the form of IF-THEN rules: create one rule for each path from the root to a leaf
  - each attribute-value pair along the path forms a conjunct
  - the leaf node holds the class prediction
- Rules are easier for humans to understand
- Examples (from the credit-risk tree):
    IF income = "M" AND credit-history = "U" AND debt = "H" THEN risk = "H"
    IF income = "M" AND credit-history = "Good" THEN risk = "M"

Overfitting & tree pruning
- A constructed tree may overfit the training examples due to noise or too small a set of training data -> poor accuracy on unseen samples
- Approaches to avoid overfitting:
  - Prepruning: stop growing the tree early; do not split a node if the split would make the goodness measure fall below a threshold (choosing an appropriate threshold is difficult)
  - Postpruning: remove branches from a "fully grown" tree; obtain a sequence of progressively pruned trees, then decide which is the "best pruned tree"
  - Combined: postpruning is more expensive than prepruning, but more effective

Postpruning
- Merge a subtree into a leaf node: if accuracy without splitting > accuracy with splitting, do not split; replace the subtree with a leaf labeled with the majority class
- Techniques:
  - Cost complexity pruning: based on expected error rates; for each non-leaf node, compute the expected error rate if the subtree were and were not pruned (error rates from the branches are combined, weighted by sample frequency ratios); requires an independent test sample to estimate the accuracy of each progressively pruned tree
  - Minimum description length pruning: based on the number of bits needed to encode the tree; the "best pruned tree" minimizes the number of encoding bits; no test sample set is required

Postpruning (cont.)
- The correct tree size can be determined by:
  - using a separate test set to evaluate the pruning (e.g., CART)
  - using all the data for training but applying a statistical test (e.g., chi-square) to evaluate the pruning (e.g., C4.5)
  - using the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized (e.g., SLIQ, SPRINT)

Tree induction enhancements
- Allow continuous-valued attributes: dynamically define a discrete value that partitions the continuous values into a discrete set of intervals (sort the continuous attribute values, identify adjacent values with different target classes, generate candidate thresholds midway between them, and select the one with maximum gain)
- Handle missing attribute values: assign the most common value, or a probability to each possible value
- Reduce tree fragmentation, test repetition, and subtree replication: attribute construction, i.e., create new attributes based on existing ones that are sparsely represented

Classification in large databases
- Classification is a classical problem extensively studied by statisticians and machine learning researchers
- Goal: classify data sets with millions of examples and hundreds of attributes at reasonable speed
- Decision tree induction seems to be a good candidate:
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple, easy-to-understand classification rules
  - can be used to generate SQL queries for accessing databases
  - comparable classification accuracy with other methods
- Primary issue: scalability; most algorithms assume the data fit in memory, otherwise data must be swapped in/out of main and cache memories

Scaling up decision tree methods
- Incremental tree construction [Quinlan, 86]: use partial data to build a tree; additional examples are tested, and misclassified examples are used to rebuild the tree interactively
- Data reduction [Catlett, 91] (still a main-memory algorithm): reduce data size by sampling and
discretization
- Data partition and merge [Chan and Stolfo, 91]: build a tree for each partition of the data, then merge the trees into one tree; but the resulting accuracy is reported to be reduced

Recent efforts in data mining studies
- SLIQ (EDBT'96, Mehta et al.) & SPRINT (VLDB'96, J. Shafer et al.): presort disk-resident data sets (too large to fit in main memory); handle continuous-valued attributes; define new data structures to facilitate tree construction
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): a framework for scaling decision tree induction that separates scalability from quality criteria

SLIQ
- Tree building uses two data structures: attribute lists and a class list
- Each attribute has an associated attribute list, indexed by record ID (RID); each tuple is a linkage from each attribute-list entry to a class-list entry, which in turn points to a leaf node of the decision tree
- Example with two tuples, RID 1 = (credit-rate good, age 30, buy-car? yes) and RID 2 = (credit-rate fair, age 28, buy-car? no):
    attribute list for credit-rate: (good, RID 1), (fair, RID 2)
    attribute list for age (presorted): (28, RID 2), (30, RID 1)
    class list: RID 1 -> buy-car? yes, RID 2 -> buy-car? no, each entry pointing to its current tree node
- SLIQ's efficiency comes from the fact that:
  - only the class list and the current attribute list are memory-resident; the class list is dynamically modified during tree construction
  - attribute lists are disk-resident, presorted (reducing the cost of evaluating splits) and indexed (eliminating re-sorting of the data)
  - a fast subsetting algorithm determines split-attributes; if the number of possible subsets exceeds a threshold, greedy search is used
  - tree pruning uses an inexpensive MDL-based method

SPRINT
- Proposes a new attribute-list data structure that eliminates SLIQ's requirement of a memory-resident class list; each attribute-list entry carries the class label:
    attribute list for credit-rate: (good, yes, RID 1), (fair, no, RID 2)
    attribute list for age (presorted): (28, no, RID 2), (30, yes, RID 1)
- Outperforms SLIQ when the class list is too large to fit in main memory, but needs a hash tree to connect the different lists, which can be costly with a large training set
- Designed to ease parallelization

PUBLIC
- Integrates splitting and pruning
- Observation & idea: a large portion of the tree ends up being pruned; can a top-down approach predetermine this and stop growing such branches earlier?
- How: before expanding a node, compute a lower-bound estimate on the cost of the subtree rooted at the node; if the node is predicted to be pruned (based on the cost estimate), return it as a leaf, otherwise go on splitting it

RainForest
- A generic framework that separates the scalability aspects from the criteria that determine the quality of the tree; applies to any decision tree induction algorithm
- Maintains an AVC-set (attribute, value, class label) per node
- Reports a speedup over SPRINT

Data cube-based DT induction
- Integration of generalization with DT induction: once the tree is derived, use a concept hierarchy to generalize
  - generalization at low-level concepts -> large, bushy trees
  - generalization at high-level concepts -> lost interestingness
- Cube-based multi-level classification: relevance analysis at multiple levels

Visualization in classification
- DBMiner: presentation of classification rules
- SGI MineSet 3.0: decision tree visualization
- Interactive visual mining by Perception-Based Classification (PBC)

Bayesian classification
Basic features/advantages:
- Probabilistic classification: gives the explicit probability that a given example belongs to a certain class; can predict multiple hypotheses, weighted by their probabilities
- Probabilistic learning: each training example
can incrementally increase or decrease the probability of a hypothesis
- Probabilities are theoretically supported: they provide standard measures that facilitate decision-making and comparison with other methods

Bayesian-based approaches
- Basic approaches: Naive Bayes classification and Bayesian network learning; both are based on Bayes' theorem

Bayes' theorem
- X: an observed training example; H: the hypothesis that X belongs to class C
- Goal of classification: determine P(H|X), the posterior probability that H holds conditioned on X
- Bayes' theorem:

    P(H|X) = P(X|H) * P(H) / P(X)

- Informally: posterior = likelihood x prior / evidence, where the right-hand side can be estimated from data
- Practical issues: requires initial knowledge of many probabilities; significant computational cost

Naive Bayes classifier
- A data example X is an n-dimensional vector (n attributes)
- Naive assumption of class-conditional independence: for a given class label, attribute values are conditionally independent of one another (no dependence relationships between attributes):

    P(X|Ci) = PRODUCT_{k=1..n} P(x_k|Ci)

  where Ci is a class label
- The assumption reduces the computational cost: only class distributions need to be counted
- Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci)*P(Ci). Why, and how?
- Why: the classifier predicts the class of X to be the one that maximizes P(Ci|X) over all possible classes Ci; by Bayes' theorem,

    P(Ci|X) = P(X|Ci) * P(Ci) / P(X)

  and since P(X) is the same for all classes, only P(X|Ci)*P(Ci) needs to be maximized

Example
- X = (age = <=30, income = medium, student = yes, credit_rating = fair); training data:

  age    income  student  credit_rate  buys_car
  <=30   high    no       fair         no
  <=30   high    no       excellent    no
  31..40 high    no       fair         yes
  >40    medium  no       fair         yes
  >40    low     yes      fair         yes
  >40    low     yes      excellent    no
  31..40 low     yes      excellent    yes
  <=30   medium  no       fair         no
  <=30   low     yes      fair         yes
  >40    medium  yes      fair         yes
  <=30   medium  yes      excellent    yes
  31..40 medium  no       excellent    yes
  31..40 high    yes      fair         yes
  >40    medium  no       excellent    no

- P(X|buys_car="yes") = P(age=<=30|yes) P(income=medium|yes) P(student=yes|yes) P(credit_rating=fair|yes) = (2/9)(4/9)(6/9)(6/9) = 0.044
- P(X|buys_car="no") = (3/5)(2/5)(1/5)(2/5) = 0.019
- P(X|yes)*P(yes) = 0.044 * 9/14 = 0.028; P(X|no)*P(no) = 0.019 * 5/14 = 0.007
- X belongs to class buys_car = "yes"

Naive Bayes classifiers (cont.)
- P(x_k|Ci) for a continuous-valued attribute A_k can be estimated from a normal density function
- Advantages: easy to implement; good results obtained in most cases
- Disadvantages: loss of accuracy when the class-conditional independence assumption is violated; in practice, dependencies exist among variables, e.g., hospital patients: profile attributes (age, family history, etc.) and symptoms; such dependencies cannot be modeled by a naive Bayes classifier
- How to deal with these dependencies?
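The worked naive Bayes example can be reproduced with simple counting. A minimal sketch over the buys_car data, computing P(X|Ci)*P(Ci) for both classes:

```python
# buys_car training data: (age, income, student, credit_rating) -> buys_car
data = [
    (("<=30","high","no","fair"),"no"),        (("<=30","high","no","excellent"),"no"),
    (("31..40","high","no","fair"),"yes"),     ((">40","medium","no","fair"),"yes"),
    ((">40","low","yes","fair"),"yes"),        ((">40","low","yes","excellent"),"no"),
    (("31..40","low","yes","excellent"),"yes"),(("<=30","medium","no","fair"),"no"),
    (("<=30","low","yes","fair"),"yes"),       ((">40","medium","yes","fair"),"yes"),
    (("<=30","medium","yes","excellent"),"yes"),(("31..40","medium","no","excellent"),"yes"),
    (("31..40","high","yes","fair"),"yes"),    ((">40","medium","no","excellent"),"no"),
]

def score(x, c):
    """P(X|C)*P(C) under the class-conditional independence assumption."""
    rows = [r for r, y in data if y == c]
    prior = len(rows) / len(data)                   # P(C)
    likelihood = 1.0
    for k, v in enumerate(x):                       # product of P(x_k|C)
        likelihood *= sum(r[k] == v for r in rows) / len(rows)
    return likelihood * prior

x = ("<=30", "medium", "yes", "fair")
s_yes, s_no = score(x, "yes"), score(x, "no")
print(round(s_yes, 3), round(s_no, 3))   # 0.028 0.007, matching the example
print("predict:", "yes" if s_yes > s_no else "no")
```

A practical refinement not shown here is Laplace smoothing, which avoids zero likelihoods when an attribute value never occurs with a class in the training data.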
-> Bayesian belief networks

Bayesian networks
- A Bayesian belief network (Bayes net, Bayesian network) allows class-conditional dependencies to be represented
- A graphical model of dependency among the variables; gives a specification of the joint probability distribution
- Two components:
  - a DAG (directed acyclic graph): nodes are random variables, links represent dependency; e.g., X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P; the graph has no loops or cycles
  - CPTs (conditional probability tables) representing P(X | parents(X)) for each node X
- Example: a network over FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea; CPT of LungCancer:

          (FH,S)  (FH,~S)  (~FH,S)  (~FH,~S)
    LC    0.8     0.5      0.7      0.1
    ~LC   0.2     0.5      0.3      0.9

  e.g., P(LC = yes | FH = yes, S = no) = 0.5
- The joint probability of a tuple (z1, ..., zn) corresponding to variables Z1, ..., Zn is

    P(z1, ..., zn) = PRODUCT_{i=1..n} P(zi | Parents(Zi))

Classification with Bayes nets
- The classification process can return a probability distribution over the class attribute instead of a single class label
- Learning Bayes nets (note: causal learning/discovery is not the same as causal reasoning); output: a Bayes net representing regularities in the input data set
- Network structures may be known or unknown:
  - structure known + all variables observable: training amounts to learning the CPTs
  - structure known + some hidden variables: use gradient descent, analogous to neural network learning
  - structure unknown + all variables observable: search the model space to reconstruct the structure
  - structure unknown + all variables hidden: no good algorithms known
- Two basic approaches: Bayesian learning and constraint-based learning

Classification & function approximation
- Classification predicts categorical class labels
- Typical applications:
{credit history, salary} à Credit approval (Yes/No) {Temp, Humidity} àRain (Yes/No) Mathematically, n n x ∈ X = {0,1}n , y ∈ Y = {0,1} h: X →Y y = h( x ) We want to estimate function h 69 Linear Classification n x x x x x x x x x ooo o o o o x o o o o o o Binary Classification problem ¨ ¨ n The data above the line belongs to class ‘x’ The data below line belongs to class ‘o’ Discriminative Classifiers, e.g., ¨ ¨ SVM (support vector machine) Perceptron (Neural net approach) 70 35 Discriminative Classifiers n n Advantages ¨ prediction accuracy is generally high as compared to Bayesian methods – in general ¨ robust, works when training examples contain errors ¨ fast evaluation of the learned target function Bayesian networks are normally slow Criticism ¨ long training time ¨ difficult to understand the learned function (weights) Bayesian networks can be used easily for pattern discovery ¨ not easy to incorporate domain knowledge easy in the form of priors on the data or distributions 71 Neural Networks n Analogy to Biological Systems n Massive Parallelism allowing for computational efficiency n The first learning algorithm came in 1959 (Rosenblatt) who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change weights to learn to produce these outputs using the perceptron learning rule 72 36 A Neuron x0 w0 x1 w1 xn bias: acts as a threshold to θ-k f ∑ output y wn Input weight vector x vector w n vary the activity weighted sum Activation (Sigmoid) function The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping, e.g., n y = sign( ∑ wi xi + θ k ) i =0 73 Multilayer Neural Networks Input vector Input nodes x1 1 w14 Hidden nodes x2 2 3 w15 w34 4 5 w35 w46 Output nodes x3 Three-layer neural network • Feed-forward – no feedback • Fully connected w56 6 Output vector Multilayer feed-forward networks of linear threshold functions with enough hidden 
units can closely approximate any function

Defining a network structure
Before training, ...
- Determine the number of units in the input layer, hidden layer(s), and output layer
  - No clear rules exist on the "best" number of hidden-layer units
- Normalize input values to speed up learning
  - Continuous values → range between 0 and 1
  - Discrete values → Boolean values
- Accuracy depends on both the network topology and the initial weights

Training Neural Networks
- Goal
  - Find a set of weights that makes almost all the tuples in the training data classified correctly (or achieves acceptable error rates)
- Steps
  - Initialize weights with random values
  - Feed the input tuples into the network one by one
  - For each unit
    - Compute the net input to the unit as a linear combination of all the inputs to the unit
    - Compute the output value using the activation function
    - Compute the error
    - Update the weights and the bias
  - Repeat until a terminal condition is satisfied

Backpropagation Learning
~ searching using a gradient descent method
- Searches for a set of weights that minimizes the mean squared distance between the network's class prediction and the actual class label of the samples
- A learning rate helps avoid the local minimum problem
  - If the rate is too low → learning proceeds at a slow pace
  - If the rate is too high → oscillation between solutions
  - Rule of thumb: set it to 1/t, where t = number of iterations through the training set so far

Backpropagation (example network: inputs x1, x2, x3 at nodes 1-3, hidden nodes 4-5, output node 6, weights wij, biases θ4, θ5, θ6)
- Initialization
  - Weights w14, w15, w24, w25, w34, w35 and biases θ4, θ5, θ6: small random numbers (e.g., from -1 to 1)
- For each training example, e.g., x = (x1, x2, x3) = (1, 0, 1):
  - Propagate the input forward
    - Net input: Ij = Σi wij Oi + θj
    - Sigmoid activation function: Oj = 1 / (1 + e^(-Ij))
    - I4 = w14 O1 + w24 O2 + w34 O3 + θ4, where O1 = x1 = 1 (an input unit outputs its input value); I5 similarly
    - Compute O4 and O5; then I6 = w46 O4 + w56 O5 + θ6, and compute O6
  - Backpropagate the error
    - Output unit: Err6 = O6 (1 - O6)(T6 - O6), where T6 is the given target class label, here 1; in general, Errj = Oj (1 - Oj)(Tj - Oj)
    - Hidden units: Errj = Oj (1 - Oj) Σk Errk wjk, so Err5 = O5 (1 - O5)(Err6 w56) and Err4 = O4 (1 - O4)(Err6 w46)
  - Update weights & biases, with learning rate l:
    - wij = wij + (l) Errj Oi
    - θj = θj + (l) Errj
    - Case updating: update after each training example; Epoch updating: update after each epoch (one iteration through the training set)
- Repeat the training process (i.e., propagate forward, etc.) until
  - all the changes of wij in the previous epoch are below some specified threshold, or
  - the % of samples misclassified in the previous epoch is below a threshold, or
  - a pre-specified number of epochs has expired

Pruning and Rule Extraction
- A major drawback of neural-net learning is that its model is hard to interpret
- A fully connected network is hard to articulate → network pruning
  - Remove weighted links that do not decrease the classification accuracy of the network
- Extracting rules from a trained, pruned network
  - Discretize activation values: replace individual activation values by cluster averages while maintaining the network accuracy
  - Derive rules relating activation values and output, using the discretized activation values to enumerate the output
  - Similarly, find the rules relating the input and activation values
  - Combine the above two to obtain rules relating the output to the input

Outline
- Overview: Tasks, Evaluation, Issues
- Classification techniques: Decision Tree, Bayesian-based, Neural network, Others
- Prediction techniques
- Summary

Other Classification Methods
- Support vector machine
- Classification-based association
- K-nearest neighbor classifier
- Case-based reasoning
- Soft computing approaches
  - Genetic algorithm
  - Rough set approach
  - Fuzzy set approaches

Support Vector Machine (SVM)
- Basic idea
  - find the best boundary between classes
  - build a classifier on top of the boundary points
  - Boundary points are called support vectors
- Two basic types:
  - Linear SVM
  - Non-linear SVM

SVM
[figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors lie on the margins]

Optimal Hyperplane: separable case
- Hyperplane: x^T β + β0 = 0 (β a unit vector, β0 the bias)
- Class 1 and class 2 are separable
- The crossed points are support vectors - the points that determine the maximal margin between the two classes
- Given a training set of N pairs xi with labels yi, the goal is to maximize the margin C subject to the constraint:
  yi (xi^T β + β0) > C, i = 1, ..., N

Non-separable case
- When the data set is non-separable
  - assign a slack variable ξi to each support vector, changing the constraint to
    yi (xi^T β + β0) > C (1 - ξi),
    where ∀i, ξi > 0 and Σ i=1..N ξi < const.

General SVM
- When no good optimal linear classifier exists, can we do better? → use a non-linear boundary (circled points are the support vectors)
- Idea: map the original problem space into a larger space so that the boundary in the new space is linearly separable
  - Use a kernel (kernels do not always exist) to compute distances in the new space (which could be infinite-dimensional)

Example
- When the data is not linearly separable
  - Project the data to a high-dimensional space where it is linearly separable, and then use a linear SVM
[figure: 1-D points at -1, 0, +1 labeled +, -, + are not linearly separable; after projection to 2-D points (0,1), (0,0), (1,0) they are]

Performance of SVM
- For a general support vector machine,
  E[P(error)] ≤ E[(# of support vectors) / (# of training samples)]
- SVM has been very successful in lots of applications

SVM vs.
Neural Network
- SVM
  - Relatively new concept
  - Nice generalization properties
  - Hard to learn - learned in batch mode using quadratic programming techniques
  - Using kernels, can learn very complex functions
- Neural Network
  - Quite old
  - Generalizes well but doesn't have a strong mathematical foundation
  - Can easily be learned in an incremental fashion
  - To learn complex functions, use a multilayer perceptron (non-trivial)

Open problems of SVM
- How to choose an appropriate kernel function
  - Different kernels give different results, although generally better than hyperplanes
- For a very large training set, the set of support vectors might be large; speed thus becomes a bottleneck
- An optimal design for a multi-class SVM classifier

SVM Related Links
- http://svm.dcs.rhbnc.ac.uk/
- http://www.kernel-machines.org/
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
- SVMlight - software (in C): http://ais.gmd.de/~thorsten/svm_light

Other Classification Methods
- Support vector machine
- Classification-based association
- K-nearest neighbor classifier
- Case-based reasoning
- Soft computing approaches
  - Genetic algorithm
  - Rough set approach
  - Fuzzy set approaches

Association-Based Classification
- Several methods
  - ARCS: association rules + clustering (Lent et al. '97)
    - Beats C4.5 in (mainly) scalability and also accuracy
  - Associative classification (Liu et al. '98)
    - Mines high-support, high-confidence rules of the form "cond_set => y", where y is a class label
  - CAEP: Classification by Aggregating Emerging Patterns (Dong et al. '99)
    - Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
    - Mines EPs based on minimum support and growth rate

Instance-Based Methods
- Instance-based learning:
  - Delays building a generalization model until a new instance must be classified ("lazy evaluation" or "lazy learning")
  - Note: "eager learning" (e.g.,
decision tree) constructs a generalization model before classifying a new sample
  - Thus, lazy learning gives fast training but slow classification
- Examples of lazy learning:
  - k-nearest neighbor approach
  - Locally weighted regression
  - Case-based reasoning
- Requires efficient indexing techniques

k-nearest neighbor classifiers
- A training sample = a point in n-dimensional space
- When an unknown sample x is given for classification
  - Search for the k training samples (the k "nearest neighbors") that are closest to x, where "closeness" is measured by, e.g., Euclidean distance
  - For a discrete-valued class, x is assigned the most common class among its k nearest neighbors
  - For a continuous-valued class (e.g., in prediction), x is assigned the average of the continuous-valued class over the k nearest neighbors
- Potentially expensive when the number k of nearest neighbors is large

k-nearest neighbor classifiers (cont.)
- Distance-weighted nearest neighbor algorithm
  - Weight the contribution of each of the k neighbors according to their distance to the query point xq, e.g., w ≡ 1 / d(xq, xi)^2, giving greater weight to closer neighbors
  - Similarly for real-valued target functions
- Robust to noisy data by averaging over the k nearest neighbors
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes
  - To overcome it, stretch the axes or eliminate the least relevant attributes
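The distance-weighted k-NN rule above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation; the toy 2-D dataset and the function name are invented for the example.

```python
import math
from collections import defaultdict

def distance_weighted_knn(train, query, k=3):
    """Classify `query` from labeled points `train` = [(point, label), ...],
    weighting each of the k nearest neighbors by w = 1 / d(xq, xi)^2."""
    # Euclidean distance to every training point; keep the k closest
    nearest = sorted((math.dist(x, query), y) for x, y in train)[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        if d == 0:                  # query coincides with a training point
            return label
        votes[label] += 1.0 / d**2  # closer neighbors get larger weight
    return max(votes, key=votes.get)

# Toy 2-D dataset (invented for illustration): two well-separated clusters
train = [((0, 0), 'o'), ((0, 1), 'o'), ((1, 0), 'o'),
         ((5, 5), 'x'), ((5, 6), 'x'), ((6, 5), 'x')]
print(distance_weighted_knn(train, (1, 1), k=3))   # → 'o'
print(distance_weighted_knn(train, (5, 4), k=3))   # → 'x'
```

Note there is no training step at all, which is exactly the "lazy learning" point above: all work happens at classification time, so an efficient spatial index matters as the training set grows.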
Case-Based Reasoning (CBR)
- Similar to k-nearest neighbor
  - An instance-based learner (lazy learner) that analyzes similar samples
- But ... a sample is a "case" (a complex symbolic description), not a point in Euclidean space
- Idea: given a new case to classify
  - CBR searches for a training case that is identical → returns the stored solution
  - Otherwise, it finds cases with similar components → modifies the stored solutions into a solution for the new case
- Requires background knowledge and a tight coupling between case retrieval, knowledge-based reasoning, and problem solving
- Challenges:
  - Similarity metrics
  - Efficient techniques for indexing training cases and combining solutions

Genetic Algorithms (GA)
- Based on ideas from natural evolution
  - Create an initial population consisting of randomly generated rules, represented by bit strings
  - Generate rule offspring by applying genetic operators
    - Crossover: swap substrings from pairs of rules
    - Mutation: invert randomly selected bits
  - Form a new population containing the fittest rules
    - The fitness of a rule is measured by its classification accuracy on a set of training examples
  - Repeat the process until the population has "evolved" to satisfy a fitness threshold
- GA facilitates parallelization and is popular for solving optimization problems

Rough Set Approach
- Rough sets are used to "roughly" define equivalence classes in the training data set
- A rough set for a given class C is approximated by
  - a lower approximation (samples certain to be in C)
  - an upper approximation (samples that cannot be described as not belonging to C)
- Rough sets can also be used for relevance analysis
  - Finding the minimal subsets of attributes is NP-hard
[figure: rectangles ~ equivalence classes]

Fuzzy Set
- Fuzzy logic
  - uses truth values between 0.0 and 1.0 to represent the degree of membership (such as with a fuzzy membership graph)
- Fuzzy logic allows classification at a high level of abstraction
  - Attribute values are converted to fuzzy values, e.g., income is mapped
into the discrete categories {low, medium, high}, with the fuzzy values calculated
  - For a given new sample, more than one fuzzy value may apply
  - Each applicable rule contributes a vote for membership in the categories
  - Typically, the truth values for each predicted category are summed/combined

Outline
- Overview: Tasks, Evaluation, Issues
- Classification techniques: Decision Tree, Bayesian-based, Neural network, Others
- Prediction techniques
- Summary

What Is Prediction?
- Prediction is similar to classification
  - First construct a model, then use it to predict unknown values
  - Mostly deals with modeling of continuous-valued functions
- The major method for prediction is regression
  - Linear and multiple regression
  - Non-linear regression

Regression Models
- Linear regression: data are modeled using a straight line, e.g.,
  - Y = α + β X, where Y = response variable, X = predictor variable
  - Assumes the variance of Y to be constant
  - α and β are regression coefficients, estimated using the least-squares method (the errors between the actual data and the estimated line are minimized)
- Multiple regression: extends the linear model to more than one predictor variable, e.g., Y = b0 + b1 X1 + b2 X2
- Nonlinear regression: many nonlinear models can be transformed into linear ones
- Log-linear models: approximate discrete probability distributions

Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
- Classification is probably one of the most widely used data mining techniques, with a lot of extensions
- Scalability is still an important issue for database applications; thus, combining classification with database techniques should be a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.

References
- C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
- L. Breiman, J. Friedman, R.
Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
- P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
- U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994 AAAI Conf., pages 601-606, AAAI Press, 1994.
- J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.
- J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT -- Optimistic Decision Tree Construction. In SIGMOD'99, Philadelphia, Pennsylvania, 1999.

References (cont.)
- M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop Research Issues on Data Engineering (RIDE'97), Birmingham, England, April 1997.
- B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), New York, NY, August 1998.
- W. Li, J. Han, and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proc. 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA, November 2001.
- J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
- M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
References (cont.)
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
- S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1: 81-106, 1986.
- J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), pages 725-730, Portland, OR, August 1996.
- R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404-415, New York, NY, August 1998.
- J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555, Bombay, India, September 1996.
- S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.